Blog

LinkedIn’s Gobblin: An Open Source Framework for Gobbling Big Data with Ease

The engineering team at social media service LinkedIn first launched Gobblin in 2014 as a universal data ingestion framework for offline big data, running on Hadoop in MapReduce mode. As new capabilities were added to support a spectrum of execution environments and to scale across a broad range of data velocities, Gobblin quickly evolved from a single-purpose data ingestion tool into a robust data management ecosystem. Gobblin was made open source in mid-2015 and has since grown into a distributed data integration framework that simplifies the common aspects of big data, from ingestion to replication to organization, for complete lifecycle management across both streaming and batch environments.

Shortly after Gobblin’s second birthday, the team felt it was ready for the big time: joining other LinkedIn open source projects contributed to the Apache Software Foundation, including the Helix cluster management framework and the Kafka distributed streaming platform. Gobblin was accepted into the Apache Incubator in February 2017 and has spent the year since then completing the internal process. The final step, contributing the actual code, was recently completed, and Gobblin has now become an official Apache entity.

Prior to incubation, Gobblin was already being embraced beyond LinkedIn by companies like Apple and PayPal. Organizations such as CERN and Sandia National Laboratories, which consume and crunch simply unimaginable amounts of data — 1PB per second, in CERN’s case — have also adopted Gobblin to help conduct their research.

The New Stack spoke with Abhishek Tiwari, Staff Software Engineer at LinkedIn, about Gobblin’s journey.

Why Apache, and what exactly is involved in the incubation process?

Although Gobblin was already finding success with outside organizations, we believed that becoming an official project at one of the most influential open source organizations on the planet would ensure durability and self-sustenance, as well as access to a broader community that could help continue the evolution. Since starting the Apache incubation process early last year, we have already seen good progress on this front. Apache Gobblin community members have proposed, built, and started to spearhead a few critical developments, including Amazon Web Services mode enhancements and auto-scalability.

The first step in the process was the Gobblin incubation proposal, which was unanimously accepted by Apache. Then came working with mentors and champions to arrange the code donation and licensing (in coordination with the Microsoft legal team), and setting up the Apache infrastructure…all before officially getting incubated.

What factors drove the evolution of Gobblin?

The original idea for Gobblin was to reduce the amount of repeated engineering and operations work spent on data ingestion across the company, where teams were building and maintaining disparate pipelines for different data sources across batch and streaming ecosystems. At one point, we were running more than 15 different kinds of pipelines, each with its own idiosyncrasies around error modes, data quality capabilities, scaling, and performance characteristics.

Our guiding vision for Gobblin has been to build a framework that can support data movement across streaming and batch sources and sinks without requiring a specific persistence or storage technology. At LinkedIn, Gobblin is currently integrated with more than a dozen data sources including Salesforce, Google Analytics, Oracle, LinkedIn Espresso, MySQL, SQL Server, Apache Kafka, patent and publication sources, etc.

Over the years, we’ve made strides in fulfilling that vision, but we have also grown into adjacent capabilities like end-to-end data management — from raw ingestion to multi-format storage optimizations, fine-grained deletions for compliance, config management, and more. The key aspect differentiating Gobblin from other technologies in the space is that it is intentionally agnostic to the compute and storage environment, yet it can execute natively in many environments. So you don’t HAVE to run Hadoop or Kafka to be able to use Gobblin, though it can take advantage of these technologies if they are deployed in your company.

Is Gobblin mainly applicable to very large organizations munching serious data, or are there other use cases for different/smaller entities?

We see adoption across the board. Smaller entities often are not very vocal about their adoption because they’re too busy getting their startup off the ground. No matter the size, the common theme is that the business is data-driven, has multiple data sources and sinks, and has a Lambda architecture — both a streaming and a batch ecosystem. Some examples of small and medium-sized companies using Gobblin are Prezi, Trivago and NerdWallet.

 

read more at:   https://thenewstack.io/linkedins-gobblin-open-source-framework-gobbling-big-data-ease/

 

Knagato

Tendron Systems Ltd

Data Engineering Outline

The ADOS project monitored 30,000 sensors on a nuclear reactor, measuring temperatures, pressures and mass flows at discrete points throughout the cores and associated equipment (boilers, heat exchangers, condensers, etc.).

 

 

Challenges of Big Data Systems

When dealing with huge volumes of data derived from multiple independent sources, it is a significant undertaking to connect, link, match, clean and transform data across systems. It is nevertheless necessary to connect and correlate relationships, hierarchies and multiple data linkages, or your data can quickly spiral out of control. Data governance can help you determine how disparate data relates to common definitions and how to systematically integrate structured and unstructured data assets to produce high-quality information that is useful, appropriate and up-to-date.
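To make that concrete, the following is a minimal, hypothetical sketch of linking, cleaning and transforming records from two independent sources using PySpark DataFrames. The paths, layouts and column names are illustrative assumptions, not details from the article.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("link-and-clean").getOrCreate()

# Two independent sources: a CRM export (CSV) and a web-analytics feed (JSON).
# All paths and column names here are hypothetical.
crm = spark.read.csv("/data/crm/customers.csv", header=True)
web = spark.read.json("/data/web/events.json")

# Clean: normalise the join key and drop duplicate customer records.
crm_clean = (crm
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .dropDuplicates(["email"]))

# Link: correlate the two systems on the common key.
linked = crm_clean.join(web, on="email", how="left")

# Transform: publish a query-ready view of the combined data.
(linked
    .select("email", "segment", "event_type", "event_time")
    .write.mode("overwrite")
    .parquet("/data/warehouse/customer_events"))

Real pipelines add fuzzy matching, survivorship rules and quality checks on top of this, but the connect-clean-transform skeleton is the same.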

Big data technologies not only provide the infrastructure to collect large amounts of data, they provide the analytics to understand it and take advantage of its value. The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for optimized decision making.
Some examples:

  • Send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers.
  • Recalculate and revalue entire risk portfolios, with supplementary analysis that suggests strategies to reduce risk and mitigate its impact.
  • Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.
  • Generate special offers at the point of sale based on the customer’s current and past purchases, ensuring a higher customer retention rate.
  • Analyze data from social media to detect new market trends, changing customer perceptions and predict changes in demand.
  • Use pattern matching, fuzzy logic and deep data mining of internet click-streams to detect fraudulent behavior.
  • Identify and log root causes of failures, issues and defects by investigating user sessions, network logs and machine sensors.

 

 

Apache Spark speeds up big data processing by a factor of 10 to 100

Apache Spark speeds up big data processing by a factor of 10 to 100 and simplifies app development to such a degree that developers call it a “game changer.”

Apache Spark has been called a game changer and perhaps the most significant open source project of the next decade, and it’s been taking the big data world by storm since it was open sourced in 2010. Apache Spark is an open source data processing engine built for speed, ease of use and sophisticated analytics. Spark is designed to perform both batch processing and new workloads like streaming, interactive queries, and machine learning. “Spark is undoubtedly a force to be reckoned with in the big data ecosystem,” said Beth Smith, general manager of the Analytics Platform for IBM Analytics. IBM has invested heavily in Spark.

Meanwhile, in a talk at the Spark Summit East 2015, Matthew Glickman, a managing director at Goldman Sachs, said he realized Spark was something special when he attended last year’s Strata + Hadoop World conference in New York.

He said he went back to Goldman and “posted on our social media that I’d seen the future and it was Apache Spark. What did I see that was so game-changing? It was sort of to the same extent [as] when you first held an iPhone or when you first see a Tesla. It was completely game-changing.”

Matei Zaharia, co-founder and CTO of Databricks and the creator of Spark, told eWEEK that Spark started out in 2009 as a research project at the University of California, Berkeley, where he was working with early users of MapReduce and Hadoop, including Facebook and Yahoo. He said he found some common problems among those users, chief among them being that they all wanted to run more complex algorithms that couldn’t be done with just one MapReduce step.

“MapReduce is a simple way to scan through data and aggregate information in parallel and not every algorithm can be done with it,” Zaharia said. “So we wanted to create a more general programming model for people to write cluster applications that would be fast and efficient at these more complex types of algorithms.” Zaharia noted that the researchers he worked with also said MapReduce was not only slow for what they wanted to do, but they also found the process for writing applications “clumsy.” So he set out to deliver something better.
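To illustrate the kind of workload Zaharia is describing, here is a rough sketch of an iterative algorithm (a toy PageRank) written against Spark’s RDD API; each pass of the loop would otherwise have to be expressed as a separate MapReduce job. The input path, damping factor and iteration count are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toy-pagerank").getOrCreate()
sc = spark.sparkContext

# Edge list: whitespace-separated "source target" pairs, one per line.
links = (sc.textFile("/data/graph/edges.txt")
    .map(lambda line: tuple(line.split()))
    .groupByKey()
    .cache())  # reused on every iteration, so keep it in memory

ranks = links.mapValues(lambda _: 1.0)

for _ in range(10):  # each pass here would be a separate MapReduce job
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = (contribs.reduceByKey(lambda a, b: a + b)
        .mapValues(lambda rank: 0.15 + 0.85 * rank))

print(ranks.take(5))

The cache() call is the point: intermediate results stay in memory between iterations instead of being written back to disk after every step, which is where much of the claimed speed-up comes from.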

 

 

Read more at eWEEK: http://www.eweek.com/enterprise-apps/how-apache-spark-is-transforming-big-data-processing-development.html

 

 

 

Microsoft has unveiled its plans for an integrated big data package

Microsoft announced the Cortana Analytics Suite. It takes the company’s machine learning, big data and analytics products and packages them together in one huge, monolithic suite.

Microsoft has put together the suite with the hope of providing a one-stop, big data and analytics solution for enterprise customers.

“Our goal was to bring integration of these pieces so customers have a comprehensive platform to build intelligent solutions,” Joseph Sirosh, the corporate vice president at Microsoft in charge of Azure ML, told TechCrunch.

As for Cortana, which is the Microsoft voice-driven personal assistant tool in Windows 10, it’s a small part of the solution, but Sirosh says Microsoft named the suite after it because it symbolizes the contextualized intelligence that the company hopes to deliver across the entire suite.

The suite includes pieces like Azure ML, the company’s cloud machine learning product; Power BI, its data visualization tool; and Azure Data Catalog, a service announced just last week for sharing and surfacing data stores inside a company, among others. It aims to take advantage of a range of technologies, such as face and speech recognition, to deliver solutions like recommendation engines and churn forecasting.


It’s All About Integration

Microsoft expects that by providing an integrated solution, third parties and systems integrators will build packaged solutions based on the suite, and that customers will be attracted by a product with pieces designed to play nicely together. It is building in integration, thereby reducing the complexity of making these types of tools work together — at least that’s the theory.

“Where the suite provides value is the great interoperability, finished solutions, recipes and cookbooks,” Sirosh explained.

As an example, Microsoft talked about a coordinated medical care project at Dartmouth-Hitchcock Medical Center. The program, called ImagineCare, is built on top of the Cortana Analytics Suite and the Microsoft Dynamics CRM tool.

Tendron Systems technical director Alan Brown said that time would tell whether customers adopt this product. It may be late to the party, but it has a good specification.

Read more at:    http://techcrunch.com/2015/07/13/microsoft-unifies-big-data-and-analytics-in-newly-launched-suite/#.a5lxqkn:Cg6d

The next post is on Apache Spark, which is making significant inroads into the big data arena.

James Goode, Tendron Systems

Tendron Systems
Tendron Systems Ltd, Regent Street, London, W1B

Why Cloudera is saying ‘Goodbye, MapReduce’ and ‘Hello, Spark’

Cloudera, a company that helped popularize Hadoop as a platform for analyzing huge amounts of data when it was founded in 2008, is overhauling its core technology. The One Platform Initiative the company announced Wednesday lays out Cloudera’s plan to officially replace MapReduce with Apache Spark as the default processing engine for Hadoop.


Cloudera chief technologist Eli Collins said the company is “at best” halfway through the process from a technology standpoint and should be done in about a year. When complete, Spark should have similar levels of security, manageability, and scalability as MapReduce, and should be equally integrated with the rest of the technologies that comprise the ever-expanding Hadoop platform.

Collins said Spark’s existing weaknesses are “OK for early adopters, but really not acceptable to our customer base” as a whole. Cloudera says it has more than 100 customers running Spark in production—including Equifax, Experian, and CSC—but realizes that broader adoption and an improved Spark experience are a chicken-or-egg type of problem.

The history of the move to Spark is in some ways as old as Hadoop itself. Google created MapReduce in the early 2000s as a faster, easier implementation of existing parallel processing approaches, and the creators of Hadoop developed an open source version of Google’s work. However, while MapReduce proved revolutionary for early big data workloads (nearly every major web company is a heavy Hadoop user), its limitations became clearer as Hadoop and big data became mainstream technology movements.

Large enterprises, technology startups and other potential Hadoop users saw the potential in storing lots of data using the Hadoop file system and in analyzing that data, but they wanted something faster and more flexible than MapReduce. It was designed for indexing the web at places like Google and Yahoo, a batch-processing job where latency was measured in hours rather than milliseconds. MapReduce is also notoriously difficult to program, a problem that helped exacerbate the “big data skills gap” to which analyst firms and consultants have been pointing for years.

When Spark was created a few years ago at the University of California, Berkeley, it was the solution Hadoop vendors, Hadoop users, and venture capitalists alike needed to resolve their MapReduce woes. Spark is significantly faster and easier to program than MapReduce, meaning it can handle a much broader array of jobs. In fact, the project includes libraries for real-time data analysis, interactive SQL analysis, and machine learning, in addition to its core MapReduce-style engine.
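For a taste of the “interactive SQL analysis” side of that, here is a minimal sketch using Spark SQL over a DataFrame; the file path, view name and columns are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-sql").getOrCreate()

# Register an existing dataset as a temporary SQL view.
events = spark.read.parquet("/data/warehouse/clickstream")
events.createOrReplaceTempView("clickstream")

# An ad-hoc query, expressed in SQL rather than hand-written MapReduce code.
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM clickstream
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()

The same engine underpins the streaming and machine learning libraries mentioned above, which is much of the appeal of Spark as a single platform.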

When Tendron Systems evaluated Apache Spark, they were impressed with the performance improvements achieved, especially for scientific parallel algorithms in numerical analytics. Combined with the Spark Streaming modules, this makes Spark an impressive addition to Apache’s big data technology.

And better yet, Spark is designed to integrate with Hadoop’s native file system. This means Hadoop users don’t have to move their terabytes or even petabytes of data elsewhere in order to take advantage of Spark. By 2013, major VC firms had begun putting millions of dollars into Databricks, a startup founded by the creators of Spark, and major Hadoop vendors Cloudera, MapR, and Hortonworks were beginning to integrate Spark into their Hadoop distributions.
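As a minimal sketch of what that integration looks like in practice (the namenode address and path are hypothetical), Spark can simply read the data in place from HDFS:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-from-hdfs").getOrCreate()

# Point Spark directly at data that already lives in HDFS;
# nothing needs to be exported or copied to another system first.
logs = spark.read.text("hdfs://namenode:8020/data/weblogs/2015/")
print(logs.count())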

read more:

http://fortune.com/2015/09/09/cloudera-spark-mapreduce/


 

Alan

Tendron Systems Ltd

Facebook – Big Data London Group

Apache Spark – Executive Summary

Using World Bank Data in R with Shiny Dashboards

Introduction to Spark SQL