LinkedIn’s Gobblin: An Open Source Framework for Gobbling Big Data with Ease

The engineering team for social media service LinkedIn first launched Gobblin in 2014 as a universal data ingestion framework for offline big data, running on Hadoop in MapReduce mode. As new capabilities were added to enable the framework to support a spectrum of execution environments and scale to handle a broad range of data velocities, Gobblin quickly evolved from a single-purpose data ingestion tool into a robust data management ecosystem. Gobblin was made open source in mid-2015 and has grown into a distributed data integration framework that simplifies the common aspects of big data, from ingestion to replication to organization, for complete lifecycle management across both streaming and batch environments.

Shortly after Gobblin’s second birthday, the team felt it was ready for the big time: joining other LinkedIn open source projects contributed to the Apache Software Foundation, including the Helix cluster management framework and the Kafka distributed streaming platform. Gobblin was accepted into the Apache Incubator Project in February 2017, and spent the year since then completing the internal process. The final step, contributing the actual code, was recently completed, and Gobblin has now become an official Apache entity.

Prior to incubation, Gobblin was already being embraced beyond LinkedIn by companies like Apple and PayPal. Organizations like CERN and Sandia National Laboratories, which consume and crunch enormous amounts of data (1PB per second, in CERN’s case), also adopted Gobblin to help conduct their research.

The New Stack spoke with Abhishek Tiwari, Staff Software Engineer at LinkedIn, about Gobblin’s journey.

Why Apache, and what exactly is involved in the incubation process?

Although Gobblin was already finding success with outside organizations, we believed that becoming an official project at one of the most influential open source organizations on the planet would ensure durability and self-sustenance, as well as access to a broader community that could help continue its evolution. Since starting the Apache Incubation process early last year, we have already seen good progress on this front. Apache Gobblin community members have proposed, built, and started to spearhead a few critical developments, including Amazon Web Services mode enhancements and auto-scalability.

The first step in the process was the Gobblin incubation proposal, which was unanimously accepted by Apache. Then came working with mentors and champions to set up the code donation and licenses, working with the Microsoft legal team, and setting up Apache infrastructure, all before officially getting incubated.

What factors drove the evolution of Gobblin?

The original idea for Gobblin was to reduce the amount of repeated engineering and operations work in data ingestion across the company, where teams were building and maintaining disparate pipelines for different data sources across batch and streaming ecosystems. At one point, we were running more than 15 different kinds of pipelines, each with its own idiosyncrasies around error modes, data quality capabilities, scaling, and performance characteristics.

Our guiding vision for Gobblin has been to build a framework that can support data movement across streaming and batch sources and sinks without requiring a specific persistence or storage technology. At LinkedIn, Gobblin is currently integrated with more than a dozen data sources, including Salesforce, Google Analytics, Oracle, LinkedIn Espresso, MySQL, SQL Server, Apache Kafka, and patent and publication sources.

Over the years, we’ve made strides in fulfilling that vision, but have also grown into adjacent capabilities like end-to-end data management: from raw ingestion to multi-format storage optimizations, fine-grained deletions for compliance, config management, and more. The key aspect differentiating Gobblin from other technologies in the space is that it is intentionally agnostic to the compute and storage environment, yet it can execute natively in many environments. So, you don’t HAVE to run Hadoop or Kafka to be able to use Gobblin, though it can take advantage of these technologies if they are deployed in your company.
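For readers who want a concrete picture of what "agnostic to the compute and storage environment" can mean, here is a minimal Python sketch of one pipeline loop that serves both a batch-style and a stream-style source. It is an illustration only, not Gobblin's API (Gobblin itself is a Java framework). Every name in it (`run_pipeline`, `batch_source`, `stream_source`, `add_flag`) is hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# All names here are hypothetical; this is not Gobblin's API.
Record = Dict[str, object]
Source = Callable[[], Iterable[Record]]
Converter = Callable[[Record], Record]
Sink = Callable[[Record], None]


def batch_source() -> Iterator[Record]:
    """Stands in for a bounded source, such as a database table."""
    for i in range(3):
        yield {"id": i, "origin": "batch"}


def stream_source() -> Iterator[Record]:
    """Stands in for a stream; bounded here so the example terminates."""
    for i in range(2):
        yield {"id": i, "origin": "stream"}


def add_flag(record: Record) -> Record:
    """A trivial conversion step applied to every record."""
    return {**record, "ingested": True}


def run_pipeline(source: Source, convert: Converter, sink: Sink) -> int:
    """Move records from any source to any sink; returns the record count.

    The loop knows nothing about where records come from or where they go.
    """
    count = 0
    for record in source():
        sink(convert(record))
        count += 1
    return count


# An in-memory list plays the role of the sink in this sketch.
collected: List[Record] = []
n = run_pipeline(batch_source, add_flag, collected.append)
n += run_pipeline(stream_source, add_flag, collected.append)
print(n, collected[0])
```

The point of the design is that adding a new source or sink means writing one small adapter rather than a whole new pipeline.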

Is Gobblin mainly applicable to very large organizations munching serious data, or are there other use cases for different/smaller entities?

We see adoption across the board. Smaller entities often are not very vocal about their adoption because they’re too busy getting their startup off the ground. No matter what the size, the common theme is that the business is data-driven, has multiple data sources and sinks, and has a Lambda architecture, with both streaming and batch ecosystems. Some examples of small and medium-sized companies using Gobblin are Prezi, Trivago and NerdWallet.


Tendron Systems Ltd