LinkedIn’s Gobblin: An Open Source Framework for Gobbling Big Data with Ease

The engineering team for social media service LinkedIn first launched Gobblin in 2014 as a universal data ingestion framework for offline big data, running on Hadoop in MapReduce mode. As new capabilities were added to support a spectrum of execution environments and a broad range of data velocities, Gobblin quickly evolved from a single-purpose data ingestion tool into a robust data management ecosystem. Gobblin was made open source in mid-2015 and has grown into a distributed data integration framework that simplifies the common aspects of big data, from ingestion to replication to organization, for complete lifecycle management across both streaming and batch environments.

Shortly after Gobblin’s second birthday, the team felt it was ready for the big time: joining other LinkedIn open source projects contributed to the Apache Software Foundation, including the Helix cluster management framework and the Kafka distributed streaming platform. Gobblin was accepted into the Apache Incubator in February 2017 and spent the following year completing the incubation process. The final step, contributing the actual code, was recently completed, and Gobblin has now become an official Apache entity.

Prior to incubation, Gobblin was already being embraced beyond LinkedIn by companies like Apple and PayPal. Organizations like CERN and Sandia National Laboratories that consume and crunch almost unimaginable amounts of data, 1PB per second in CERN’s case, also adopted Gobblin to help conduct their research.

The New Stack spoke with Abhishek Tiwari, Staff Software Engineer at LinkedIn, about Gobblin’s journey.

Why Apache, and what exactly is involved the incubation process?

Although Gobblin was already finding success with outside organizations, we believed that becoming an official project at one of the most influential open source organizations on the planet would ensure durability and self-sustenance, as well as access to a broader community that could help continue its evolution. Since starting the Apache incubation process early last year, we have already seen good progress on this front. Apache Gobblin community members have proposed, built, and started to spearhead a few critical developments, including Amazon Web Services mode enhancements and auto-scalability.

The first step in the process was the Gobblin incubation proposal, which was unanimously accepted by Apache. From there, we worked with mentors and champions to set up the code donation, sorted out licensing with the Microsoft legal team, and stood up Apache infrastructure, all before officially entering incubation.

What factors drove the evolution of Gobblin?

The original idea for Gobblin was to reduce the amount of repeated engineering and operations work in data ingestion across the company, where teams were building and maintaining disparate pipelines for different data sources across batch and streaming ecosystems. At one point, we were running more than 15 different kinds of pipelines, each with its own idiosyncrasies around error modes, data quality capabilities, scaling, and performance characteristics.

Our guiding vision for Gobblin has been to build a framework that can support data movement across streaming and batch sources and sinks without requiring a specific persistence or storage technology. At LinkedIn, Gobblin is currently integrated with more than a dozen data sources including Salesforce, Google Analytics, Oracle, LinkedIn Espresso, MySQL, SQL Server, Apache Kafka, patent and publication sources, etc.

Over the years, we’ve made strides in fulfilling that vision, but we have also grown into adjacent capabilities like end-to-end data management — from raw ingestion to multi-format storage optimizations, fine-grained deletions for compliance, config management, and more. The key aspect differentiating Gobblin from other technologies in the space is that it is intentionally agnostic to the compute and storage environment, yet it can execute natively in many environments. So, you don’t HAVE to run Hadoop or Kafka to be able to use Gobblin, though it can take advantage of these technologies if they are deployed in your company.
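To make that environment-agnostic point concrete, a Gobblin job is typically described by a small properties file that names a source, optional converters, and a writer/publisher. The following is an illustrative sketch modeled loosely on the project’s getting-started examples; the exact keys and class names are assumptions and vary by Gobblin version:

```properties
# Illustrative Gobblin job configuration (keys and class names follow the
# pattern of the project's published examples; verify against your version).
job.name=KafkaToHdfsIngest
job.group=ingestion

# Where records come from.
source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaSimpleSource

# Optional record-level transformations applied in flight.
converter.classes=org.apache.gobblin.converter.SamplingConverter

# Where the records land.
writer.builder.class=org.apache.gobblin.writer.SimpleDataWriterBuilder
writer.destination.type=HDFS
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```

The same job definition can then be run standalone, on YARN, or in other supported execution modes, which is the portability the interview describes.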

Is Gobblin mainly applicable to very large organizations munching serious data, or are there other use cases for different/smaller entities?

We see adoption across the board. Smaller entities often are not very vocal about their adoption because they’re too busy getting their startup off the ground. No matter the size, the common theme is that the business is data-driven, has multiple data sources and sinks, and has a Lambda architecture, that is, both a streaming and a batch ecosystem. Some examples of small and medium-sized companies using Gobblin are Prezi, Trivago, and NerdWallet.





Tendron Systems Ltd

Challenges of Big Data Systems

When dealing with huge volumes of data derived from multiple independent sources, it is a significant undertaking to connect, link, match, clean, and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies, and multiple data linkages, or your data can quickly spiral out of control. Data governance can help you determine how disparate data relates to common definitions and how to systematically integrate structured and unstructured data assets to produce high-quality information that is useful, appropriate, and up-to-date.
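The "match" step above is often the hardest part, because independent systems rarely agree on spelling or formatting. As a minimal sketch of the idea (the record layout, field names, and the 0.85 threshold here are illustrative assumptions, and real pipelines use far more robust entity-resolution tooling), fuzzy string similarity from Python's standard library can pair up records that refer to the same entity:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how closely two normalized strings match."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_records(source_a, source_b, threshold=0.85):
    """Pair up records from two systems whose names are close enough."""
    matches = []
    for a in source_a:
        for b in source_b:
            if similarity(a["name"], b["name"]) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches

# Two hypothetical systems holding the same customers under slightly
# different spellings.
crm = [{"id": 1, "name": "Acme Corporation"}, {"id": 2, "name": "Globex Ltd"}]
billing = [{"id": "B-7", "name": "ACME Corporation "}, {"id": "B-9", "name": "Initech"}]

print(match_records(crm, billing))  # → [(1, 'B-7')]
```

The quadratic all-pairs comparison is fine for a sketch but not at scale; production matching usually blocks candidate pairs first, which is exactly the kind of repeated engineering a governance framework aims to standardize.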

Big data technologies not only provide the infrastructure to collect large amounts of data, they provide the analytics to understand it and take advantage of its value. The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for optimized decision making.
Some examples:

  • Send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers.
  • Recalculate and revalue entire risk portfolios, and provide supplementary analysis with strategies to reduce risk and mitigate its impact.
  • Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.
  • Generate special offers at the point of sale based on the customer’s current and past purchases, ensuring a higher customer retention rate.
  • Analyze data from social media to detect new market trends, changing customer perceptions and predict changes in demand.
  • Use pattern matching, fuzzy logic and deep layer data mining of the Internet click-stream to detect fraudulent behavior.
  • Identify and log root causes of failures, issues and defects by investigating user sessions, network logs and machine sensors.
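To ground one of the bullets above, click-stream fraud detection often starts with simple behavioral rules before any deep mining is applied. This sketch is a hypothetical rate-based rule (the event format and the 30-clicks-per-minute ceiling are illustrative assumptions, not a real product's logic):

```python
from collections import defaultdict

def flag_suspicious_sessions(events, max_clicks_per_minute=30):
    """Flag sessions whose click rate exceeds a plausible human ceiling.

    `events` is an iterable of (session_id, timestamp_seconds) pairs.
    """
    clicks = defaultdict(list)
    for session_id, ts in events:
        clicks[session_id].append(ts)

    flagged = set()
    for session_id, stamps in clicks.items():
        stamps.sort()
        # Sliding window: too many clicks inside any 60-second span
        # is treated as suspicious.
        for i in range(len(stamps)):
            window = [t for t in stamps if stamps[i] <= t < stamps[i] + 60]
            if len(window) > max_clicks_per_minute:
                flagged.add(session_id)
                break
    return flagged

# A scripted "bot" clicking every second versus a leisurely human session.
events = [("bot", t) for t in range(0, 40)] + [("human", t) for t in (0, 15, 42, 70)]
print(flag_suspicious_sessions(events))  # → {'bot'}
```

Real systems layer fuzzy scoring and learned models on top of rules like this, but the rule stage is what catches the cheap, high-volume abuse first.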



Apache Spark speeds up big data processing by a factor of 10 to 100

Apache Spark speeds up big data processing by a factor of 10 to 100 and simplifies app development to such a degree that developers call it a “game changer.”

Apache Spark has been called a game changer and perhaps the most significant open source project of the next decade, and it’s been taking the big data world by storm since it was open sourced in 2010. Apache Spark is an open source data processing engine built for speed, ease of use, and sophisticated analytics. Spark is designed to perform both batch processing and newer workloads like streaming, interactive queries, and machine learning.

“Spark is undoubtedly a force to be reckoned with in the big data ecosystem,” said Beth Smith, general manager of the Analytics Platform for IBM Analytics. IBM has invested heavily in Spark. Meanwhile, in a talk at Spark Summit East 2015, Matthew Glickman, a managing director at Goldman Sachs, said he realized Spark was something special when he attended the previous year’s Strata + Hadoop World conference in New York.

He said he went back to Goldman and “posted on our social media that I’d seen the future and it was Apache Spark. What did I see that was so game-changing? It was sort of to the same extent [as] when you first held an iPhone or when you first see a Tesla. It was completely game-changing.”

Matei Zaharia, co-founder and CTO of Databricks and the creator of Spark, told eWEEK that Spark started out in 2009 as a research project at the University of California, Berkeley, where he was working with early users of MapReduce and Hadoop, including Facebook and Yahoo. He said he found some common problems among those users, chief among them that they all wanted to run more complex algorithms that couldn’t be done with just one MapReduce step.

“MapReduce is a simple way to scan through data and aggregate information in parallel and not every algorithm can be done with it,” Zaharia said. “So we wanted to create a more general programming model for people to write cluster applications that would be fast and efficient at these more complex types of algorithms.” Zaharia noted that the researchers he worked with also said MapReduce was not only slow for what they wanted to do, but they also found the process for writing applications “clumsy.” So he set out to deliver something better.
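Zaharia's point about algorithms needing more than one MapReduce step is easiest to see with an iterative computation. This plain-Python sketch (not Spark or Hadoop code; the three-page link graph is a made-up example) runs a simplified PageRank, where every iteration is one full map-plus-reduce pass. On Hadoop-1 each such pass is a separate job that rereads its input from disk, which is exactly the overhead Spark avoids by caching data in memory between passes:

```python
def map_contributions(links, ranks):
    """Map step: each page sends rank / out-degree to its neighbors."""
    for page, neighbors in links.items():
        share = ranks[page] / len(neighbors)
        for n in neighbors:
            yield n, share

def reduce_ranks(contribs, pages, damping=0.85):
    """Reduce step: sum contributions per page and apply damping."""
    totals = {p: 0.0 for p in pages}
    for page, share in contribs:
        totals[page] += share
    return {p: (1 - damping) + damping * totals[p] for p in pages}

# A toy link graph: a links to b and c, b to c, c back to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {p: 1.0 for p in links}

# Ten iterations: on Hadoop-1 each loop body would be a separate
# MapReduce job with a disk round-trip; Spark would keep `links`
# and `ranks` cached in memory across iterations.
for _ in range(10):
    ranks = reduce_ranks(map_contributions(links, ranks), links.keys())

print({p: round(r, 3) for p, r in ranks.items()})
```

The algorithm itself is only a few lines; the pain in classic MapReduce was the per-iteration job launch and I/O, not the logic.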



Read more at eWEEK.




Microsoft has unveiled its plans for an integrated big data package

Microsoft announced the Cortana Analytics Suite. It takes the company’s machine learning, big data and analytics products and packages them together in one huge, monolithic suite.

Microsoft has put together the suite with the hope of providing a one-stop, big data and analytics solution for enterprise customers.

“Our goal was to bring integration of these pieces so customers have a comprehensive platform to build intelligent solutions,” Joseph Sirosh, corporate vice president at Microsoft in charge of Azure ML, told TechCrunch.

As for Cortana, Microsoft’s voice-driven personal assistant in Windows 10, it’s a small part of the solution, but Sirosh says Microsoft named the suite after it because it symbolizes the contextualized intelligence that the company hopes to deliver across the entire suite.

It includes pieces like Azure ML, the company’s cloud machine learning product; Power BI, its data visualization tool; and Azure Data Catalog, a service announced just last week for sharing and surfacing data stores inside a company, among others. Microsoft hopes to take advantage of a range of technologies such as face and speech recognition to generate solutions like recommendation engines and churn forecasting.


It’s All About Integration

Microsoft expects that by providing an integrated solution, third parties and systems integrators will build packaged solutions based on the suite, and that customers will be attracted by a product with pieces designed to play nicely together. It is building in integration, thereby reducing the complexity of making these types of tools work together — at least that’s the theory.

“Where the suite provides value is the great interoperability, finished solutions, recipes and cookbooks,” Sirosh explained.

As an example, Microsoft talked about a coordinated medical care project at Dartmouth-Hitchcock Medical Center. The program, called ImagineCare, is built on top of the Cortana Analytics Suite and the Microsoft Dynamics CRM tool.

Tendron Systems technical director Alan Brown stated that time would tell whether customers adopt this product; it may be late to the party, but it has a good specification.


The next post is on Apache Spark, which is making significant inroads into the big data arena.

James Goode, Tendron Systems

Tendron Systems Ltd, Regent Street, London, W1B

IBM Backs Apache Spark For Big Data Applications

Technology giant IBM has thrown its full weight behind Spark, Apache’s open-source cluster computing framework.

Spark will form the basis of all of Big Blue’s analytics and commerce platforms and its Watson Health Cloud. The framework will also be sold as a service on its Bluemix cloud.

IBM will commit more than 3,500 of its researchers and developers to Spark-related projects and promised a Spark Technology Center in San Francisco, California, where data scientists and developers can work with IBM designers and architects.

Spark began life in 2009 as a project at UC Berkeley in California, quickly delivering in-memory performance as much as 100 times that of the MapReduce framework that originally underpinned Apache Hadoop. Hadoop has moved on since then, adopting other, faster and more flexible, ways of working. Spark has also progressed, adding increasingly capable disk-based performance to complement its in-memory strengths and establishing itself as a strong contender, particularly for machine learning tasks. Spark moved to the Apache Software Foundation in 2013, becoming a top-level project in 2014.

In 2013, members of the original Berkeley team established the company now known as Databricks to build a business around Apache Spark. The company launched with almost $14 million from Andreessen Horowitz and others, and secured a further $33 million a year ago. Nevertheless, Spark is not without competitors of its own. Flink, also a top-level project of the Apache Software Foundation, has begun to attract many of the same admiring comments directed Spark’s way 12 to 18 months ago. Despite sound technical credentials, ongoing development, big investments, and today’s high-profile endorsement from IBM, it would be premature to crown Spark as the winner just yet.

Offering APIs in Java, Scala, and Python, Spark is an in-memory system for processing large data sets. It consists of a scheduling and dispatching engine, a SQL-style query language, a machine-learning framework, and a distributed graph processing framework.

Several key technology companies are likely to invest in their Spark infrastructure as a direct result of IBM’s initiative, including Databricks, Tendron Systems, and major consultancies.

Spark can scale to more than 8,000 production nodes and, while it works with Hadoop and MapReduce, is claimed to be substantially faster on certain workloads.


Facebook reveals news feed experiment to control emotions

Protests over secret study involving 689,000 users in which friends’ postings were moved to influence moods



It already knows whether you are single or dating, the first school you went to and whether you like or loathe Justin Bieber. But now Facebook, the world’s biggest social networking site, is facing a storm of protest after it revealed it had discovered how to make users feel happier or sadder with a few computer key strokes.

It has published details of a vast experiment in which it manipulated information posted on 689,000 users’ home pages and found it could make people feel more positive or negative through a process of “emotional contagion”.

In a study with academics from Cornell and the University of California, Facebook filtered users’ news feeds – the flow of comments, videos, pictures and web links posted by other people in their social network. One test reduced users’ exposure to their friends’ “positive emotional content”, resulting in fewer positive posts of their own. Another test reduced exposure to “negative emotional content” and the opposite happened.

The study concluded: “Emotions expressed by friends, via online social networks, influence our own moods, constituting, to our knowledge, the first experimental evidence for massive-scale emotional contagion via social networks.”

Lawyers, internet activists and politicians said this weekend that the mass experiment in emotional manipulation was “scandalous”, “spooky” and “disturbing”.

On Sunday evening, a senior British MP called for a parliamentary investigation into how Facebook and other social networks manipulated emotional and psychological responses of users by editing information supplied to them.

Jim Sheridan, a member of the Commons media select committee, said the experiment was intrusive. “This is extraordinarily powerful stuff and if there is not already legislation on this, then there should be to protect people,” he said. “They are manipulating material from people’s personal lives and I am worried about the ability of Facebook and others to manipulate people’s thoughts in politics or other areas. If people are being thought-controlled in this kind of way there needs to be protection and they at least need to know about it.”

A Facebook spokeswoman said the research, published this month in the journal Proceedings of the National Academy of Sciences in the US, was carried out “to improve our services and to make the content people see on Facebook as relevant and engaging as possible”.

She said: “A big part of this is understanding how people respond to different types of content, whether it’s positive or negative in tone, news from friends, or information from pages they follow.”

But other commentators voiced fears that the process could be used for political purposes in the runup to elections or to encourage people to stay on the site by feeding them happy thoughts and so boosting advertising revenues.

In a series of Twitter posts, Clay Johnson, the co-founder of Blue State Digital, the firm that built and managed Barack Obama’s online campaign for the presidency in 2008, said: “The Facebook ‘transmission of anger’ experiment is terrifying.”

He asked: “Could the CIA incite revolution in Sudan by pressuring Facebook to promote discontent? Should that be legal? Could Mark Zuckerberg swing an election by promoting Upworthy [a website aggregating viral content] posts two weeks beforehand? Should that be legal?”

It was claimed that Facebook may have breached ethical and legal guidelines by not informing its users they were being manipulated in the experiment, which was carried out in 2012.

The study said altering the news feeds was “consistent with Facebook’s data use policy, to which all users agree prior to creating an account on Facebook, constituting informed consent for this research”.

But Susan Fiske, the Princeton academic who edited the study, said she was concerned. “People are supposed to be told they are going to be participants in research and then agree to it and have the option not to agree to it without penalty.”

James Grimmelmann, professor of law at the University of Maryland, said Facebook had failed to gain “informed consent” as defined by the US federal policy for the protection of human subjects, which demands explanation of the purposes of the research and the expected duration of the subject’s participation, a description of any reasonably foreseeable risks and a statement that participation is voluntary. “This study is a scandal because it brought Facebook’s troubling practices into a realm – academia – where we still have standards of treating people with dignity and serving the common good,” he said on his blog.

It is not new for internet firms to use algorithms to select content to show to users, and Jacob Silverman, author of Terms of Service: Social Media, Surveillance, and the Price of Constant Connection, told Wired magazine on Sunday the internet was already “a vast collection of market research studies; we’re the subjects”.

“What’s disturbing about how Facebook went about this, though, is that they essentially manipulated the sentiments of hundreds of thousands of users without asking permission,” he said. “Facebook cares most about two things: engagement and advertising. If Facebook, say, decides that filtering out negative posts helps keep people happy and clicking, there’s little reason to think that they won’t do just that. As long as the platform remains such an important gatekeeper – and their algorithms utterly opaque – we should be wary about the amount of power and trust we delegate to it.”



Serious issues of privacy remain to be adequately addressed. It appears we need more legal expertise made available to study the issues created by the use of big data on social media.

Hadoop’s big leap forward

The new Hadoop is nothing less than the Apache Foundation’s attempt to create a whole new general framework for the way big data can be stored, mined, and processed.

Hadoop is the foundation of most big data architectures. The progression from Hadoop-1’s more restricted processing model of batch-oriented MapReduce jobs to the more interactive and specialized processing models of Hadoop-2 will only further position the Hadoop ecosystem as the dominant big data analysis platform.

Hadoop-1 popularized Google’s MapReduce programming concept for batch jobs and demonstrated the potential value of large-scale, distributed processing. MapReduce, as implemented in Hadoop-1, can be I/O intensive, making it unsuitable for interactive analysis. Hadoop developers rewrote major components of the framework to create Hadoop-2.
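The MapReduce concept itself is compact: a map phase emits key-value pairs, the framework shuffles them together by key, and a reduce phase aggregates each group. This plain-Python sketch of the classic word-count job (not Hadoop code; the two toy documents are made up) shows the three phases in miniature:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group intermediate pairs by key, as the framework would."""
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(count for _, count in group) for word, group in grouped}

docs = ["big data big plans", "data at rest"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # → {'at': 1, 'big': 2, 'data': 2, 'plans': 1, 'rest': 1}
```

In Hadoop the map and reduce functions run on many nodes in parallel and the shuffle moves data over the network and disk, which is where the I/O cost the paragraph above mentions comes from.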

With Hadoop-2, the JobTracker approach has been scrapped. Instead, Hadoop uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what’s happening on that node.

Two of the most important advances in Hadoop-2 are, firstly, the introduction of HDFS federation and, secondly, the new resource manager YARN (Yet Another Resource Negotiator).

YARN, the resource manager, was  created by separating the processing engine and resource management capabilities of MapReduce as it was implemented in Hadoop-1. YARN is often called the operating system of Hadoop as it is responsible for managing and monitoring system workloads.
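As a concrete touchpoint, a YARN cluster is wired together through the `yarn-site.xml` configuration file, which tells each NodeManager where the ResourceManager lives and which auxiliary services to run. The property names below are standard Hadoop-2 settings, but the hostname and memory figure are illustrative placeholders:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where NodeManagers find the cluster-wide ResourceManager. -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm.example.internal</value>
  </property>
  <!-- Auxiliary service MapReduce jobs need for the shuffle phase. -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Memory this node offers to YARN containers, in megabytes. -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
</configuration>
```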

HDFS is the Hadoop file system and comprises two major components: a namespace service and a block storage service. The namespace service manages operations on files and directories, such as creating and modifying them.

Another Hadoop vendor, Hortonworks, has chosen to go with Apache’s native Hive technology, which is best suited to data warehouse-style operations involving table joins and merges.



