Cloudera, a company that helped popularize Hadoop as a platform for analyzing huge amounts of data when it was founded in 2008, is overhauling its core technology. The One Platform Initiative the company announced Wednesday lays out Cloudera’s plan to officially replace MapReduce with Apache Spark as the default processing engine for Hadoop.
Cloudera chief technologist Eli Collins said the company is “at best” halfway through the process from a technology standpoint and should be done in about a year. When the work is complete, Spark should offer levels of security, manageability, and scalability similar to those of MapReduce, and should be equally integrated with the rest of the technologies that make up the ever-expanding Hadoop platform.
Collins said Spark’s existing weaknesses are “OK for early adopters, but really not acceptable to our customer base” as a whole. Cloudera says it has more than 100 customers running Spark in production—including Equifax, Experian, and CSC—but realizes that broader adoption and an improved Spark experience are a chicken-or-egg type of problem.
The history of the move to Spark is in some ways as old as Hadoop itself. Google created MapReduce in the early 2000s as a faster, easier implementation of existing parallel processing approaches, and the creators of Hadoop developed an open source version of Google’s work. However, while MapReduce proved revolutionary for early big data workloads (nearly every major web company is a heavy Hadoop user), its limitations became clearer as Hadoop and big data became mainstream technology movements.
Large enterprises, technology startups, and other would-be Hadoop users saw the potential in storing lots of data using the Hadoop file system and in analyzing that data, but they wanted something faster and more flexible than MapReduce. MapReduce was designed for indexing the web at places like Google and Yahoo, a batch-processing job where latency was measured in hours rather than milliseconds. MapReduce is also notoriously difficult to program, a problem that exacerbated the “big data skills gap” to which analyst firms and consultants have been pointing for years.
When Spark was created a few years ago at the University of California, Berkeley, it was the solution Hadoop vendors, Hadoop users, and venture capitalists alike needed to resolve their MapReduce woes. Spark is significantly faster and easier to program than MapReduce, meaning it can handle a much broader array of jobs. In fact, the project includes libraries for real-time data analysis, interactive SQL analysis, and machine learning, in addition to its core MapReduce-style engine.
When Tendron Systems evaluated Apache Spark, it was impressed with the performance improvements achieved, especially for scientific parallel algorithms used in numerical analytics. Combined with the Spark Streaming modules, this makes Spark an impressive addition to Apache big data technology.
And better yet, Spark is designed to integrate with Hadoop’s native file system. This means Hadoop users don’t have to move their terabytes or even petabytes of data elsewhere in order to take advantage of Spark. By 2013, major VC firms had begun putting millions of dollars into Databricks, a startup founded by the creators of Spark, and major Hadoop vendors Cloudera, MapR, and Hortonworks were beginning to integrate Spark into their Hadoop distributions.