The new Hadoop is nothing less than the Apache Software Foundation’s attempt to create an entirely new general framework for the way big data is stored, mined, and processed.
Hadoop is the foundation of most big data architectures. The progression from Hadoop-1’s more restricted processing model of batch-oriented MapReduce jobs to the more interactive and specialized processing models of Hadoop-2 will only further position the Hadoop ecosystem as the dominant big data analysis platform.
Hadoop-1 popularized Google’s MapReduce programming model for batch jobs and demonstrated the potential value of large-scale, distributed processing. MapReduce, as implemented in Hadoop-1, can be I/O intensive, which makes it unsuitable for interactive analysis. To address these limitations, Hadoop developers rewrote major components of the framework to create Hadoop-2.
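To make the batch model concrete, here is a minimal, self-contained sketch of the three MapReduce phases in plain Python. This is only an in-memory illustration of the concept; in Hadoop, the intermediate (key, value) pairs are written to disk and shuffled across the cluster, which is exactly the I/O cost described above.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit an intermediate (word, 1) pair for every word in the input."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key (spilled to disk in real Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big jobs", "big data analysis"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'jobs': 1, 'analysis': 1}
```

Every job, however small, pays the full map-shuffle-reduce round trip, which is why the model suits large batch workloads better than interactive queries.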
With Hadoop-2, the JobTracker approach has been scrapped. Instead, Hadoop uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what’s happening on that node.
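The division of labor between the two daemons can be sketched roughly as follows. This is a toy model, not Hadoop’s actual API: the class and method names (`heartbeat`, `report`, `total_free_memory`) are invented for illustration, but the shape — per-node agents periodically reporting status to a central authority — mirrors the ResourceManager/NodeManager design.

```python
class ResourceManager:
    """Central daemon: tracks cluster-wide state from node reports (illustrative only)."""
    def __init__(self):
        self.nodes = {}  # node_id -> latest reported status

    def heartbeat(self, node_id, free_memory_mb, running_containers):
        # Record what each node last told us about itself.
        self.nodes[node_id] = {
            "free_memory_mb": free_memory_mb,
            "running_containers": running_containers,
        }

    def total_free_memory(self):
        # Scheduling decisions are made against this aggregate view.
        return sum(n["free_memory_mb"] for n in self.nodes.values())

class NodeManager:
    """Per-node daemon: keeps the ResourceManager informed about its node."""
    def __init__(self, node_id, rm):
        self.node_id = node_id
        self.rm = rm

    def report(self, free_memory_mb, running_containers):
        self.rm.heartbeat(self.node_id, free_memory_mb, running_containers)

rm = ResourceManager()
NodeManager("node-1", rm).report(4096, 2)
NodeManager("node-2", rm).report(2048, 5)
print(rm.total_free_memory())  # 6144
```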
Two of the most important advances in Hadoop-2 are the introduction of HDFS federation and the new resource manager, YARN (Yet Another Resource Negotiator).
YARN, the resource manager, was created by separating the processing engine from the resource-management capabilities of MapReduce as it was implemented in Hadoop-1. YARN is often called the operating system of Hadoop because it is responsible for managing and monitoring system workloads.
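Administrators configure YARN’s two daemons through `yarn-site.xml`. A minimal, illustrative fragment might look like the following; the hostname and memory values are placeholders, and a real deployment would tune many more properties.

```xml
<?xml version="1.0"?>
<!-- yarn-site.xml: minimal illustrative settings; values are placeholders -->
<configuration>
  <property>
    <!-- Where NodeManagers find the central ResourceManager daemon -->
    <name>yarn.resourcemanager.hostname</name>
    <value>rm.example.com</value>
  </property>
  <property>
    <!-- Memory on each node that the NodeManager may hand out to containers -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <!-- Auxiliary service that lets MapReduce jobs shuffle data under YARN -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```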
HDFS, the Hadoop file system, comprises two major components: a namespace service and a block storage service. The namespace service manages operations on files and directories, such as creating and modifying them.
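The split between the two services can be sketched in a few lines of Python. This is a toy model rather than HDFS’s NameNode/DataNode API: the class names and the 128 MB default block size stand in for the real machinery, but the key idea — the namespace maps paths to block IDs while block storage knows nothing about file paths — is the same.

```python
import itertools

class NamespaceService:
    """Toy namespace: maps file paths to lists of block IDs."""
    def __init__(self):
        self.files = {}              # path -> [block_id, ...]
        self._ids = itertools.count()

    def create(self, path, size_bytes, block_size=128 * 1024 * 1024):
        # Split the file into fixed-size blocks (ceiling division).
        n_blocks = max(1, -(-size_bytes // block_size))
        self.files[path] = [next(self._ids) for _ in range(n_blocks)]
        return self.files[path]

class BlockStorageService:
    """Toy block store: holds block contents by ID, with no notion of paths."""
    def __init__(self):
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

ns = NamespaceService()
blocks = ns.create("/logs/clicks.log", 300 * 1024 * 1024)
print(len(blocks))  # 3 blocks for a 300 MB file with a 128 MB block size
```

Because block storage is independent of the namespace, HDFS federation can run several namespaces over the same pool of block storage.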
Another Hadoop vendor, Hortonworks, has chosen to go with Apache’s native Hive technology, which is best suited for data warehouse-style operations involving table joins and merges.