“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
- Hadoop Common – libraries and utilities needed by the other Hadoop modules
- Hadoop Distributed File System (HDFS) – a distributed file system that stores data across the machines in the cluster, providing very high aggregate bandwidth
- Hadoop YARN – a resource-management platform responsible for managing compute resources in the cluster and scheduling users' applications
- Hadoop MapReduce – a programming model for processing large data sets in parallel across the cluster
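The MapReduce model from the list above can be sketched in plain Python. This is illustrative only, not Hadoop's actual API: a real Hadoop job is written in Java against the Mapper/Reducer classes and runs on the cluster, but the three phases below (map, shuffle/sort, reduce) are the same idea.

```python
from collections import defaultdict

def map_phase(text):
    # Mapper: emit a (word, 1) pair for every word in the input split.
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase("the quick brown fox the lazy dog the")))
print(counts["the"])  # → 3
```

In Hadoop the map and reduce tasks run on many machines at once, with the shuffle moving intermediate data between them; here all three steps run in one process purely to show the dataflow.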
Hadoop Distributed File System
A distributed, scalable, and portable file system written in Java for the Hadoop framework. It includes a secondary NameNode, which connects to the primary NameNode and builds snapshots of the primary's directory structure.
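The core HDFS idea can be sketched as follows: files are split into fixed-size blocks, and each block is replicated across several DataNodes while the NameNode tracks the placement. This is a simplified model, not HDFS code; the block size and replication factor mirror common defaults (128 MB blocks, 3 replicas) scaled down, and the round-robin placement stands in for HDFS's real rack-aware policy.

```python
BLOCK_SIZE = 4    # bytes here; stands in for the 128 MB HDFS default
REPLICATION = 3   # common HDFS default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # A file is stored as a sequence of fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=REPLICATION):
    # Assign each block to `replication` distinct DataNodes (round-robin
    # for simplicity; real HDFS placement is rack-aware).
    placement = {}
    for n, _block in enumerate(blocks):
        placement[n] = [datanodes[(n + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs world")        # 16 bytes → 4 blocks
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), placement[0])
```

Because every block lives on multiple DataNodes, the loss of any single machine costs no data, which is how the library "detects and handles failures at the application layer" per the quote above.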
MapReduce and YARN
There are different ways to submit and track jobs. Job tracking was overhauled in Hadoop 0.23: MapReduce 2.0 (MRv2), referred to as YARN, splits the major functions of the JobTracker into a separate resource manager and scheduler. YARN remains compatible with existing MapReduce applications.
History
Hadoop grew out of Google's papers on Bigtable, MapReduce, and the Google File System (GFS). Users wanted to be able to access the data with an SQL-style language, and Facebook, LinkedIn, and Yahoo each had similar products in their stacks.