Hadoop
Introduction
A lot of problems arise while dealing with massive amounts of data. The data can be “massive” in terms of at least one of the 5 V’s: Volume, Velocity, Variety, Veracity, and Value. A very effective way to deal with these problems is to distribute the work across a cluster of machines, i.e., a multi-node setup, which speeds up tasks like storing and processing data through parallelization. Hadoop is an open source framework designed to handle such massive amounts of data in a distributed and scalable way. Although Hadoop is less commonly used in industry today, understanding it is crucial for understanding the newer technologies that replaced it, e.g., Spark.
The Hadoop distributed file system (HDFS), introduced by Shvachko et al. (2010), sits at the core of this framework; I discuss it and the other components below.
Properties of Hadoop
Some major properties of Hadoop are the following:
- Scalability:
- Hadoop can scale horizontally. In other words, we can, in principle, add as many machines as we like to a Hadoop cluster.
- Fault tolerance:
- To deal with faults, i.e., machine failures, Hadoop maintains copies of the data, also known as replicas. Thus, even if a machine fails, the data remains available through its replicas (a minimal sketch of controlling replication follows this list).
- Distributed processing:
- Hadoop processes the data where it is stored: instead of moving the data close to where the processing happens, it moves the processing closer to the data. This improves the overall efficiency of the system.
- Cost effectiveness:
- Hadoop runs on inexpensive commodity hardware, which makes it highly cost effective.
- Open source:
- Hadoop is free to use and modify, under the Apache license.
It was largely due to these properties that Hadoop was widely adopted.
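To make the fault-tolerance point concrete, here is a minimal sketch that sets and reads back the replication factor of a single file through Hadoop's Java FileSystem API. The file path and the factor of 3 are made-up values for illustration; the cluster settings are assumed to come from the configuration files on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    // Cluster settings are read from core-site.xml / hdfs-site.xml.
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical file path, for illustration only.
    Path file = new Path("/data/events.log");

    // Ask HDFS to keep 3 copies of each of this file's blocks.
    fs.setReplication(file, (short) 3);

    // Read back the replication factor HDFS reports for the file.
    short replicas = fs.getFileStatus(file).getReplication();
    System.out.println("Replication factor: " + replicas);
  }
}
```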
Hadoop Ecosystem
The Hadoop ecosystem is a collection of open source projects, tools, and components that work together to store, process, and analyze large amounts of data. Some components were present in Hadoop when it was introduced, while others were added later. The components I will discuss are shown in Figure 1.

Hadoop Distributed File System (HDFS)
HDFS was inspired by the Google File System (GFS), introduced by Ghemawat et al. (2003). HDFS provides distributed storage for massive amounts of data.
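As a quick illustration of how applications talk to HDFS, here is a minimal sketch that writes a small file and reads it back through the Java FileSystem API. The NameNode address and the file path are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; replace with your cluster's.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt");

    // Write a small file into HDFS (overwriting if it exists).
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("hello, HDFS\n");
    }

    // Read the same file back.
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}
```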
MapReduce
Hadoop MapReduce was inspired by Google's MapReduce, introduced by Dean and Ghemawat (2008). It processes data in parallel by splitting a job into a map phase and a reduce phase that run across the cluster.
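The canonical example of this model is word counting: the mapper emits a (word, 1) pair for every word, and the reducer sums the counts per word. Below is a minimal sketch of it using Hadoop's Java MapReduce API; input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```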
Yet Another Resource Negotiator (YARN)
Before YARN was introduced, MapReduce itself handled all the resources corresponding to a job. YARN decouples resource management from application execution.
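As a small illustration of this decoupling, the sketch below uses the YarnClient API to ask the resource manager which worker nodes are currently live, independently of any particular application. It assumes a yarn-site.xml on the classpath pointing at a running cluster.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnNodes {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new Configuration()); // reads yarn-site.xml from the classpath
    yarn.start();

    // List the worker nodes (NodeManagers) that are currently live.
    List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId() + " -> " + node.getCapability());
    }

    yarn.stop();
  }
}
```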
HDFS, MapReduce, and YARN are the core components and are available by default in every Hadoop installation. The components I mention next were added later by others.
Hive
Hive, introduced by Thusoo et al. (2009), provides a data warehousing layer on top of Hadoop and lets us query the stored data with a SQL-like language, which it compiles down to MapReduce jobs.
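Since HiveServer2 speaks JDBC, a Java application can query Hive much like any relational database. Below is a minimal sketch; the endpoint, credentials, and the events table are assumptions for illustration, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 endpoint; 10000 is its usual default port.
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "user", "");
         Statement stmt = conn.createStatement();
         // `events` is a made-up table used for illustration.
         ResultSet rs = stmt.executeQuery(
             "SELECT country, COUNT(*) FROM events GROUP BY country")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```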
Pig
Pig allows us to work with the data in a high-level scripting language called Pig Latin, which it compiles into MapReduce jobs. It was introduced by Olston et al. (2008).
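Pig Latin scripts can also be driven from Java through the PigServer API, which makes for a self-contained sketch. The input file and its two-column layout below are made up for illustration.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigDemo {
  public static void main(String[] args) throws Exception {
    // Local mode runs against the local filesystem;
    // use ExecType.MAPREDUCE to run on a cluster instead.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // The input file and its (user, clicks) layout are assumptions.
    pig.registerQuery(
        "records = LOAD 'input.txt' AS (user:chararray, clicks:int);");
    pig.registerQuery("grouped = GROUP records BY user;");
    pig.registerQuery(
        "totals = FOREACH grouped GENERATE group, SUM(records.clicks);");

    // Materialize the result into an output directory.
    pig.store("totals", "output");
  }
}
```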
Sqoop
Traditionally, data is stored in relational databases like Oracle, MySQL, etc. Sqoop facilitates importing and exporting data between Hadoop and these relational databases.
Oozie
Oozie, introduced by Islam et al. (2012), is a workflow scheduler: it lets us define a sequence of Hadoop jobs and manages their execution.
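A workflow that has been deployed to HDFS can be submitted programmatically through the OozieClient API. The sketch below is a minimal example; the server URL and the workflow path are assumptions for illustration.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
  public static void main(String[] args) throws Exception {
    // Hypothetical Oozie server URL; 11000 is its usual default port.
    OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

    // Point the job at a (made-up) workflow application deployed on HDFS.
    Properties props = oozie.createConfiguration();
    props.setProperty(OozieClient.APP_PATH,
                      "hdfs://localhost:9000/user/me/my-wf");

    // Submit and start the workflow, then check on its status.
    String jobId = oozie.run(props);
    System.out.println("Status: " + oozie.getJobInfo(jobId).getStatus());
  }
}
```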
HBase
HBase, inspired by Google Bigtable (Chang et al., 2008), is a distributed NoSQL database that provides random, real-time read/write access to data stored on top of HDFS.
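The sketch below shows the flavor of HBase's Java client API: writing one cell and reading it back by row key. The users table with an info column family is made up for illustration and must already exist.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath.
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         // `users` and its `info` column family are made up for
         // illustration and must already exist.
         Table table = conn.getTable(TableName.valueOf("users"))) {

      // Write one cell: row "alice", column info:city.
      Put put = new Put(Bytes.toBytes("alice"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),
                    Bytes.toBytes("Berlin"));
      table.put(put);

      // Read it back by row key.
      Result result = table.get(new Get(Bytes.toBytes("alice")));
      byte[] city = result.getValue(Bytes.toBytes("info"),
                                    Bytes.toBytes("city"));
      System.out.println(Bytes.toString(city));
    }
  }
}
```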
Mahout
Mahout is a machine learning framework that allows us to run machine learning algorithms on the data stored in Hadoop.
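For a taste of Mahout, below is a minimal sketch using its classic "Taste" collaborative-filtering API to produce user-based recommendations. The ratings.csv file (one userID,itemID,rating triple per line) is an assumption for illustration.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
  public static void main(String[] args) throws Exception {
    // ratings.csv (userID,itemID,rating per line) is assumed to exist.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 item recommendations for user 1.
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}
```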
Flume
The use case for Flume is real-time data streaming. It acts as a messaging pipeline that collects logs or event data from various sources and delivers them to Hadoop or Hive. This makes it useful for real-time analytics and monitoring, as well as for data ingestion.
ZooKeeper
ZooKeeper, introduced by Hunt et al. (2010), provides coordination services for the distributed processes in the cluster, such as configuration management, naming, and synchronization.
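The sketch below shows the basic ZooKeeper client pattern: connect, create a znode holding a small piece of shared state, and read it back. The ensemble address and the znode path are assumptions for illustration.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical ensemble address; 2181 is ZooKeeper's usual default port.
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000,
                                 event -> connected.countDown());
    connected.await(); // wait until the session is established

    // Create a znode holding a small piece of shared configuration.
    zk.create("/demo-config", "v1".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any process in the cluster can now read (and watch) this znode.
    byte[] data = zk.getData("/demo-config", false, null);
    System.out.println(new String(data));

    zk.close();
  }
}
```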
Summary
The functionalities and the components that implement them are the following:
- Storage:
- HDFS, HBase
- Processing:
- MapReduce, Pig, Hive, Spark
- Data ingestion:
- Flume, Sqoop
- Coordination:
- ZooKeeper
- Workflow management:
- Oozie
- Machine Learning:
- Mahout
Two More Properties of Hadoop
Two more important properties of Hadoop are the following:
- Hadoop is a loosely coupled framework:
- In other words, individual components like Hive, Pig, or Mahout can be added or removed without affecting the core functionalities of HDFS and YARN.
- Hadoop is easily integrable:
- Hadoop can be integrated with both big data tools like Spark and traditional technologies like relational databases.
Many of the components that I mentioned above have been replaced by others. For example, Spark is used instead of MapReduce, Kubernetes can be used instead of YARN, Snowflake can be used instead of Hive, Airflow can be used for scheduling, etc. However, the former form the building blocks of the latter.
References
- Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.
- Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 2003.
- Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 2008.
- Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., and Murthy, R. Hive: A Warehousing Solution over a Map-Reduce Framework. Proceedings of the VLDB Endowment, 2009.
- Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. Pig Latin: A Not-So-Foreign Language for Data Processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008.
- Islam, M., Huang, A. K., Battisha, M., Chiang, M., Srinivasan, S., Peters, C., Neumann, A., and Abdelnur, A. Oozie: Towards a Scalable Workflow Management System for Hadoop. In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, 2012.
- Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems (TOCS), 2008.
- Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. ZooKeeper: Wait-Free Coordination for Internet-Scale Systems. In 2010 USENIX Annual Technical Conference (USENIX ATC 10), 2010.