Hadoop

Introduction

Many problems arise when dealing with massive amounts of data. The data can be “massive” in terms of at least one of the 5 V’s: Volume, Velocity, Variety, Veracity, and Value. A very effective way to deal with these problems is to distribute the work across a cluster of machines, i.e., a multi-node setup. Parallelizing across nodes speeds up tasks like storing and processing the data. Hadoop is an open source framework designed to handle such massive amounts of data in a distributed and scalable way. Although Hadoop is less commonly used in industry today, understanding it is crucial for understanding the newer technologies that replaced it, e.g., Spark.

The Hadoop Distributed File System, introduced by Shvachko et al. (2010), was inspired by the Google File System, introduced by Ghemawat et al. (2003). The basic idea is to connect many readily available and inexpensive machines, also known as commodity hardware, together in a cluster to share and process the data. Hadoop provides the framework for doing this.

Properties of Hadoop

Some major properties of Hadoop are the following:

  • Scalability:
    • Hadoop can scale horizontally. In other words, we can, in principle, add as many machines as we like to a Hadoop cluster.
  • Fault tolerance:
    • To deal with faults, i.e., machine failures, Hadoop maintains copies of the data, also known as replicas. Thus, even if a machine fails, the data remains available through its replicas (see the configuration sketch after this list).
  • Distributed processing:
    • Hadoop processes the data where it is stored. Instead of moving the data to where the processing happens, Hadoop moves the processing to where the data is stored. This data locality reduces network traffic and improves the overall efficiency of the system.
  • Cost effectiveness:
    • As mentioned earlier, Hadoop uses commodity hardware which makes it highly cost effective.
  • Open source:
    • Hadoop is free to use and modify, under the Apache license.
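
As a concrete illustration of the fault-tolerance property, here is a minimal Java sketch of how the HDFS replication factor can be set through Hadoop's configuration API. The NameNode address and file path are placeholders for this sketch; in a real cluster, the default replication factor is typically configured in hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            // Default replication factor for newly created files
            // (normally configured cluster-wide in hdfs-site.xml).
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3);

            // Connect to the cluster; the NameNode address is a placeholder.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // The replication factor can also be changed per existing file.
            fs.setReplication(new Path("/data/events.log"), (short) 3);
            fs.close();
        }
    }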

It was largely due to these properties that Hadoop was widely adopted.

Hadoop Ecosystem

The Hadoop ecosystem is a collection of open source projects, tools, and components that work together to store, process, and analyze large amounts of data. Some components were present in Hadoop when it was introduced, while others were added later. The components that I will discuss are shown in Figure 1.

Figure 1: Hadoop components.

Hadoop Distributed File System (HDFS)

HDFS, inspired by the Google File System (GFS), provides distributed, fault-tolerant storage for massive data.
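
To make the storage layer concrete, here is a minimal sketch that writes and reads a file on HDFS through the Java FileSystem API; the NameNode address and file path are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address
            FileSystem fs = FileSystem.get(conf);

            // Write a small file. Under the hood, HDFS splits large files into
            // blocks and replicates each block across DataNodes.
            Path path = new Path("/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("Hello, HDFS!");
            }

            // Read the file back through the same API.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }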

MapReduce

Hadoop MapReduce was inspired by Google's MapReduce, introduced by Dean & Ghemawat (2008). It processes massive data by dividing the overall job into smaller tasks that run in parallel. Today, Spark is more commonly used instead of MapReduce.
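
To see what "dividing the job into smaller parallel tasks" looks like in practice, here is a sketch of the classic word-count job, closely following the standard introductory example: map tasks run in parallel over input splits and emit (word, 1) pairs, and reduce tasks sum the counts per word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same pattern, a stateless map phase followed by an aggregating reduce phase, underlies most MapReduce programs.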

Yet Another Resource Negotiator (YARN)

Before YARN was introduced, MapReduce itself handled all the resources corresponding to a job. YARN decouples resource management from application execution, which also lets frameworks other than MapReduce, such as Spark, share the same cluster.

HDFS, MapReduce, and YARN are the core components and are available by default in every Hadoop installation. The components described below were added later by the broader ecosystem.

Hive

Hive, introduced by Thusoo et al. (2009), is a query engine, not a database. Before Hive, anyone wanting to carry out a preprocessing task on the data had to implement it as a MapReduce job in Java, which was difficult. Hive provides a SQL-like query language, letting us avoid writing Java while retaining the performance characteristics of MapReduce. In other words, Hive abstracts MapReduce by translating SQL queries into MapReduce jobs.
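
As a sketch of how Hive is typically used programmatically, a HiveQL query can be submitted from Java through Hive's JDBC driver; the HiveServer2 address and the page_views table below are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Requires the hive-jdbc driver on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "user", "");
            Statement stmt = conn.createStatement();

            // Hive translates this SQL-like query into MapReduce jobs that
            // run over data stored in HDFS.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS visits FROM page_views "
                    + "GROUP BY page ORDER BY visits DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("visits"));
            }
            conn.close();
        }
    }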

Pig

Pig, introduced by Olston et al. (2008), allows us to work with the data in a high-level scripting language called Pig Latin. Just like Hive, Pig abstracts MapReduce, but through Pig Latin scripts instead of SQL queries.
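
Here is a minimal sketch of the same grouping-and-counting idea expressed in Pig Latin, driven from Java via the PigServer API; the input file and its schema are hypothetical.

    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            // "local" runs Pig on the local machine; "mapreduce" would
            // compile the script into jobs submitted to the cluster.
            PigServer pig = new PigServer("local");

            // Each statement adds one step to the dataflow; Pig compiles the
            // whole flow into MapReduce jobs when results are requested.
            pig.registerQuery("views = LOAD 'page_views.tsv' AS (user:chararray, page:chararray);");
            pig.registerQuery("grouped = GROUP views BY page;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group AS page, COUNT(views) AS n;");
            pig.store("counts", "page_counts_out");
        }
    }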

Sqoop

Traditionally, data is stored in relational databases like Oracle, MySQL, etc. Sqoop facilitates bulk import and export of data between Hadoop and these relational databases.

Oozie

Oozie, introduced by Islam et al. (2012), is a workflow scheduler for Hadoop that helps in scheduling and automating complex, multi-job workflows defined in XML files.

HBase

HBase, inspired by Google's Bigtable, which was introduced by Chang et al. (2008), is a column-oriented NoSQL database that allows real-time reads and writes on top of HDFS.
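
Below is a minimal sketch of a real-time write followed by a point read using the HBase Java client; the table name, column family, and values are hypothetical, and the users table is assumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write one cell: row key "user42", column family "info", column "city".
                Put put = new Put(Bytes.toBytes("user42"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Berlin"));
                table.put(put);

                // Read it back immediately. Unlike plain HDFS files, which are
                // written once and read sequentially, HBase supports
                // low-latency reads and writes of individual rows.
                Result result = table.get(new Get(Bytes.toBytes("user42")));
                byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
                System.out.println(Bytes.toString(city));
            }
        }
    }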

Mahout

Mahout is a machine learning library that provides scalable implementations of algorithms such as clustering, classification, and collaborative filtering on top of Hadoop.

Flume

Flume's use case is real-time data ingestion. It is a distributed service for collecting, aggregating, and moving streaming data, such as logs or event data, from various sources into Hadoop (e.g., HDFS or Hive). This makes it useful for real-time analytics and monitoring as well as for general data ingestion.

ZooKeeper

ZooKeeper, introduced by Hunt et al. (2010), is a coordination service that distributed systems use to maintain consistent state across machines, e.g., for configuration, naming, and leader election. This is critical for ensuring reliability in Hadoop clusters.
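
To illustrate the kind of coordination primitive ZooKeeper provides, here is a minimal sketch in which a process registers itself under an ephemeral znode, a common building block for liveness tracking and leader election; the ensemble address is a placeholder, and the /workers parent znode is assumed to exist.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperExample {
        public static void main(String[] args) throws Exception {
            // The handle connects asynchronously, so wait for the session.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zookeeper:2181", 10000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // An ephemeral znode disappears automatically if this client's
            // session ends, so other processes can detect failures.
            String path = zk.create("/workers/worker-", "alive".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            System.out.println("Registered as " + path);

            // Other processes can read (or watch) this state to coordinate.
            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }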

Summary

The functionalities and the components implementing those functionalities are the following:

  • Storage:
    • HDFS, HBase
  • Processing:
    • MapReduce, Pig, Hive, Spark
  • Data ingestion:
    • Flume, Sqoop
  • Coordination:
    • ZooKeeper
  • Workflow management:
    • Oozie
  • Machine Learning:
    • Mahout

Two More Properties of Hadoop

Two more important properties of Hadoop are the following:

  • Hadoop is a loosely coupled framework:
    • In other words, individual components like Hive, Pig, or Mahout can be added or removed without affecting the core functionalities of HDFS and YARN.
  • Hadoop is easily integrable:
    • Hadoop can be integrated with both big data tools like Spark and traditional technologies like relational databases.

Many of the components mentioned above have since been replaced by newer tools. For example, Spark is used instead of MapReduce, Kubernetes can be used instead of YARN, Snowflake can be used instead of Hive, Airflow can be used for scheduling, etc. However, the older components form the conceptual building blocks of their successors.

References

  1. The Hadoop Distributed File System
    Konstantin Shvachko, Hairong Kuang, Sanjay Radia, et al.
    In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010
  2. The Google File System
    Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
    In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 2003
  3. MapReduce: Simplified Data Processing on Large Clusters
    Jeffrey Dean and Sanjay Ghemawat
    Communications of the ACM, 2008
  4. Hive: A Warehousing Solution over a Map-Reduce Framework
    Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al.
    Proceedings of the VLDB Endowment, 2009
  5. Pig Latin: A Not-So-Foreign Language for Data Processing
    Christopher Olston, Benjamin Reed, Utkarsh Srivastava, et al.
    In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008
  6. Oozie: Towards a Scalable Workflow Management System for Hadoop
    Mohammad Islam, Angelo K. Huang, Mohamed Battisha, et al.
    In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, 2012
  7. Bigtable: A Distributed Storage System for Structured Data
    Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.
    ACM Transactions on Computer Systems (TOCS), 2008
  8. ZooKeeper: Wait-Free Coordination for Internet-Scale Systems
    Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, et al.
    In 2010 USENIX Annual Technical Conference (USENIX ATC 10), 2010


