Hadoop Distributed File System
Introduction
Hadoop Distributed File System (HDFS) is a key component of the Hadoop ecosystem. It is designed to store and manage large-scale datasets across a distributed environment, making it highly scalable and fault-tolerant. Unlike traditional file systems, HDFS is optimized for big data, where files are massive and need to be processed efficiently.
Some Basic Terminologies
Before diving deeper, let us clarify some basic terms that are ubiquitous in Hadoop.
File System
A File System (FS) is a data structure that an operating system (OS) uses to manage files on a storage device. It acts as a layer between the software (OS) and the hardware (e.g., an HDD or SSD). How files are stored depends on which FS is being used. A few examples are the following.
- Linux:
- It uses the ext family of file systems (most commonly ext4), which is efficient for servers and applications that require speed and security.
- Windows:
- Newer Windows computers use NTFS, whereas older ones used the FAT32 FS. These file systems are familiar to most users because Windows exposes them through a graphical user interface.
- Mac OS:
- It uses APFS.
- Hadoop:
- As discussed earlier, this has HDFS. A special property of HDFS is that it is distributed.
The same machine can have multiple FS’s, and data can be exchanged between them. For instance, we can plug a thumb drive formatted with NTFS into a Linux machine whose disk uses ext. We will also see later that multiple Linux machines with Hadoop installed can be connected together for distributed computing. In this case, the machines have ext as well as HDFS.
Block
A block is the smallest unit of data storage in an FS. Whenever a file is stored, it is not stored as a single contiguous unit on the hard drive; it is divided into blocks. This is done for efficient storage and retrieval of the data in the files. Block sizes differ across FS’s. For instance, on a Windows system with NTFS, the block size is 4 KB, so NTFS will store a 10 KB file in 3 blocks. In HDFS, the default block size is 128 MB. The block size is significantly larger in HDFS because it is designed to work with big data.
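To make the block arithmetic concrete, here is a minimal Python sketch (purely illustrative, reusing the file sizes from the examples above) that computes how many blocks a file occupies for a given block size:

```python
import math

def num_blocks(file_size_bytes, block_size_bytes):
    """Number of blocks needed to store a file of the given size."""
    return math.ceil(file_size_bytes / block_size_bytes)

KB, MB = 1024, 1024 * 1024

print(num_blocks(10 * KB, 4 * KB))     # NTFS example: 10 KB file, 4 KB blocks -> 3
print(num_blocks(300 * MB, 128 * MB))  # HDFS example: 300 MB file, 128 MB blocks -> 3
```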
Types of File System
Standalone
In a standalone FS, e.g., ext, NTFS, etc., files are stored and managed on a single machine. Even if multiple Windows computers with NTFS are networked together, each file still lives on one machine only.
Distributed
In a distributed FS, e.g., HDFS, files are shared across multiple machines in a cluster. This makes the system highly scalable, as more machines can be added later to increase the storage capacity.
Clusters and Nodes
Figure 1 makes the difference between a cluster and a node very clear.
Each machine is a node, and a cluster is a collection of interconnected nodes.
Process and Daemon Process
A process is just a program in execution, whereas a daemon process is a background process that runs without any user intervention. In Hadoop, many daemon processes run to make sure that each machine knows its role. This will become clearer as we dive deeper into the architecture.
Metadata
Metadata is data about data. For instance, an image of a dog contains data (in the form of pixels) about that dog. Along with this data, the image will also have metadata like the size of the image, date of creation, date of modification, permissions, file format, etc.
Replication
As the name suggests, replication is making copies of the data to ensure fault tolerance. So, even if the data is deleted, we can restore it using its replica.
Why HDFS?
Consider the analogy of a library and a librarian. The librarian has to manage a library that contains thousands of books. Storing all the books in a single room will lead to space and accessibility issues. A better approach, obviously, is to distribute the books across multiple rooms and have an assistant for each room who knows the location of every book in that room. Further, as a library has multiple copies of the same books, it is a good idea not to store all the copies of a particular book in a single room. That way, every book remains available even if all the books in one room are permanently damaged or inaccessible. This is the essence of HDFS.
- It distributes data across multiple nodes.
- It replicates the data to ensure fault tolerance.
- It provides efficient access for big data processing.
HDFS Architecture
Understanding HDFS is crucial as it is a building block towards other big data distributed storage technologies. We have a cluster of nodes with a Hadoop setup. In distributed computing, we have a master-worker architecture. We need to specify which machine plays the role of the master and which machines play the role of workers (though nowadays this is handled almost automatically when using cloud providers). Once the roles are specified, all the workers are connected to the master rather than to one another. Figure 2 illustrates this.
As the nodes have ext as well as HDFS, both Linux and Hadoop commands will work on them. However, a Linux command runs only on the node where it is issued and operates on that node's local file system, whereas Hadoop commands operate on HDFS, which spans all the nodes together.
Note that in HDFS, the master node is called the NameNode and the worker nodes are called DataNodes. The NameNode is like a central control room that knows where all the data is stored, but it does not store any of the data itself. In other words, the NameNode holds the metadata, which contains information such as which DataNode is storing which blocks. The NameNode keeps this metadata updated using the heartbeat signals that the DataNodes send to it periodically; these signals are how the DataNodes report their status to the NameNode. The updated illustration is shown in Figure 3.
Now, consider a scenario in which a user sends a file \(\texttt{f1.csv}\) of size 300 MB to be saved in this HDFS. Recall that the block size in HDFS is 128 MB. Also, consider that the replication factor is 2. So, this file will first be divided into 3 blocks, say \(\texttt{f1_b1}\) (128 MB), \(\texttt{f1_b2}\) (128 MB), and \(\texttt{f1_b3}\) (44 MB). These blocks are then placed on different DataNodes. Further, as the replication factor is 2, one copy of each of these blocks is also stored on the DataNodes, making sure that no DataNode holds more than one replica of the same block. Finally, the information about which block is stored on which DataNode is passed on to the NameNode via the heartbeat and is stored in its metadata. Figure 4 shows this.
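The following Python sketch mimics this splitting and placement for the example above. It is a toy stand-in, not the actual NameNode placement policy: the round-robin assignment and the DataNode names are assumptions made purely for illustration.

```python
import itertools

def split_into_blocks(file_name, file_size_mb, block_size_mb=128):
    """Split a file into (block_name, size_in_mb) pairs, HDFS-style."""
    blocks, remaining, i = [], file_size_mb, 1
    while remaining > 0:
        size = min(block_size_mb, remaining)
        blocks.append((f"{file_name}_b{i}", size))
        remaining -= size
        i += 1
    return blocks

def place_replicas(blocks, datanodes, replication=2):
    """Toy round-robin placement that never puts two replicas of the
    same block on the same DataNode (not the real NameNode policy)."""
    placement = {dn: [] for dn in datanodes}
    cycle = itertools.cycle(datanodes)
    for block, _size in blocks:
        chosen = set()
        while len(chosen) < replication:
            dn = next(cycle)
            if dn not in chosen:              # one replica per node per block
                placement[dn].append(block)
                chosen.add(dn)
    return placement

blocks = split_into_blocks("f1.csv", 300)     # f1.csv_b1 (128), _b2 (128), _b3 (44)
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"], replication=2)
print(blocks)
print(placement)
```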
Table 1 summarizes the difference between NameNode and DataNode.
Aspect | NameNode | DataNode |
---|---|---|
Definition | The master node in the HDFS architecture. It manages metadata and file system namespace. | The worker nodes in HDFS. They store the actual file data in blocks. |
Purpose | Keeps track of file locations, file metadata, and directory structure (but does not store data). | Responsible for storing, retrieving, and replicating data blocks. |
Role | Coordinator: It directs how files are stored and accessed across DataNodes. | Executor: It performs read and write operations as instructed by the NameNode. |
Data Stored | Metadata (e.g., file names, permissions, and locations of data blocks). | Actual data split into blocks. |
Dependency | HDFS cannot function without a NameNode. Without high availability (discussed later), it is a single point of failure. | HDFS can function with one or more DataNodes, and their failure does not stop the system. |
Communication | Communicates with clients for file operations and with DataNodes to track block statuses. | Communicates with the NameNode for block operations and periodically sends heartbeat signals. |
Fault Tolerance | Does not store redundant copies (replication is for data blocks only). | Handles fault tolerance through block replication across multiple DataNodes. |
Processes | Maintains and updates the file system namespace dynamically. | Regularly reports to the NameNode with block reports and health checks. |
Example Role | Acts like a manager in a library, knowing the exact shelf (DataNode) where a book (block) is stored. | Acts like the shelves in the library, holding and managing the physical books (blocks). |
Table 1: Comparison between NameNode and DataNode in HDFS.
Blocks in HDFS
A block is the smallest unit of storage in HDFS. Each file stored in HDFS is divided into blocks, which are then distributed across multiple DataNodes. As mentioned earlier, the default block size in HDFS is 128 MB, which is huge compared to block sizes in other file systems. A benefit of this large block size is that the NameNode has less metadata to manage: a larger block size means fewer blocks are needed to store a file, which in turn means less information to store as metadata. Hence, a large block size makes storing large files more efficient. This also helps with distributed storage and computation.
It is possible to change the block size in HDFS. However, the default of 128 MB is a sweet spot chosen carefully with big data use cases in mind. For instance, if we bring the block size down to 1 MB, storing a 500 MB file will now take 500 blocks as opposed to 4 blocks when the block size was 128 MB. Further, these 500 blocks will also be replicated due to the replication factor. This results in a huge amount of metadata for the NameNode to handle. On the positive side, having more blocks increases parallelism, which can improve processing efficiency, and a smaller block size also wastes less space inside the blocks. However, the huge amount of metadata may choke the NameNode.
On the other hand, we can also increase the block size, which decreases parallelism and therefore processing efficiency. Say we increase it from 128 MB to 512 MB. A 1 GB file that used to need 8 blocks will now need only 2 blocks. So, although the NameNode now has less metadata to manage, data processing efficiency decreases considerably.
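A quick back-of-the-envelope calculation, sketched in Python below, shows how the block size drives both the number of blocks (the units of parallelism) and the number of replica entries the NameNode must track. The numbers match the examples above; the function itself is just an illustration.

```python
import math

def block_stats(file_size_mb, block_size_mb, replication=3):
    """How many blocks (units of parallelism) a file needs and how many
    replica entries the NameNode must track. Rough illustration only."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {"blocks": blocks, "replica_entries": blocks * replication}

print(block_stats(500, 1))      # 1 MB blocks:   500 blocks, 1500 replica entries
print(block_stats(500, 128))    # 128 MB blocks:   4 blocks,   12 replica entries
print(block_stats(1024, 512))   # 512 MB blocks:   2 blocks,    6 replica entries
```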
Replication Factor in HDFS
DataNodes in HDFS are susceptible to failures. In fact, HDFS is designed so that cheap commodity hardware can be used to store the data. So, it is crucial to make DataNodes fault tolerant such that the data present on a failed DataNode is never lost. This is done by replicating the data (or the data blocks, to be precise) across different DataNodes. The default replication factor is 3, i.e., 3 replicas of a particular block are stored (one is the original and the remaining two are copies). This means that if 1 GB of data is to be stored on HDFS, it will occupy 3 GB of space. Though this introduces data redundancy, it also ensures that the data remains available even if two of the nodes holding a block's replicas fail.
Let us take an example. Say we have 300 MB of data and a total of 4 DataNodes in HDFS. Considering the default block size, this data will require 3 blocks (without replication). We will consider the default replication factor of 3. Figure 5 illustrates how the data is stored in blocks inside each DataNode by ensuring the replication factor of 3 is met.
As we can see, even if one of the DataNodes fails, all the blocks are still recoverable. One more thing to observe is that the blocks are distributed uniformly across multiple DataNodes. A particular DataNode will never contain all the replicas of a particular block, because if that DataNode failed, every copy of that block would be lost at once.
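The Python sketch below checks exactly this property on a toy placement (the layout is made up, roughly in the spirit of Figure 5): every block must survive the loss of any single DataNode, and replicating 300 MB three times costs about 900 MB of raw storage.

```python
def survives_single_failure(placement):
    """Check that every block remains available even if any single
    DataNode is lost -- the property replica placement must guarantee."""
    all_blocks = {b for blocks in placement.values() for b in blocks}
    for failed in placement:
        remaining = {b for dn, blocks in placement.items() if dn != failed for b in blocks}
        if remaining != all_blocks:
            return False
    return True

# A made-up placement of 3 blocks with replication factor 3 on 4 DataNodes.
placement = {
    "dn1": ["b1", "b2"],
    "dn2": ["b1", "b3"],
    "dn3": ["b1", "b2", "b3"],
    "dn4": ["b2", "b3"],
}
print(survives_single_failure(placement))  # True
print(300 * 3, "MB of raw storage for 300 MB of data at replication factor 3")
```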
Rack Awareness in HDFS
In practice, a large number of DataNodes and NameNodes are housed in data centers. DataNodes are mounted in racks, and these racks are kept side by side. All the DataNodes inside the same rack are connected to the same rack switch, whereas DataNodes in different racks communicate with each other through additional network switches. Figure 6 shows a data center with racks and nodes inside each rack.

So, a rack is a collection of DataNodes that are physically close together, often connected to the same network switch. The rack awareness algorithm ensures that the replicas are distributed across DataNodes in different racks, not just across any different DataNodes. Figure 7 shows an illustration of how this is done.
As we can see, the blocks are distributed not only across different DataNodes, but also across DataNodes in different racks. This ensures that even if the DataNodes of a particular rack become inaccessible, all the data blocks are still available, which enhances the fault tolerance of the cluster. Further, this also optimizes network traffic by distributing the load efficiently.
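Below is a simplified Python sketch of a rack-aware placement rule in the spirit of HDFS's default policy (first replica on the local node, second on a node in a different rack, third on another node in that second rack). It is an illustration only, not the real implementation, and the rack and node names are invented.

```python
import random

def rack_aware_placement(racks, local_node):
    """Simplified sketch of a rack-aware placement rule: first replica on
    the local node, second on a node in a different rack, third on another
    node in that second rack. Not the real HDFS implementation."""
    local_rack, _ = local_node
    other_racks = [r for r in racks if r != local_rack]
    second_rack = random.choice(other_racks)
    second, third = random.sample(racks[second_rack], 2)
    return [local_node, (second_rack, second), (second_rack, third)]

racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}
print(rack_aware_placement(racks, local_node=("rack1", "dn1")))
```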
DataNode Failures
Temporary Failure
A temporary failure of a DataNode occurs when the DataNode is unavailable due to a transient or minor issue such as a network outage, a software crash, or a node reboot for maintenance. Consider a scenario in which we have 4 DataNodes and a NameNode. Two types of blocks are stored on the DataNodes with a replication factor of 3. The DataNodes are in constant connection with the NameNode: they periodically send it heartbeat signals and a block report (i.e., information about the blocks stored on that particular DataNode). Both of these are verified by the NameNode against its metadata. This entire scenario is shown in Figure 8.
Now if one of the DataNodes becomes unavailable, then Hadoop handles temporary failure in the following steps:
- Detection: If a DataNode stops sending heartbeat signals to the NameNode, then after a configured timeout (commonly 10.5 minutes) the NameNode marks it as temporarily unavailable.
- Replication: The NameNode, through its metadata, is aware of which blocks were present on the DataNode that has become temporarily unavailable. The NameNode marks these blocks as under-replicated and makes copies of them on the remaining healthy DataNodes so that the replication factor is maintained.
- Recovery: Once the temporarily unavailable DataNode recovers, it resumes sending the heartbeat signal and the block report back to the NameNode. The NameNode then compares this with its metadata and deletes the extra replica of the blocks to maintain the replication factor. The metadata is also dynamically updated reflecting this change to maintain consistency.
All of this happens behind the scenes, i.e., the user or the client will not see any difference in terms of data availability if a DataNode becomes temporarily unavailable. A toy sketch of this detection-and-re-replication loop is given below.
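This Python sketch is a highly simplified model of the NameNode's bookkeeping, assuming the commonly cited 10.5-minute timeout. The class, the data layout, and the node names are invented for illustration and are not the actual Hadoop code.

```python
import time

HEARTBEAT_TIMEOUT = 10.5 * 60  # seconds; the commonly cited default

class NameNodeMonitor:
    """Toy model of how the NameNode notices a silent DataNode and
    re-replicates its blocks. Illustration only, not Hadoop code."""

    def __init__(self, block_map):
        # block_map: {datanode: set of block ids} -- stands in for the metadata
        self.block_map = block_map
        self.last_heartbeat = {dn: time.time() for dn in block_map}

    def receive_heartbeat(self, datanode):
        self.last_heartbeat[datanode] = time.time()

    def check_datanodes(self, now=None):
        now = now if now is not None else time.time()
        for dn, last in self.last_heartbeat.items():
            if now - last > HEARTBEAT_TIMEOUT:
                self.re_replicate(dn)

    def re_replicate(self, failed_dn):
        # Blocks on the silent node are under-replicated: copy each one
        # to a healthy DataNode that does not already hold a replica.
        healthy = [dn for dn in self.block_map if dn != failed_dn]
        for block in self.block_map[failed_dn]:
            target = next(dn for dn in healthy if block not in self.block_map[dn])
            self.block_map[target].add(block)
            print(f"re-replicated {block} onto {target}")

monitor = NameNodeMonitor({"dn1": {"b1", "b2"}, "dn2": {"b1"},
                           "dn3": {"b2"}, "dn4": set()})
# Simulate dn1 going silent while the other DataNodes keep reporting.
later = time.time() + 11 * 60
for dn in ("dn2", "dn3", "dn4"):
    monitor.last_heartbeat[dn] = later
monitor.check_datanodes(now=later)
```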
Permanent Failure
A permanent failure of a DataNode occurs when that DataNode is permanently lost due to, e.g., hardware failure, disk corruption, decommissioning of a DataNode, etc. Again, consider the same scenario as shown in Figure 8. The following steps are followed in case of a permanent failure:
- Detection: The NameNode marks a DataNode as permanently failed (dead) if its temporary-failure state persists beyond the timeout of about 10.5 minutes. The blocks that were present on this DataNode are now considered permanently lost.
- Replication: As the blocks that were on the permanently failed DataNode are now under-replicated, the NameNode creates copies of them on the remaining healthy DataNodes to maintain the replication factor, updates its metadata to reflect this change, and then removes the references to the failed DataNode from the metadata. The metadata is now completely free of the failed DataNode.
- Recovery: If the permanently failed DataNode comes back to life, it is not trusted, and the NameNode treats its blocks as stale. It then removes these blocks (which is not a problem since they were already re-replicated in the previous step) to avoid inconsistency with the metadata, and reintegrates this DataNode into the system as a fresh DataNode (a toy sketch of this reintegration follows the list).
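As a rough illustration of that last step, the Python snippet below shows how a returning node's reported blocks could be discarded as stale and the node re-admitted empty; the function name and data shapes are invented for this example.

```python
def reintegrate_datanode(metadata, returning_dn, block_report):
    """Toy illustration: a DataNode that was declared dead comes back.
    The NameNode no longer has any reference to it, so every block it
    reports is stale and gets discarded; the node rejoins empty."""
    known = metadata.get(returning_dn, set())   # empty: references were purged
    stale = sorted(b for b in block_report if b not in known)
    print(f"deleting stale blocks {stale} from {returning_dn}")
    metadata[returning_dn] = set()              # rejoins as a fresh DataNode
    return metadata

# Metadata after re-replication: the failed dn1 no longer appears in it.
metadata = {"dn2": {"b1", "b3"}, "dn3": {"b1", "b2"}, "dn4": {"b2", "b3"}}
reintegrate_datanode(metadata, "dn1", block_report={"b1", "b2"})
```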
Table 2 summarizes the DataNode failure scenarios.
Aspect | Temporary Failure | Permanent Failure |
---|---|---|
Failure Cause | Network issues, software crashes, or reboots. | Hardware failures, disk corruption, or decommissioning. |
Detection | DataNode stops sending heartbeat signals; marked as temporarily unavailable. | DataNode stops sending heartbeat signals for an extended period; marked as dead. |
Replication Trigger | Blocks on the failed DataNode are marked as "under-replicated," and new replicas are created. | Blocks on the failed DataNode are treated as lost, and new replicas are created. |
Returning DataNode Behavior | Sends a block report to the NameNode. Excess replicas are deleted to restore the replication factor. | Returning DataNode is not trusted. Its blocks are treated as stale and eventually cleaned up. |
Metadata Updates | Metadata dynamically updates to include new replicas and remove excess copies. | Metadata permanently removes blocks from the failed DataNode and updates mappings for new replicas. |
Library Analogy | A locked bookshelf temporarily restricts access. Extra copies are reconciled when it becomes available. | A destroyed bookshelf is permanently removed from the catalog, and its books are replaced elsewhere. |
Table 2: Comparison of temporary and permanent DataNode failures in HDFS.
HDFS makes sure that the following points are satisfied during DataNode failure:
- HDFS handles failures gracefully. In other words, any client or application using the data present in HDFS will not get affected in terms of data availability when a DataNode fails.
- Metadata is the backbone of HDFS. Hence, it has to be kept consistent and up to date with every change in the mapping of blocks to DataNodes.
- The replication factor is strictly maintained. This is crucial because it is this replication factor that makes sure that the data is not lost when a DataNode fails.
NameNode Failure
The NameNode manages the metadata, the backbone of HDFS, as well as the namespace of the file system, so in a basic setup it is a single point of failure. Hence, it is even more crucial to handle its failure. HDFS handles NameNode failure in two ways, discussed below.
Secondary NameNode
The secondary NameNode is a helper node that handles checkpointing and size management for the primary NameNode. To understand this, let us look at the NameNode in more detail. The NameNode stores metadata about the FS, i.e., which files exist, where their blocks are stored (paths), permissions, etc. This metadata is kept in:
- FsImage, which is a snapshot of the FS at a point in time,
- Edit Logs, which are a sequence of changes (like “create file”, “delete file”, etc.) since the last snapshot.
So, whenever the NameNode restarts, the edit logs are applied on top of the FsImage to reconstruct the current state of the FS. However, over time the edit logs grow very large, which slows down NameNode startup because every single change in the log has to be replayed. The secondary NameNode prevents the edit logs from becoming too large by periodically merging them into a new FsImage. This is checkpointing. Now, when the NameNode restarts, this new FsImage is already available from the secondary NameNode, so startup takes very little time. Further, because the secondary NameNode creates these checkpoints periodically, it prevents the edit logs from growing indefinitely. This is how size management is addressed.
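The Python sketch below models checkpointing in the simplest possible terms: the FsImage is just a dictionary of paths, and the edit log is a list of operations that gets folded into a new snapshot. This is a conceptual illustration, not the actual FsImage or edit-log format.

```python
def apply_edits(fsimage, edit_log):
    """Replay edit-log entries on top of an FsImage snapshot.
    Here the FsImage is just a dict of path -> file metadata."""
    image = dict(fsimage)
    for op, path in edit_log:
        if op == "create":
            image[path] = {"replication": 3}
        elif op == "delete":
            image.pop(path, None)
    return image

def checkpoint(fsimage, edit_log):
    """What the secondary NameNode does periodically: fold the edit log
    into a fresh FsImage so the log can start over (conceptual sketch)."""
    return apply_edits(fsimage, edit_log), []   # new snapshot, empty log

fsimage = {"/data/f1.csv": {"replication": 3}}
edit_log = [("create", "/data/f2.csv"), ("delete", "/data/f1.csv")]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(fsimage)   # {'/data/f2.csv': {'replication': 3}}
```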
Note that the secondary NameNode is NOT a failover mechanism. It cannot take the place of the NameNode when the NameNode fails; it is a mechanism to speed up the NameNode's startup after a failure or restart. So, the secondary NameNode is not a high-availability mechanism, just a helper that accelerates the NameNode's starting process. When the NameNode is down, HDFS is down as well. This mechanism therefore works fine if we can afford downtime, but the NameNode is still a single point of failure.
Standby NameNode
The standby NameNode is part of the HDFS High Availability (HA) architecture. It is a hot backup of the active NameNode (the master) that keeps an up-to-date copy of the FS metadata in memory so it can take over immediately if the active NameNode fails. So, the active NameNode is now no longer a single point of failure. The standby NameNode is in constant connection with the active NameNode. So, this is a seamless failover mechanism ensuring no data loss and minimum downtime.
Table 3 summarizes the difference between secondary NameNode and standby NameNode.
Aspect | Secondary NameNode | Standby NameNode |
---|---|---|
Role | Helper for checkpointing and edits log management. | Hot backup for Active NameNode, enabling seamless failover. |
High Availability | Does not provide HA; cannot replace the NameNode. | Ensures HA by taking over the Active NameNode’s role on failure. |
Synchronization | No continuous synchronization; works periodically. | Continuously synchronized with the Active NameNode. |
Failover Capability | Not capable of failover; only provides updated fsimage. | Fully capable of taking over in case of Active NameNode failure. |
Introduced In | Hadoop 1.x | Hadoop 2.x |
Table 3: Comparison between Secondary NameNode and Standby NameNode.
HDFS High Availability (HA) Architecture
The High Availability (HA) architecture of HDFS ensures continuous operation of the Hadoop cluster even if the active NameNode fails. We have already discussed the standby NameNode. The DataNodes send heartbeat signals to both the active and the standby NameNodes so that both stay up to date. The active and the standby NameNodes are connected to each other via the JournalNodes, which maintain a shared edit log that keeps both NameNodes synchronized.
How is a failure of the active NameNode detected so that failover can be triggered? This is done using Zookeeper. A failover controller runs alongside each NameNode, monitors its health, and reports to Zookeeper via heartbeat signals. Zookeeper then coordinates which of the two NameNodes becomes the active one. Further, Zookeeper itself runs as an ensemble of multiple nodes (leader-follower architecture) so that the system also tolerates Zookeeper failures.
Figure 9 shows this architecture.
This architecture ensures high availability and therefore minimal downtime, both of which are critical for a production-grade cluster. Table 4 gives a summary.
Component | Role | Explanation |
---|---|---|
Active NameNode | Main controller managing HDFS namespace and block information. | Handles all client operations like file creation, deletion, and metadata management. |
Standby NameNode | Hot backup of the Active NameNode, ready to take over during failure. | Synchronizes its state with the Active NameNode using edit logs stored in JournalNodes. |
DataNodes | Store actual data blocks and handle data read/write operations. | Send regular block reports to both Active and Standby NameNodes. |
JournalNodes | Maintain a shared edit log that keeps both NameNodes synchronized. | Facilitate consistent updates to the Standby NameNode to ensure it has the latest changes in the HDFS namespace. |
Zookeeper | Coordinates failover and prevents split-brain scenarios. | Monitors NameNode states and triggers leader election during failover. |
FailoverController | Monitors the health of NameNodes and initiates failover. | Works with Zookeeper to perform automatic failover when the Active NameNode is unresponsive. |
Heartbeat Mechanism | Periodic signals exchanged between components to check health. | Ensures timely detection of failures in NameNodes or DataNodes. |
Block Reports | Periodic reports from DataNodes to NameNodes about stored blocks. | Ensures both Active and Standby NameNodes have accurate information about the location of data blocks in the cluster. |
Clients | Users or applications interacting with the Hadoop cluster. | Access HDFS files through the Active NameNode. Failover is transparent to clients, ensuring uninterrupted service. |
Table 4: HDFS high availability components, their roles, and explanations.
To increase availability further, we can also have two standby NameNodes for critical use cases.
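The Python sketch below captures the gist of automatic failover: a coordinator tracks which NameNode reported healthy most recently and promotes the standby when the active goes silent. The real Zookeeper/failover-controller protocol (sessions, fencing, leader election) is considerably more involved; everything here, including the timeout, is an invented simplification.

```python
import time

class FailoverCoordinator:
    """Very rough sketch of coordinated failover: promote the standby
    NameNode when the active one stops reporting healthy in time.
    Not the real Zookeeper/ZKFC protocol."""

    def __init__(self, active, standby, timeout=5.0):
        self.roles = {"active": active, "standby": standby}
        self.timeout = timeout
        self.last_seen = {active: time.time(), standby: time.time()}

    def heartbeat(self, namenode):
        self.last_seen[namenode] = time.time()

    def check_failover(self, now=None):
        now = now if now is not None else time.time()
        active = self.roles["active"]
        if now - self.last_seen[active] > self.timeout:
            # Promote the standby and demote the silent active NameNode.
            self.roles["active"], self.roles["standby"] = (
                self.roles["standby"], self.roles["active"])
            print(f"failover: {self.roles['active']} is now active")
        return self.roles["active"]

coordinator = FailoverCoordinator("nn1", "nn2")
coordinator.heartbeat("nn2")                              # nn1 stays silent
print(coordinator.check_failover(now=time.time() + 10))   # nn2 is promoted
```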
Write Request in HDFS
Say we want to send (or write) a data block “A” to an HDFS cluster. The block is written by following these steps:
- Interact with the NameNode:
- We know that the NameNode has all the metadata that contains information about which blocks are stored in which DataNodes. So, the client sends a write request to the NameNode.
- The NameNode performs some internal checks, e.g., the space available, which DataNodes can be used to write the data, rack awareness, etc.
- Once these checks are satisfied, the NameNode returns to the client the DataNodes in which the data can be written. Say it returns the DataNodes 1, 5, and 8 (i.e., their IP addresses). Note that it returned 3 DataNodes because the default replication factor is 3.
- Moving the data to the DataNodes:
- Once the NameNode returns the DataNodes to which the data can be written, its job is done; it plays no further role in moving the data.
- The client now directly connects with the core switch that is connected to the racks containing the DataNodes.
- The client first will write the block “A” to DataNode 1. Once this write is complete, DataNode 1 forwards the write request to the next DataNode in line, which is DataNode 5. Similarly, DataNode 5 then forwards this write request to the next DataNode in line, which is DataNode 8.
- So the block to be written flows in the order: Client → Core switch → DataNode 1 → DataNode 5 → DataNode 8.
- Write acknowledgment:
- This is the acknowledgment given by the DataNodes to the client that the data block has been written successfully.
- The order in which this acknowledgment is returned is: DataNode 8 → DataNode 5 → DataNode 1 → Core switch → Client.
- Once the client receives this acknowledgment, it informs the NameNode that the write operation was successful.
- The NameNode then updates the metadata making sure that it is consistent with the write operation performed.
These steps are shown in Figure 10.
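A minimal Python simulation of this pipeline is shown below. It only prints the order in which the block and the acknowledgments travel (client, then DataNodes 1, 5, and 8, and back); the node names come from the example above, and the whole thing is an illustration rather than real HDFS client code.

```python
def write_block(block, pipeline):
    """Toy model of the HDFS write pipeline: the client sends the block
    to the first DataNode, each DataNode forwards it to the next, and
    acknowledgments flow back in reverse order."""
    # Data flows forward through the pipeline.
    for sender, receiver in zip(["client"] + pipeline, pipeline):
        print(f"{sender} -> {receiver}: block {block}")
    # Acknowledgments flow back in reverse.
    acks_from = list(reversed(pipeline))
    for sender, receiver in zip(acks_from, acks_from[1:] + ["client"]):
        print(f"{sender} -> {receiver}: ack for block {block}")

# DataNodes returned by the NameNode for a replication factor of 3.
write_block("A", ["DataNode 1", "DataNode 5", "DataNode 8"])
```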
Fault Tolerance and Failure Handling
DataNode Failure During Write
If a DataNode, say DataNode 8, fails to write the data block, the failure shows up in the acknowledgment chain returned to the client. The client informs the NameNode, whose metadata then reflects that the block is under-replicated. The NameNode returns a new DataNode, say DataNode 9, to which the client can write the data block, and the write process is repeated, but only for DataNode 9.
Client Failure During Write
Say the client runs into trouble, such as a network failure, partway through the write. Partially written blocks are marked as corrupt on the DataNodes and are cleaned up by the NameNode during regular maintenance.
Network / Rack Failure
This comes under the DataNode failures that we have already covered, i.e., temporary failure and permanent failure.
NameNode Failure
Again, we have discussed this in the NameNode failure section and the HA section.
DataNode Failure
Again, we have discussed this in DataNode failure section.
Read Request in HDFS
A read request is significantly simpler than a write request, as no acknowledgment pipeline is involved. The steps are the following:
- Interact with the NameNode:
- The client sends a read request to the NameNode.
- The NameNode checks for all the relevant permissions, and returns the IP addresses of the DataNodes that contain the blocks (data) the client wants to read. Say the client sent a read request to read two blocks, and the NameNode returns DataNodes 1, 6, and 11 for the first block; and 3, 5, and 11 for the second block.
- Recall that DataNodes 1, 6, and 11 contain the same first block, and DataNodes 3, 5, and 11 contain the same second block. The NameNode returned 6 DataNodes because the default replication factor is 3; the client does not need to read the data from every copy of a block.
- The client decides which DataNodes to access based on factors such as the network proximity of the DataNodes (and their racks) and access speed. All of this is handled internally.
- Reading the data:
- Once the client decides which DataNodes it wants to access, it reads those particular blocks via the core switch. Say the client decides to access DataNodes 1 and 3. It will first read from DataNode 1 and then from DataNode 3.
- If for some reason the client is not able to access a particular DataNode, it will automatically go to the next DataNode containing the replica of the corresponding block.
These steps are shown in Figure 11.
As we can see, there is no acknowledgment, nor any update of the metadata in the NameNode. Again, these steps take place in the background and are invisible to the client.
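The Python sketch below models a read with replica fallback using the DataNodes from the example above; it is purely illustrative and not the actual HDFS client logic.

```python
def read_file(block_locations, unreachable=()):
    """Toy model of an HDFS read: for each block, try the DataNodes
    holding its replicas in order and fall back to the next replica if
    a DataNode cannot be reached."""
    for block, replicas in block_locations.items():
        for dn in replicas:
            if dn not in unreachable:
                print(f"read {block} from {dn}")
                break
        else:
            print(f"ERROR: no reachable replica for {block}")

# Replica locations returned by the NameNode (replication factor 3).
locations = {
    "block1": ["DataNode 1", "DataNode 6", "DataNode 11"],
    "block2": ["DataNode 3", "DataNode 5", "DataNode 11"],
}
read_file(locations, unreachable={"DataNode 1"})
# block1 falls back to DataNode 6; block2 is read from DataNode 3.
```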