
Saturday, 6 January 2018

HDFS Tutorial

What is File system?
     A file system controls how data is stored and retrieved. There are many different kinds of file systems, each with its own structure and logic, and its own properties of speed, flexibility, security, size and more.

What is Distributed file system?
     File systems that manage the storage across a network of machines are called distributed file systems. A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server. A distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network.

What is HDFS?
  • Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
  • HDFS is a highly reliable storage system and the primary storage layer of Hadoop.
  • HDFS is a distributed file system which provides storage in Hadoop in a distributed fashion.
  • HDFS is designed for storing very large files on clusters of commodity hardware (inexpensive, readily available machines).
  • It is designed on the principle of storing a small number of large files rather than a huge number of small files.
  • HDFS provides redundant storage space for files which are huge in size, in the range of terabytes and petabytes.
  • Files are broken into blocks and distributed across nodes in a cluster. Each block is then replicated, i.e. copies of the block are created on different machines. So if a machine goes down or crashes, we can still retrieve and access our data from another machine. By default, 3 copies of each block are created on different machines, which makes HDFS highly fault-tolerant.
  • It stores data reliably even in the case of hardware failure.
  • HDFS provides a fast file read and write mechanism, as data is stored on different nodes in a cluster and can be accessed in parallel from any machine. This is why HDFS is widely used as a platform for storing a huge volume and variety of data.
  • It provides high-throughput access to application data by providing data access in parallel.
Design of HDFS
     HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

1. Very large files
     “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.

2. Streaming data access
     HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

3. Commodity hardware
     Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure. 

Advantages Of HDFS
1. Distributed Storage
    In HDFS, all of these features are achieved via distributed storage and replication. Suppose you have a Hadoop cluster of ten machines with 1 TB of storage each. When you access HDFS from any of those ten machines, you will feel as if you have logged into a single large machine with a storage capacity of 10 TB (the total storage over ten machines). This means you can store a single large file of 10 TB, which will be distributed over the ten machines (1 TB each). So HDFS is not limited to the physical boundaries of each individual machine.

2. Distributed & Parallel Computation
     When HDFS takes in data, it breaks the information down into separate pieces and distributes them to different nodes in a cluster, allowing for parallel computation. Let's continue with the example above. Suppose it takes 35 minutes to process a 1 TB file on a single machine. How much time will it take to process the same 1 TB file when you have 10 machines of similar configuration in a Hadoop cluster – 35 minutes or 3.5 minutes? 3.5 minutes, right? Each of the nodes works on a part of the 1 TB file in parallel, so the work that took 35 minutes before now finishes in just 3.5 minutes, because it is divided over ten machines.
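The speed-up arithmetic above can be written out explicitly. This is only the ideal case: the numbers (35 minutes, 10 machines) come from the example, and real clusters lose a little to coordination overhead, so the actual speed-up is somewhat less than perfectly linear.

```python
# Ideal-case speed-up from the example above (toy arithmetic, not a benchmark).
single_machine_minutes = 35   # time to process 1 TB on one machine
machines = 10                 # nodes of similar configuration in the cluster

# Assuming the work divides evenly and there is no coordination overhead:
parallel_minutes = single_machine_minutes / machines
print(parallel_minutes)  # 3.5
```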

3. Horizontal Scalability
     There are two types of scaling: vertical and horizontal. In vertical scaling (scale up), you increase the hardware capacity of your system. In other words, you procure more RAM or CPU and add it to your existing system to make it more robust and powerful. But there are challenges associated with vertical scaling, or scaling up:
1. There is always a limit to which you can increase your hardware capacity. So, you can't keep increasing the RAM or CPU of a machine forever.
2. In vertical scaling, you stop your machine first, then increase the RAM or CPU to make the hardware stack more robust, and then restart the machine. This downtime while your system is stopped becomes a challenge.
     In horizontal scaling (scale out), you add more nodes to the existing cluster instead of increasing the hardware capacity of individual machines. Most importantly, you can add more machines on the go, i.e. without stopping the system, so scaling out involves no downtime. At the end of the day, you have more machines working in parallel to meet your requirements.

Features of HDFS
1. Cost

     HDFS is generally deployed on commodity hardware, like the desktop/laptop you use every day. So it is very economical in terms of the cost of ownership of the project. Since we are using low-cost commodity hardware, you don't need to spend a huge amount of money to scale out your Hadoop cluster. In other words, adding more nodes to your HDFS is cost-effective.

2. Variety and Volume of Data
     When we talk about HDFS, we talk about storing huge data, i.e. terabytes and petabytes, and different kinds of data. You can store any type of data in HDFS, be it structured, semi-structured or unstructured.

3. High Throughput
     Throughput is the amount of work done per unit time; it describes how fast you can access data from the file system and gives you an insight into system performance. As you saw in the example above, where we used ten machines collectively, we were able to reduce the processing time from 35 minutes to 3.5 minutes because all the machines were working in parallel. By processing data in parallel, we decreased the processing time tremendously and thus achieved high throughput.

4. Data Reliability
     Data replication is one of the most important and unique features of HDFS, and it is what makes HDFS a reliable data store. HDFS can store data in the range of hundreds of petabytes, reliably, on a cluster of nodes. HDFS divides data into blocks, and these blocks are stored on the nodes of the HDFS cluster. It stores data reliably by creating a replica of each block on several nodes in the cluster, and hence provides fault tolerance. If a node containing data goes down, a user can still access that data from other nodes that hold a copy of it. By default, HDFS creates 3 copies of each block on different nodes, so data remains quickly available to users and data loss is avoided. This makes HDFS highly reliable.

5. Fault Tolerance
Issues in legacy systems
     In legacy systems such as an RDBMS, all the read and write operations performed by users were done on a single machine. If unfavorable conditions arose, like a machine failure, RAM crash, hard-disk failure or power outage, users had to wait until the issue was manually corrected. So at the time of a machine crash or failure, users could not access their data until the machine was recovered and available again. Also, such systems typically stored data only in the range of gigabytes, so to increase storage capacity one had to buy a new server machine. Storing a huge amount of data therefore meant buying many server machines, which made the cost very high.

Fault tolerance achieved in Hadoop
     Fault tolerance in HDFS refers to the working strength of the system in unfavorable conditions and how the system handles such situations. HDFS is highly fault-tolerant: data is divided into blocks, and multiple copies of each block are created on different machines in the cluster (the number of replicas is configurable). So whenever any machine in the cluster goes down, a client can still access the data from another machine that holds a copy of the same blocks. HDFS also maintains the replication factor by creating replicas of blocks on another rack, so if a machine suddenly fails, a user can access the data from slaves in another rack.
One simple Example 
     Suppose there is a user file named FILE. This file is divided into blocks, say B1, B2 and B3, and sent to the master. The master then sends these blocks to slaves, say S1, S2 and S3, and the slaves create replicas of these blocks on the other slaves in the cluster, say S4, S5 and S6. So multiple copies of each block exist on the slaves: say S1 holds B1 and B2, S2 holds B2 and B3, S3 holds B3 and B1, S4 holds B2 and B3, S5 holds B3 and B1, and S6 holds B1 and B2. Now suppose slave S4 crashes for some reason, so the copies of B2 and B3 on S4 become unavailable. We don't have to worry, because we can still get blocks B2 and B3 from another slave such as S2. So even in unfavorable conditions our data is not lost – HDFS is highly fault-tolerant.
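This failure scenario can be simulated in a few lines of Python. The block layout below is the toy placement from the example (slaves S1–S6 holding copies of blocks B1–B3), not anything the real HDFS client exposes:

```python
# Toy simulation of the fault-tolerance example: blocks replicated over six slaves.
placement = {
    "S1": {"B1", "B2"}, "S2": {"B2", "B3"}, "S3": {"B3", "B1"},
    "S4": {"B2", "B3"}, "S5": {"B3", "B1"}, "S6": {"B1", "B2"},
}

def available_blocks(placement, failed):
    """Return the set of blocks still reachable after the nodes in `failed` go down."""
    live = set()
    for slave, blocks in placement.items():
        if slave not in failed:
            live |= blocks
    return live

# Slave S4 crashes: its copies of B2 and B3 are lost, but replicas survive elsewhere.
print(sorted(available_blocks(placement, failed={"S4"})))  # ['B1', 'B2', 'B3']
```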

6. Data Integrity
     Data integrity is about whether the data stored in HDFS is correct or not. HDFS constantly checks the integrity of stored data against its checksums. If it finds a fault, it reports it to the NameNode, which then creates new replicas from the healthy copies and deletes the corrupted ones.
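The checksum idea can be sketched with Python's standard `zlib.crc32` (HDFS itself stores a CRC checksum per chunk of each block; this is only a toy version of that check):

```python
import zlib

# Toy checksum-based integrity check in the spirit of HDFS
# (HDFS stores a CRC per chunk of each block; this is a simplification).
def checksum(data: bytes) -> int:
    return zlib.crc32(data)

block = b"some block data"
stored = checksum(block)              # computed when the block was written

# Later, on read: recompute and compare against the stored checksum.
assert checksum(block) == stored      # data intact

corrupted = block[:-1] + b"X"         # a single flipped byte
print(checksum(corrupted) == stored)  # False -> fault reported to the NameNode
```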

7. Data Locality
     Data locality talks about moving computation (the processing unit) to the data rather than the data to the computation. In traditional systems, we used to bring the data to the application layer and then process it. But now, because of the architecture and the huge volume of the data, moving the data to the application layer would degrade network performance noticeably. So in HDFS, we bring the computation to the DataNodes where the data resides. You are not moving the data; you are bringing the program, or processing part, to the data.

8. Blocks
     Blocks are nothing but the smallest contiguous locations on your hard drive where data is stored. In general, any file system stores data as a collection of blocks.
     Similarly, the Hadoop Distributed File System (HDFS) also stores data in terms of blocks. However, the block size in HDFS is very large compared with other file systems, because Hadoop works best with very large files. The default HDFS block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x. Files are split into 64 MB/128 MB blocks and then stored in the Hadoop filesystem. This value is configurable: using the "dfs.block.size" parameter (called "dfs.blocksize" in newer releases) in the hdfs-site.xml file, you can set your own block size. HDFS is responsible for distributing the data blocks across multiple nodes.

We have to give block size in bytes in hdfs-site.xml
Example,
<property>
   <name>dfs.block.size</name>
   <value>134217728</value> <!-- 134217728 bytes = 128 MB -->
</property>
     If the remaining data is less than the block size, the block occupies only the size of the data. For example, if your file size is 400 MB and we use the default block size of 128 MB, then 4 blocks are created: the first 3 blocks are 128 MB each, but the last block is only 16 MB instead of 128 MB.
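The 400 MB example can be computed explicitly. A minimal sketch (sizes in MB for readability; real HDFS works in bytes):

```python
# Split a file of file_mb megabytes into HDFS-style blocks of block_mb megabytes.
def block_sizes(file_mb, block_mb=128):
    full, last = divmod(file_mb, block_mb)
    # The last block only occupies the leftover data, not a full block.
    return [block_mb] * full + ([last] if last else [])

print(block_sizes(400))  # [128, 128, 128, 16]
```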

Why HDFS Blocks are Large in Size?
     Whenever we talk about HDFS, we talk about huge data sets, i.e. terabytes and petabytes of data. If we had a block size of, say, 4 KB, as in a Linux file system, we would have far too many blocks and therefore far too much metadata. Managing that number of blocks and that much metadata would create huge overhead, which is something we don't want.
     The main reason for the large HDFS block size is to reduce the cost of disk seeks, which are expensive operations. Since Hadoop is designed to scan entire datasets, it is best to minimize seeks by using large blocks. A typical seek time is 10 ms and a typical disk transfer rate is 100 MB/s. To make the seek time 1% of the transfer time, the block size should be around 100 MB. Hence, to reduce the cost of disk seeks, the default HDFS block size is 64 MB/128 MB.
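The back-of-the-envelope calculation above, using the typical figures quoted (10 ms seek, 100 MB/s transfer):

```python
# Block size at which a disk seek costs only 1% of the transfer time.
seek_ms = 10               # typical disk seek time
transfer_rate_mb_s = 100   # typical disk transfer rate

# For the seek to be 1% of the transfer time, the transfer must take
# 100x the seek time: 100 * 10 ms = 1 s; at 100 MB/s that moves 100 MB.
transfer_time_s = 100 * seek_ms / 1000
block_mb = transfer_time_s * transfer_rate_mb_s
print(block_mb)  # 100.0 -> hence defaults of 64/128 MB
```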

What should be the block size of Hadoop cluster?
     An ideal Data Block size is based on several factors
1. Size of Cluster
2. Size of the input file
3. Map task capacity 
     Actually, there is no fixed rule that the block size must be 64 MB/128 MB/256 MB; it usually depends on the input data. If you want to maximize throughput for a very large input file, using very large blocks (perhaps 128 MB or even 256 MB) is best. On the other hand, for smaller files a smaller block size is better: larger files, larger blocks; smaller files, smaller blocks.
     In general, 128 MB is the recommended size. For very large data sets that require enormous clusters of thousands of machines, 256 MB is recommended.

Advantages of HDFS Blocks
1. The blocks are of fixed size, so it is very easy to calculate the number of blocks that can be stored on a disk.
2. The HDFS block concept simplifies storage for the DataNodes. A DataNode doesn't need to concern itself with block metadata like file permissions; the NameNode maintains the metadata of all the blocks.
3. If the size of the file is less than the HDFS block size, then the file does not occupy the complete block storage.
4. As the file is chunked into blocks, it is easy to store a file that is larger than the disk size as the data blocks are distributed and stored on multiple nodes in a Hadoop cluster.
5. Blocks are easy to replicate between the datanodes and thus provide fault tolerance and high availability of HDFS. 

9. Replication
    
Replication factor is nothing but the number of copies of each HDFS block kept in the Hadoop cluster. Using the replication factor, we achieve fault tolerance and high availability. Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components. Similarly, in Hadoop, when one DataNode is down, clients can access data without any interruption or loss, because we have a copy of the same data on another DataNode.
     The default replication factor is 3 which can be configured as per our requirement. It can be changed to 2 (less than 3) or can be increased (more than 3).
     Using "dfs.replication" in hdfs-site.xml file we can configure replication factor. 
Example,
<property>
   <name>dfs.replication</name>
   <value>3</value>
</property>
     The ideal replication factor is considered to be 3, for the following reasons.
To make HDFS fault-tolerant we have to consider 2 cases:
1. DataNode failure
2. Rack failure
      If one DataNode goes down, another copy is available on another DataNode. Even if that second DataNode also goes down, the data is still highly available, because we still have one copy of it stored on a different DataNode.
     HDFS stores at least 1 replica on another rack, so in the case of an entire rack failure, you can get the data from a DataNode in another rack. The replication factor also interacts with rack awareness, which states the rule "no more than one copy on the same node and no more than 2 copies in the same rack"; with this rule, data remains highly available when the replication factor is 3. So, to make HDFS fault-tolerant, the ideal replication factor is considered to be 3.
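The rack-awareness rule quoted above can be checked mechanically. A minimal sketch, with hypothetical node and rack names chosen for illustration:

```python
from collections import Counter

# Validate one block's replica placement against the rack-awareness rule:
# no more than one copy per node, no more than two copies per rack.
def valid_placement(replicas):
    """replicas: list of (node, rack) pairs for one block."""
    per_node = Counter(node for node, _ in replicas)
    per_rack = Counter(rack for _, rack in replicas)
    return all(c <= 1 for c in per_node.values()) and \
           all(c <= 2 for c in per_rack.values())

good = [("n1", "rack1"), ("n2", "rack1"), ("n3", "rack2")]  # 2 + 1 across racks
bad  = [("n1", "rack1"), ("n2", "rack1"), ("n3", "rack1")]  # all 3 in one rack
print(valid_placement(good), valid_placement(bad))  # True False
```

With replication factor 3 and this rule, a whole-rack failure can take out at most two replicas, so one copy always survives.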

10. High Availability
Issues in legacy systems
1. Data unavailable due to the crashing of a machine.
2. Users have to wait for a long time to access their data, sometimes until the service comes back up.
3. Due to unavailability of data, completion of many major projects at organizations gets extended for a long period of time and hence companies have to go through critical situations.
4. Limited features and functionalities.

High Availability achieved in Hadoop
     HDFS is a highly available file system: data is replicated among the nodes of the HDFS cluster by creating replicas of blocks on other slaves in the cluster. So whenever users want to access their data, they can read it from a slave that holds its blocks on the nearest node in the cluster. In unfavorable situations, like the failure of a node, users can easily access their data from other nodes, because duplicate copies of their blocks were created on other nodes in the HDFS cluster.

One simple Example
     Whenever a user requests data access, the NameNode looks up all the nodes on which that data is available and gives the user access from the node where the data is most quickly available. While searching over the nodes in the cluster, if the NameNode finds a node to be dead, it redirects the user, without their knowledge, to another node holding the same data. So data is made available to the user without interruption: even when nodes fail, data remains highly available, and applications are not affected by the failure of any individual node.

HDFS Nodes
     An HDFS cluster has 2 types of nodes operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (workers). The master node runs the storage-management services of the distributed Hadoop cluster, while the actual data in HDFS is stored, and processed, on the slave nodes.
1. Master node
     The master node is also called the NameNode. As the name suggests, this node manages all the slave nodes and assigns work to them. The master is the centerpiece of HDFS: it stores the metadata of HDFS, i.e. all the information related to the files stored in HDFS, including where across the cluster the file data is kept and the details of the blocks and their locations for all files. The master is the most critical part of HDFS: if the master crashes or goes down, the HDFS cluster is considered down and becomes unusable. Two files, the 'namespace image' and the 'edit log', are used to store the metadata.

2. Slave node
      The slave nodes are also called DataNodes. DataNodes are the slaves, deployed on each machine, that provide the actual storage. They are the actual worker nodes, responsible for serving read and write requests from clients. They also perform block creation, deletion and replication upon instruction from the NameNode, and they can be deployed on commodity hardware. If any slave node goes down, the NameNode automatically replicates the blocks that were present on that DataNode to other nodes in the cluster.
HDFS Daemons
     In Hadoop HDFS there are 2 daemons. All the daemons run on their own JVMs in the background to support required services.
1. NameNode
  • The NameNode daemon runs on the master node.
  • The NameNode manages the HDFS filesystem namespace. It stores metadata like file names, paths, number of blocks, block IDs, block locations, number of replicas, slave-related configuration, etc.
  • The NameNode keeps a record of the changes made to the file system namespace.
  • The NameNode keeps the metadata in memory for fast retrieval.
2. DataNode
  • DataNode daemon runs on all the slaves. 
  • The function of DataNode is to store the actual data in the HDFS. It contains the actual data blocks. HDFS cluster usually has more than one DataNodes. 
  • Data is replicated across the other machines present in the HDFS cluster. 
Data storage in HDFS 
     Whenever any file has to be written to HDFS, it is broken into small pieces of data known as blocks. HDFS has a default block size of 64 MB (Hadoop 1.x) or 128 MB (Hadoop 2.x), which can be changed as per requirements. These blocks are stored in a distributed manner on different nodes of the cluster. This provides a mechanism for MapReduce to process the data in parallel in the cluster.
     Multiple copies of each block are stored across the cluster on different nodes. This is a replication of data. By default, HDFS has a replication factor of 3. It provides fault tolerance, reliability, and high availability.
     A Large file is split into n number of small blocks. These blocks are stored at different nodes in the cluster in a distributed manner. Each block is replicated and stored across different nodes in the cluster.

HDFS Architecture 

     Apache HDFS (Hadoop Distributed File System) is a block-structured file system where each file is divided into blocks of a pre-determined size. These blocks are stored across a cluster of one or several machines. The Apache Hadoop HDFS architecture follows a master/slave pattern, where a cluster consists of a single NameNode (master node) and all the other nodes are DataNodes (slave nodes). Though one can run several DataNodes on a single machine, in practice these DataNodes are spread across various machines.

1. HDFS NameNode

      The NameNode is the master node in the Apache Hadoop HDFS architecture that maintains and manages the blocks present on the DataNodes (slave nodes). It is the server that manages the file system namespace and controls access to files by clients, and it must be kept reliable, since the whole cluster depends on it. The HDFS architecture is built in such a way that user data never resides on the NameNode; the data resides on the DataNodes only.
  • NameNode is the main central component of HDFS architecture framework. 
  • NameNode is also known as Master node. 
  • The HDFS NameNode stores metadata, i.e. the number of data blocks, file names, paths, block IDs, block locations, number of replicas, and also slave-related configuration. This metadata is kept in memory on the master for faster retrieval of data.
  • Because the NameNode keeps the file system namespace metadata in memory for quicker response times, more memory is needed, and the NameNode should be deployed on reliable, well-provisioned hardware.
  • NameNode maintains and manages the slave nodes, and assigns tasks to them. 
  • NameNode has knowledge of all the DataNodes containing data blocks for a given file.
  • NameNode  coordinates with hundreds or thousands of data nodes and serves the requests coming from client applications.
  • Two files 'FSImage' and the 'EditLog' are used to store metadata information.
    • FsImage: It is the snapshot of the file system when the NameNode is started – an "image file". The FsImage contains the entire filesystem namespace and is stored as a file in the NameNode's local file system. It contains a serialized form of all the directories and file inodes in the filesystem, where each inode is an internal representation of a file's or directory's metadata.
    • EditLogs: They contain all the recent modifications made to the file system since the most recent FsImage. When the NameNode receives a create/update/delete request from a client, the request is first recorded in the edits file.
Functions of NameNode
  • It is the master daemon that maintains and manages the DataNodes (slave nodes).
  • It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, the size of the files, permissions, hierarchy, etc. The FsImage and EditLogs files are associated with this metadata.
  • It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
  • It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live.
  • It keeps a record of all the blocks in HDFS and in which nodes these blocks are located.
  • The NameNode is also responsible to take care of the replication factor of all the blocks.
  • In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
2. HDFS DataNode
     DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity hardware, that is, a non-expensive system of ordinary quality. A DataNode is a block server that stores the data in a local file system such as ext3 or ext4.

  • DataNode is also known as Slave node. 
  • In Hadoop HDFS Architecture, DataNode stores actual data in HDFS. 
  • DataNodes are responsible for serving read and write requests from clients.
  • DataNodes can be deployed on commodity hardware.
  • DataNodes send information to the NameNode about the files and blocks stored on the node and respond to the NameNode for all filesystem operations.
  • When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
  • A DataNode is usually configured with a lot of hard disk space, because the actual data is stored on the DataNodes.
Functions of DataNode in HDFS
  • These are the slave daemons or processes that run on each slave machine.
  • The actual data is stored on DataNodes.
  • The DataNodes perform the low-level read and write requests from the file system’s clients.
  • Every DataNode sends a heartbeat message to the NameNode every 3 seconds to convey that it is alive. If the NameNode does not receive a heartbeat from a DataNode for 10 minutes, it considers that DataNode dead and starts replicating its blocks on other DataNodes.
  • All DataNodes in the Hadoop cluster are synchronized so that they can communicate with one another and ensure
    • balancing of the data in the system,
    • movement of data to maintain high replication, and
    • copying of data when required.
3. HDFS Secondary NameNode
     The NameNode holds the metadata for HDFS, like namespace information, block information, etc. When in use, all of this information is kept in main memory, but it is also stored on disk for persistence.

On disk, the NameNode stores two different files:
1. fsimage – the snapshot of the filesystem when the NameNode started
2. edit logs – the sequence of changes made to the filesystem after the NameNode started
     While the NameNode is in the active state, the edit log grows continuously. The edit logs are applied to the fsimage only when the NameNode restarts, to produce the latest snapshot of the file system. But NameNode restarts are rare in production clusters, which means the edit logs can grow very large on clusters where the NameNode runs for a long period of time. In this situation we encounter the following issues.

  • The edit log becomes very large, which is challenging to manage.
  • A NameNode restart takes a long time, because a lot of changes have to be merged.
  • In the case of a crash, we will lose a huge amount of metadata, since the fsimage is very old.
     So to overcome these issues we need a mechanism that keeps the edit log at a manageable size and keeps an up-to-date fsimage, so that the load on the NameNode is reduced. It is very similar to a Windows restore point, which lets us take a snapshot of the OS so that if something goes wrong, we can fall back to the last restore point.
Secondary NameNode
     The Secondary NameNode helps to overcome the above issues by taking over the responsibility of merging the edit logs with the fsimage from the NameNode.

     The Secondary NameNode also keeps a namespace image and edit logs, like the NameNode. At regular intervals (one hour by default) it copies the edit logs from the NameNode, merges them into its copy of the namespace image, and copies the result back to the NameNode, so that the NameNode has a fresh namespace image. Now suppose at some instant the NameNode goes down and becomes corrupt; we can then restart another machine with the namespace image and the edit log that the Secondary NameNode holds, and a total failure can be prevented.
     The Secondary NameNode takes almost the same amount of memory and CPU as the NameNode, so it is also kept on a separate machine, like the NameNode.

Functions of Secondary NameNode

  • The Secondary NameNode regularly obtains the file system metadata that the NameNode holds in memory and persists it to disk.
  • It is responsible for combining the EditLogs with the FsImage from the NameNode.
  • It downloads the EditLogs from the NameNode at regular intervals and applies them to its FsImage. The new FsImage is copied back to the NameNode and is used the next time the NameNode starts.
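The checkpoint operation can be modeled as folding a log of changes into a snapshot. A toy sketch (real HDFS operates on binary fsimage/edits files, not Python dicts; the operation names here are illustrative):

```python
# Toy checkpoint: fold an edit log into an fsimage snapshot, as the
# Secondary NameNode's merge does conceptually.
def apply_edits(fsimage, edit_log):
    """fsimage: {path: metadata}; edit_log: list of (op, path, metadata) tuples."""
    image = dict(fsimage)            # work on a copy of the snapshot
    for op, path, meta in edit_log:
        if op in ("create", "update"):
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)
    return image

fsimage = {"/a.txt": {"blocks": 2}}
edits = [("create", "/b.txt", {"blocks": 1}), ("delete", "/a.txt", None)]
new_fsimage = apply_edits(fsimage, edits)
print(new_fsimage)  # {'/b.txt': {'blocks': 1}} -- edits folded in; the log can now be truncated
```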
     Hence, the Secondary NameNode performs regular checkpoints in HDFS, and is therefore also called the CheckpointNode. Things have changed over the years, especially with Hadoop 2.x: the NameNode is now highly available with a failover feature. The Secondary NameNode is now optional, and a Standby NameNode is used for the failover process instead. The Standby NameNode stays up to date with all the file system changes the Active NameNode makes.
     In Hadoop 2.x. there are two types of NameNodes in a cluster

1. Active NameNode
2. StandBy NameNode
     The Active NameNode is responsible for all client operations in the cluster, while the Standby NameNode simply acts as a slave. A lone Active NameNode is a single point of failure for the HDFS cluster: when it goes down, the file system goes offline. The optional Standby NameNode can be hosted on a separate machine, so that when the Active NameNode goes down, the Standby NameNode takes over as the Active NameNode and the file system stays available. This process is called failover.
Note 

1. Heartbeat
     A heartbeat is the signal that your DataNodes send to the NameNode periodically.
What is the need of Heartbeat?
     In Hadoop, the NameNode and the DataNodes are physically separate machines, so how does the NameNode know that a particular DataNode is down? To keep the NameNode aware of the status, the following happens:
  • The NameNode and DataNodes communicate using heartbeats.
  • Each DataNode sends a heartbeat to the NameNode every 3 seconds by default to indicate its presence, i.e. that it is alive.
  • If the NameNode doesn't receive a heartbeat from a DataNode for a particular amount of time (10 minutes by default), the NameNode considers that DataNode/slave dead.
  • You can set the heartbeat interval in the hdfs-site.xml file by configuring the parameter dfs.heartbeat.interval.
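The dead-node check described above can be sketched as a small bookkeeping routine. A toy model using the intervals from the text (3-second heartbeats, dead after 10 minutes); the node names and timestamps are made up for illustration:

```python
# Toy dead-node detection in the spirit of the NameNode's heartbeat tracking.
HEARTBEAT_INTERVAL_S = 3       # DataNodes report every 3 seconds
DEAD_AFTER_S = 10 * 60         # considered dead after 10 minutes of silence

last_heartbeat = {}            # datanode -> timestamp of its last heartbeat

def record_heartbeat(node, now):
    last_heartbeat[node] = now

def dead_nodes(now):
    """Nodes whose last heartbeat is older than the dead-node threshold."""
    return [n for n, t in last_heartbeat.items() if now - t > DEAD_AFTER_S]

record_heartbeat("dn1", now=0)
record_heartbeat("dn2", now=0)
record_heartbeat("dn1", now=500)   # dn1 keeps reporting; dn2 goes silent
print(dead_nodes(now=700))          # ['dn2'] -> its blocks get re-replicated
```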
2. Balancing
     If one DataNode is down, some blocks become under-replicated: the configured replication factor is 3, but those blocks now have fewer than 3 replicas, which is not the desired condition. The NameNode detects this through heartbeats and block reports, checks in its metadata which blocks were on the failed node, and automatically instructs DataNodes holding surviving replicas to copy those blocks to other DataNodes until every block is back at the configured replication. (For example, if the failed DataNode held only B1 and B4, only those blocks are re-replicated.)
     If the failed DataNode later comes back up and runs fine, those blocks become over-replicated, which is again not the desired condition. The NameNode then instructs DataNodes to delete one copy of each excess replica. The deleted copy is not necessarily the newly created one; the NameNode tends to remove replicas that are not being accessed by clients.
     Separately from replication, data can become unevenly spread across DataNodes over time, so the cluster can be balanced. The HDFS balancer redistributes blocks to even out disk usage; it does not run in the background and has to be run manually:
                                        hdfs balancer [-threshold <threshold>]

3. Replication
     Replication is covered in detail under "Features of HDFS" above: the default replication factor is 3, configurable via "dfs.replication" in hdfs-site.xml, and with the rack-awareness rule ("no more than one copy on the same node and no more than 2 copies in the same rack") a factor of 3 keeps data highly available through both DataNode and rack failures.


Next Tutorial : HDFS Read/ Write Architecture

Previous Tutorial : Hadoop Multinode Cluster Setup
