Advantages Of HDFS
1. Distributed Storage
In HDFS all the features are achieved via distributed storage and replication. When you access Hadoop Distributed file system from any of the ten machines in the Hadoop cluster, you will feel as if you have logged into a single large machine which has a storage capacity of 10 TB (total storage over ten machines). What does it mean? It means that you can store a single large file of 10 TB which will be distributed over the ten machines (1 TB each). So, it is not limited to the physical boundaries of each individual machine.
2. Distributed & Parallel Computation
When HDFS takes in data, it breaks the information down into separate pieces and distributes them to different nodes in a cluster, allowing for Parallel Computation. Let’s understand this concept by the above example. Suppose, it takes 35 minutes to process 1 TB file on a single machine. So, now tell me, how much time will it take to process the same 1 TB file when you have 10 machines in a Hadoop cluster with similar configuration – 35 minutes or 3.5 minutes? 3.5 minutes, Right? What happened here? Each of the nodes is working with a part of the 1 TB file in parallel. Therefore, the work which was taking 35 minutes before, gets finished in just 3.5 minutes now as the work got divided over ten machines.
3. Horizontal Scalability
There are two types of scaling: vertical and horizontal. In vertical scaling (scale up), you increase the hardware capacity of your system. In other words, you procure more RAM or CPU and add it to your existing system to make it more robust and powerful. But there are challenges associated with vertical scaling or scaling up
1. There is always a limit to which you can increase your hardware capacity. So, you can’t keep on increasing the RAM or CPU of the machine.
2. In vertical scaling, you stop your machine first. Then you increase the RAM or CPU to make it a more robust hardware stack. After you have increased your hardware capacity, you restart the machine. This down time when you are stopping your system becomes a challenge.
In case of horizontal scaling (scale out), you add more nodes to existing cluster instead of increasing the hardware capacity of individual machines. And most importantly, you can add more machines on the go i.e. Without stopping the system. Therefore, while scaling out we don’t have any down time or green zone, nothing of such sort. At the end of the day, you will have more machines working in parallel to meet your requirements.