InputSplit vs Block
HDFS splits large files into chunks known as blocks. A block is a contiguous location on the hard drive where data is stored. In general, a file system stores data as a collection of blocks. In the same way, HDFS stores each file as blocks and distributes those blocks across multiple nodes in the cluster.
InputSplit in Hadoop
The data to be processed by an individual Mapper is represented by an InputSplit. The split is divided into records, and each record (a key-value pair) is processed by the map function. The number of map tasks is equal to the number of InputSplits. Initially, the data for a MapReduce job is stored in input files, which typically reside in HDFS. InputFormat defines how these input files are split and read, and it is responsible for creating the InputSplits.
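To make the idea of "a split divided into records" concrete, here is a minimal Python sketch loosely modelled on what a line-oriented format such as TextInputFormat does: the key is the byte offset of a line and the value is the line's text. The function name `read_records` and the in-memory `data` buffer are assumptions made for illustration; the real reader streams from HDFS. The sketch also assumes the usual convention that a split skips the partial line at its start and reads past its end to finish its last line.

```python
def read_records(data: bytes, start: int, length: int):
    """Yield (byte_offset, line) records for the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: the line we land in is emitted by the
        # previous split's reader, so skip to just after its newline.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    # A line that starts at or before `end` belongs to this split,
    # even if its bytes run past the end of the split.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


data = b"alpha\nbravo\ncharlie\ndelta\n"          # 26 bytes
splits = [(0, 10), (10, 10), (20, 6)]             # three hypothetical splits
records = [list(read_records(data, s, n)) for s, n in splits]
# [[(0, 'alpha'), (6, 'bravo')], [(12, 'charlie'), (20, 'delta')], []]
```

Each line is emitted exactly once, even though "bravo" and "delta" cross or touch a split boundary. With one map task per split, this input would be handled by three map tasks, one of which receives no records.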
Comparison Between InputSplit and Block in Hadoop
1. Size
Block – By default, the HDFS block size is 128 MB, which you can change as per your requirement. All blocks of a file are the same size except the last one, which can be either the same size or smaller. HDFS breaks files into 128 MB blocks and then stores them in the Hadoop file system.
InputSplit – By default, the InputSplit size is approximately equal to the block size, but it is user-defined: in a MapReduce program, the user can control the split size based on the size of the data.
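As a rough illustration of how the split size follows the block size unless the user overrides it, the sketch below mirrors the rule used by Hadoop's FileInputFormat: the split size is the block size clamped between a minimum and a maximum. In a real job these bounds come from the properties `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`; the Python function and its parameter names are illustrative assumptions, not Hadoop's API.

```python
MB = 1024 * 1024


def compute_split_size(block_size: int,
                       min_size: int = 1,
                       max_size: int = 2**63 - 1) -> int:
    """Clamp the block size between the configured minimum and maximum."""
    return max(min_size, min(max_size, block_size))


default_split = compute_split_size(128 * MB)                     # follows the block size
smaller_split = compute_split_size(128 * MB, max_size=64 * MB)   # more, smaller splits
larger_split = compute_split_size(128 * MB, min_size=256 * MB)   # fewer, larger splits
```

Lowering the maximum below the block size yields more map tasks per file, while raising the minimum above it yields fewer.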
2. Data Representation
Block – An HDFS block is the physical representation of data in Hadoop. It is the minimum amount of data that can be read or written.
InputSplit – A MapReduce InputSplit is the logical representation of the data present in the blocks. It is used during data processing in a MapReduce program or other processing techniques. The key point is that an InputSplit doesn’t contain the actual data; it is just a reference to the data.
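The difference between a reference and the data itself can be sketched in a few lines of Python. Hadoop's FileSplit describes a split by a file path, a start offset, a length and the hosts that hold the data; the `SplitRef` class below is a simplified, hypothetical stand-in that keeps only the first three and works on a local temporary file rather than HDFS.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitRef:
    """A reference to a byte range of a file; it holds no file content."""
    path: str
    start: int
    length: int

    def read(self) -> bytes:
        # The bytes are fetched only when a reader asks for them.
        with open(self.path, "rb") as f:
            f.seek(self.start)
            return f.read(self.length)


# Write a small stand-in "input file" to disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"0123456789abcdefghij")
    path = f.name

split = SplitRef(path, start=10, length=5)   # just three small fields
payload = split.read()                       # b"abcde"
os.remove(path)
```

The object stays the same size no matter how large the byte range it points to, which is why splits are cheap to create and pass around while the data stays where it is stored.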
3. Example of Block vs InputSplit in Hadoop
Suppose we need to store a file in HDFS. HDFS stores files as blocks. A block is the smallest unit of data that can be stored on or retrieved from the disk, and the default block size is 128 MB. HDFS breaks files into blocks and stores these blocks on different nodes in the cluster. Suppose we have a file of 130 MB, so HDFS will break this file into 2 blocks: one of 128 MB and one of 2 MB.
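The arithmetic behind this example can be checked with a short Python sketch. The helper `block_layout` is a hypothetical name used only here; it simply cuts a file size into fixed-size pieces and lets the last piece be smaller.

```python
MB = 1024 * 1024


def block_layout(file_size: int, block_size: int = 128 * MB):
    """Return the (offset, length) of every block of a file of the given size."""
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


layout = block_layout(130 * MB)
sizes_in_mb = [length // MB for _, length in layout]   # [128, 2]
```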
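Now, if MapReduce worked directly on the blocks, the records would not be processed correctly, because a record that starts at the end of the first block may continue into the second block, leaving both blocks with an incomplete record. This problem is solved by InputSplit. An InputSplit is only a logical range of the file (a start offset and a length), so it is not bound to block boundaries: it can extend past the end of one block into the next, and the reader can use the byte offset to fetch the data needed to complete a record. In this way the two blocks can be processed as a single logical unit.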
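To see how the 130 MB file can end up as one logical unit, the following Python sketch imitates the split loop of Hadoop's FileInputFormat as we understand it: full-size splits are cut off only while the remaining data is more than 1.1 times the split size (the SPLIT_SLOP constant), and whatever is left becomes the final split. The function name `compute_splits` is an assumption for illustration, and details such as non-splittable compressed files are ignored.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1   # slack factor: a small tail is merged into the last split


def compute_splits(file_size: int, split_size: int = 128 * MB):
    """Return the (start, length) of each logical split of a file."""
    splits = []
    remaining = file_size
    # Cut off full-size splits while clearly more than one split remains.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_size - remaining, split_size))
        remaining -= split_size
    # The rest, including a small tail, forms the final split.
    if remaining != 0:
        splits.append((file_size - remaining, remaining))
    return splits


one_split = compute_splits(130 * MB)      # [(0, 130 MB)]: spans both blocks
three_splits = compute_splits(300 * MB)   # 128 MB, 128 MB and 44 MB
```

Under these assumptions the 130 MB file is stored as two blocks but read as a single split by a single map task, so no mapper is started just for the 2 MB tail.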