Wednesday, 11 April 2018

InputSplit vs Block

HDFS Block
     Hadoop HDFS splits large files into small chunks known as blocks. A block is a contiguous location on the hard drive where data is stored. In general, a file system stores data as a collection of blocks; in the same way, HDFS stores each file as a sequence of blocks. The HDFS framework is responsible for distributing the blocks across multiple nodes in the cluster.
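
As an illustration, the Hadoop FileSystem API can be used to inspect how HDFS has physically laid out a file into blocks. This is a minimal sketch, assuming a reachable HDFS cluster; the path /data/sample.txt is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // "/data/sample.txt" is a placeholder path for illustration.
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
        System.out.println("Block size: " + status.getBlockSize() + " bytes");

        // Each BlockLocation reports the offset and length of one block,
        // plus the DataNodes holding a replica of it.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " length=" + loc.getLength()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
    }
}
```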

InputSplit in Hadoop
     The data to be processed by an individual Mapper is represented by an InputSplit. The split is divided into records, and each record (a key-value pair) is processed by the map function. The number of map tasks equals the number of InputSplits. Initially, the data for a MapReduce job is stored in input files, which typically reside in HDFS. InputFormat defines how these input files are split and read, and it is responsible for creating the InputSplits.
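
To make this concrete, here is a minimal driver sketch showing where InputFormat fits into a job; the class name Driver and the input path /data/input are placeholders, and mapper, reducer, and output settings are omitted:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-demo");
        job.setJarByClass(Driver.class);

        // The chosen InputFormat decides both how the input files are
        // split and how each split is parsed into records. TextInputFormat
        // presents each line as a (byte offset, line text) key-value pair.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));
    }
}
```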

Comparison Between InputSplit and Block in Hadoop
1. Size
Block – By default, the HDFS block size is 128 MB, which you can change as per your requirement (via the dfs.blocksize property). All blocks of a file are the same size except the last block, which can be either the same size or smaller. The Hadoop framework breaks files into 128 MB blocks and then stores them in the Hadoop file system.

InputSplit – By default, the InputSplit size is approximately equal to the block size, but it is user-defined: in a MapReduce program the user can control the split size based on the size of the data, as the sketch below shows.
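
For example, FileInputFormat exposes per-job minimum and maximum split sizes (backed by the mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize properties). A minimal sketch; the 256 MB and 64 MB values are illustrative, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());

        // Raise the minimum so one split (and hence one mapper)
        // covers two 128 MB blocks.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // Or lower the maximum so each 128 MB block yields two
        // splits; in practice you would set only one of the two.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}
```

Larger splits mean fewer map tasks and less scheduling overhead; smaller splits mean more parallelism per file.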

2. Data Representation
Block – An HDFS block is the physical representation of data in Hadoop. It is the minimum amount of data that can be read or written.

InputSplit – A MapReduce InputSplit is the logical representation of the data present in the blocks. It is used during data processing in a MapReduce program or other processing techniques. The main thing to note is that an InputSplit doesn't contain the actual data; it is just a reference to the data.
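
This is easy to verify: asking an InputFormat for its splits reads no file contents at all, only metadata. A sketch, assuming a job whose input path has been configured (/data/input is again a placeholder):

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Each FileSplit is only a (path, offset, length) reference
        // plus preferred host locations -- no data is copied here.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            FileSplit fileSplit = (FileSplit) split;
            System.out.println(fileSplit.getPath()
                    + " start=" + fileSplit.getStart()
                    + " length=" + fileSplit.getLength());
        }
    }
}
```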

3. Example of Block vs InputSplit in Hadoop
     Suppose we need to store a file in HDFS. HDFS stores files as blocks: a block is the smallest unit of data that can be stored or retrieved from disk, and the default block size is 128 MB. HDFS breaks files into blocks and stores these blocks on different nodes in the cluster. So for a file of 130 MB, HDFS creates two blocks: one of 128 MB and one of 2 MB.
     Now, if each block were processed independently, the second block could not be processed on its own, because a record spanning the block boundary would be cut in half. InputSplit solves this problem: a split logically groups the blocks as a single unit, because the InputSplit includes the location of the next block and the byte offset of the data needed to complete the record.
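For the curious, FileInputFormat derives the split size by clamping the block size between the configured minimum and maximum; the one-line rule below mirrors computeSplitSize in the Hadoop source, and the default values shown are assumptions about an unmodified configuration:

```java
public class SplitSizeDemo {
    // Mirrors FileInputFormat.computeSplitSize: the block size,
    // clamped between the configured min and max split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // HDFS default
        long minSize = 1L;                   // default minimum
        long maxSize = Long.MAX_VALUE;       // default maximum

        // With defaults, split size equals block size: 134217728 bytes.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));
    }
}
```

FileInputFormat also tolerates a final split up to roughly 10% larger than the computed split size, so the 130 MB file above can in fact be described by a single 130 MB split that spans both blocks.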

Next Tutorial : Hadoop Map only Job

Previous Tutorial : OutputFormat in MapReduce
 
