Wednesday, 14 March 2018

Map Reduce Tutorial

Map Reduce: Introduction and Need
     Data is everywhere. There is no way to measure the exact total volume of data stored electronically, but it is estimated to be in zettabytes (a zettabyte is one billion terabytes). So, we have a lot of data, and we are struggling to store and analyze it. Now let's try to figure out what is hard about dealing with such an enormous volume of data. The answer is the same thing that makes copying data from a PC to a portable drive slow: speed. Over the years the storage capacity of hard drives has increased enormously, but the rate at which data can be read from drives has not kept up, and writing is even slower.
Solution:
     Read from multiple disks at a time. For instance, if we have to store 100 GB of data and we have 100 drives available, it is faster to read the data from 100 drives, each holding 1 GB of it, than from a single drive holding all 100 GB.

Problem with solution:
1. When using a large number of pieces of hardware (hard disks), there is a high probability that one or two will fail.
2. The data read from the different hard disks has to be combined correctly.
     MapReduce provides a programming model that abstracts the problem away from disk reads and writes, transforming it into a computation over sets of keys and values. So, MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.

Introduction to MapReduce
1. MapReduce is the processing layer of Hadoop. It is the heart of Hadoop.
2. MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.
3. We only need to fit our business logic into the way MapReduce works; the rest is taken care of by the framework. The work (complete job) submitted by the user to the master is divided into small pieces of work (tasks) and assigned to the slaves.
4. MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. In MapReduce, the input is a list, and the framework converts it into an output that is again a list, as the small example below illustrates.
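
     As a rough illustration of this list-in, list-out idiom (plain Java here, not the Hadoop API; the input lines are made up for the example), a word count can be expressed as a map step that turns every line into words and a reduce step that sums up a count per word:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ListIdiomSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("deer bear river", "car car river");

        Map<String, Integer> counts = lines.stream()
                // Map idiom: every input element (a line) becomes a list of words.
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // Reduce idiom: elements sharing the same key (word) are merged by summing.
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));

        System.out.println(counts);   // e.g. {river=2, car=2, bear=1, deer=1}
    }
}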

MapReduce – High level Understanding
1. Map-Reduce divides the work into small parts, each of which can be done in parallel on the cluster of servers.
2. A problem is divided into a large number of smaller problems, each of which is processed to give an individual output. These individual outputs are further processed to give the final output.
3. Hadoop Map-Reduce is scalable and can also be used across many computers.
4. Many small machines can be used to process jobs that could not be processed by a large machine.

MapReduce Terminologies
     MapReduce is the data processing component of Hadoop. MapReduce programs transform lists of input data elements into lists of output data elements. A MapReduce program does this twice, using two different list-processing idioms:
  • Map
  • Reduce
     In between Map and Reduce, there is a small phase called Shuffle and Sort. Let’s understand the basic terminology used in MapReduce.
Job
1. A “complete program” – an execution of a Mapper and Reducer across a data set.
2. A job is an execution of the two processing layers, i.e. mapper and reducer. A MapReduce job is a piece of work that the client wants to be performed.
3. It consists of the input data, the MapReduce program, and configuration info.
4. So, the client needs to submit the input data, write the MapReduce program, and set the configuration info (some of this is provided during Hadoop setup in the configuration files, and some configurations specific to our MapReduce job are set in the program itself), as the driver sketch below illustrates.
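
     For the Hadoop Java API, a minimal job driver might look like the sketch below (a word-count job is assumed; WordCount, WordCountMapper and WordCountReducer are placeholder class names, with the mapper and reducer sketched later in this tutorial; the input and output paths are taken from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // cluster settings from the configuration files
        Job job = Job.getInstance(conf, "word count");  // the "complete program" the client submits

        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);      // placeholder mapper (sketched in the Map section)
        job.setReducerClass(WordCountReducer.class);    // placeholder reducer (sketched in the Reduce section)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input data in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}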

Task
1. An execution of a Mapper or a Reducer on a slice of data. It is also called a Task-In-Progress (TIP); a task means that the processing of data is in progress, either on a mapper or on a reducer.
2. The master divides the work into tasks and schedules them on the slaves.

Task Attempt
1. A particular instance of an attempt to execute a task on a node. 
2. There is always a possibility that a machine can go down at any time. For example, if a node goes down while processing data, the framework reschedules the task on some other node. This rescheduling cannot go on forever; there is an upper limit for it as well. The default number of task attempts is 4: if a task (mapper or reducer) fails 4 times, the job is considered a failed job. For a high-priority or very large job, the number of task attempts can be increased, as sketched below.
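
     For example (assuming a Hadoop 2.x cluster, where the relevant property names are mapreduce.map.maxattempts and mapreduce.reduce.maxattempts, both 4 by default), the limit could be raised in the job driver before the job is created:

        Configuration conf = new Configuration();
        // Allow up to 8 attempts per map task and per reduce task instead of the default 4.
        conf.setInt("mapreduce.map.maxattempts", 8);
        conf.setInt("mapreduce.reduce.maxattempts", 8);
        Job job = Job.getInstance(conf, "high priority job");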

Map Abstraction
     Let us understand the abstract form of Map, the first phase of the MapReduce paradigm: what is a map/mapper, what is the input to the mapper, how does it process the data, and what is the output from the mapper? The map takes a key/value pair as input. Whether the data is in a structured or an unstructured format, the framework converts the incoming data into keys and values.
1. Key is a reference to the input value.
2. Value is the data set on which to operate.

Map Processing
1. A function defined by the user – the user can write custom business logic to process the data according to his need.
2. It is applied to every value in the input.

Map produces a new list of key/value pairs
1. The output of Map is called the intermediate output.
2. It can be of a different type from the input pair.
3. The output of the map is stored on the local disk, from where it is shuffled to the reduce nodes; a mapper sketch is given below.
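
     A minimal mapper sketch for the word-count job (the class and field names are placeholders; with the default TextInputFormat the input key is the byte offset of the line within the file and the input value is the line itself):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of this line (the reference to the input value)
        // value = one line of the input split
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate (word, 1) pair
        }
    }
}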

Reduce Abstraction
     Now let’s discuss the second phase of MapReduce – the Reducer: what is the input to the reducer, what work does the reducer do, and where does the reducer write its output?
     Reduce takes intermediate key/value pairs as input and processes the output of the mapper. Usually, in the reducer, we do aggregation- or summation-style computation.
1. The input given to the reducer is generated by Map (the intermediate output).
2. The key/value pairs provided to Reduce are sorted by key.

Reduce processing:
1. A function defined by the user – here too the user can write custom business logic and get the final output.
2. An iterator supplies the values for a given key to the Reduce function.

Reduce produces a final list of key/value pairs:
1. The output of Reduce is called the final output.
2. It can be of a different type from the input pair.
3. The output of Reduce is stored in HDFS; a reducer sketch is given below.
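
     A matching reducer sketch for the word-count job (the framework groups the intermediate pairs by key and hands the reducer each word together with an iterator over its counts):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // values = all intermediate counts emitted for this key, supplied by the framework's iterator
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);   // final (word, total) pair, written to HDFS by the output format
    }
}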

How Map and Reduce Work Together?  
     As shown in the figure above, the input data given to the mapper is processed by the user-defined function written at the mapper. All the required complex business logic is implemented at the mapper level, so that the heavy processing is done by the mapper in parallel, since the number of mappers is much higher than the number of reducers. The mapper generates an output, the intermediate data, and this output goes as input to the reducer.
     This intermediate result is then processed by the user-defined function written at the reducer, and the final output is generated. Usually, very light processing is done in the reducer. This final output is stored in HDFS, and replication is done as usual.
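
     As a small worked illustration (the input lines are made up for this example), a word-count job over two input lines flows like this:

     Input to mapper 1: "deer bear river"          Input to mapper 2: "car car river"
     Mapper 1 output: (deer,1) (bear,1) (river,1)
     Mapper 2 output: (car,1) (car,1) (river,1)
     After shuffle and sort at the reducer: (bear,[1]) (car,[1,1]) (deer,[1]) (river,[1,1])
     Reducer output (written to HDFS): (bear,1) (car,2) (deer,1) (river,2)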


MapReduce Data Flow
     Now let’s understand the complete end-to-end data flow of Hadoop MapReduce: how the input is given to the mapper, how mappers process the data, where mappers write the data, how the data is shuffled from the mapper nodes to the reducer nodes, where the reducers run, and what type of processing should be done in the reducers.
     As seen in the diagram, each square block is a slave; there are 3 slaves in the figure. Mappers will run on all 3 slaves, and then a reducer will run on any one of the slaves. For simplicity of the figure, the reducer is shown on a different machine, but it will run on a mapper node only. The input to a mapper is 1 block at a time (a split equals a block by default). The output of a mapper is written to the local disk of the machine on which the mapper is running. Once the map finishes, this intermediate output travels to the reducer nodes (the nodes where the reducers will run).
     The reducer is the second phase of processing, where the user can again write his custom business logic; the output of the reducer is the final output, which is written to HDFS. By default, 2 mappers run at a time on a slave, which can be increased as per the requirements. It again depends on factors like the DataNode hardware, block size, machine configuration, etc. We should not increase the number of mappers beyond a certain limit, because it will decrease the performance.
     The mapper writes its output to the local disk of the machine it is working on. This is temporary data. The output of the mapper is also called the intermediate output. All mappers write their output to the local disk. As the first mapper finishes, its data (the output of the mapper) starts traveling from the mapper node to the reducer node. This movement of output from the mapper nodes to the reducer nodes is called the shuffle.
     The reducer is also deployed on one of the DataNodes. The output from all the mappers goes to the reducer. All these outputs from the different mappers are merged to form the input for the reducer, and this input also resides on local disk. The reducer is another processor where you can write custom business logic. It is the second stage of the processing. Usually we write aggregation, summation, etc. types of functionality in the reducer. The reducer then gives the final output, which it writes to HDFS.
     Map and Reduce are the stages of processing. They run one after the other: only after all the mappers complete their processing does the reducer start processing. Though each block is present at 3 different locations by default, the framework allows only 1 mapper to process a given block, so only 1 mapper will process a particular block out of its 3 replicas. The output of every mapper goes to every reducer in the cluster, i.e. every reducer receives input from all the mappers. The framework then indicates to the reducer that the whole of the data has been processed by the mappers and the reducer can start processing.
     The output from a mapper is partitioned into many partitions by the partitioner. Each of these partitions goes to a particular reducer based on some condition. Hadoop works on the key/value principle, i.e. the mapper and reducer get their input in the form of keys and values and write their output in the same form.
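
     The condition is typically a hash of the key. The sketch below (a hypothetical WordPartitioner for the word-count job, mirroring what Hadoop's default HashPartitioner does) shows how a custom partitioner decides which reducer a given key goes to; it would be registered on the job with job.setPartitionerClass(WordPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative, then spread the keys
        // evenly across the reducers; all pairs with the same key land on the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}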

Data Locality
     Hadoop is specifically designed to solve big data problems, so it has to deal with very large amounts of data, and it is not feasible to move such large data sets towards the computation. So, Hadoop provides a feature called data locality.
     The major drawback would otherwise be cross-switch network traffic due to the huge volume of data; to overcome this drawback, data locality came into the picture. Data locality refers to the ability to move the computation close to the node where the actual data resides, instead of moving large data to the computation. This minimizes network congestion and increases the overall throughput of the system.
     In Hadoop, datasets are stored in HDFS: they are divided into blocks and stored across the DataNodes of the Hadoop cluster. When a user runs a MapReduce job, the NameNode sends the MapReduce code to the DataNodes on which the data related to the MapReduce job is available.

Categories of Data Locality in Hadoop
     Below are the various categories into which Hadoop data locality is divided:
1. Data local data locality in Hadoop
     When the data is located on the same node as the mapper working on it, this is known as data-local data locality. In this case, the data is very close to the computation. This is the most preferred scenario.

2. Intra-Rack data locality in Hadoop
     It is not always possible to execute the mapper on the same DataNode as the data, due to resource constraints. In such a case, it is preferred to run the mapper on a different node but on the same rack.

3. Inter-Rack data locality in Hadoop
     Sometimes it is not possible to execute the mapper even on a different node in the same rack, due to resource constraints. In such a case, we execute the mapper on a node in a different rack. This is the least preferred scenario.

Next Tutorial : Map Reduce Tutorial Part 2

Previous Tutorial : HDFS Federation in Hadoop 
 
