Hadoop Shuffling and Sorting

Hadoop Shuffling and Sorting
Capture 53

The process of transferring data from the mappers to reducers is known as shuffling i.e., the process by which the system performs the sort and transfers the map output to the reducer as input. So, MapReduce shuffle phase is necessary for the reducers, otherwise, they would not have any input. As shuffling can start even before the map phase has finished so this saves some time and completes the tasks in lesser time.

The keys generated by the mapper are automatically sorted by MapReduce Framework, i.e. Before starting of reducer, all intermediate key-value pairs in MapReduce that are generated by mapper get sorted by key and not by value. Values passed to each reducer are not sorted; they can be in any order.

Sorting in Hadoop helps reducer to easily distinguish when a new reduce task should start. This saves time for the reducer. Reducer starts a new reduce task when the next key in the sorted input data is different than the previous. Each reduce task takes key-value pairs as input and generates key-value pair as output. Note that shuffling and sorting in Hadoop MapReduce is not performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).

Secondary Sorting in MapReduce

If we want to sort reducer’s values, then the secondary sorting technique is used as it enables us to sort the values (in ascending or descending order) passed to each reducer.

Hadoop Shuffling and Sorting
Scroll to top