Wednesday, 11 April 2018

OutputFormat in MapReduce

     OutputFormat checks the output specification of a MapReduce job before the job runs. It also determines which RecordWriter implementation is used to write the job's output to output files.

RecordWriter in Hadoop MapReduce
     The Reducer takes as input the set of intermediate key-value pairs produced by the mappers and runs a reduce function on them to generate zero or more output key-value pairs. RecordWriter writes these output key-value pairs from the Reducer phase to output files.

Hadoop Output Format
     The Hadoop RecordWriter takes output data from the Reducer and writes it to output files. How these output key-value pairs are written is determined by the OutputFormat. OutputFormat plays the same role on the output side that InputFormat plays on the input side. The OutputFormat implementations provided by Hadoop write to files on HDFS or the local disk. OutputFormat describes the output specification for a MapReduce job. Based on that specification:
1. MapReduce job checks that the output directory does not already exist. 
2. OutputFormat provides the RecordWriter implementation to be used to write the output files of the job. Output files are stored in a FileSystem.
     The FileOutputFormat.setOutputPath() method sets the output directory. Every Reducer writes a separate file into this common output directory.
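
     The two checks above can be modelled in plain Java. This is only a sketch that uses the local file system instead of Hadoop: the class OutputSpecModel and its methods are hypothetical names invented for illustration and are not part of the Hadoop API. The part-r-NNNNN file naming follows the convention of the newer MapReduce API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical stand-in for the output-specification check; not Hadoop code.
final class OutputSpecModel {

    // Models rule 1: the job is rejected if the output directory already exists.
    static void checkOutputSpecs(Path outputDir) throws IOException {
        if (Files.exists(outputDir)) {
            throw new IOException("Output directory " + outputDir + " already exists");
        }
    }

    // Each reducer writes its own file inside the common output directory.
    static String partFileName(int reducerIndex) {
        return String.format(Locale.ROOT, "part-r-%05d", reducerIndex);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job").resolve("out");

        checkOutputSpecs(out);          // passes: the directory does not exist yet
        Files.createDirectory(out);
        for (int reducer = 0; reducer < 3; reducer++) {
            Files.createFile(out.resolve(partFileName(reducer)));
        }

        try {
            checkOutputSpecs(out);      // a second run against the same directory
        } catch (IOException e) {
            System.out.println("Second run rejected: " + e.getMessage());
        }
    }
}
```

     Failing before any work starts is the point of the check: a long-running job cannot silently overwrite the results of an earlier run.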
Types of OutputFormat in MapReduce

1. TextOutputFormat
     TextOutputFormat is the default OutputFormat in MapReduce. It writes each (key, value) pair on its own line of a text file. Keys and values can be of any type, since TextOutputFormat turns them into strings by calling toString() on them. The key and value on each line are separated by a tab character, which can be changed using the mapreduce.output.textoutputformat.separator property. KeyValueTextInputFormat is suited to reading these output text files back, since it breaks lines into key-value pairs based on a configurable separator.
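
     The line layout described above can be sketched in plain Java. This is a model of the behaviour, not Hadoop's implementation: TextLineModel and formatLine are hypothetical names, and null stands in for a missing key or value (the model assumes that a missing side is skipped together with the separator, and that nothing is written when both are missing).

```java
// Hypothetical model of how one text output line is laid out; not Hadoop code.
final class TextLineModel {

    static final String DEFAULT_SEPARATOR = "\t";

    // Builds the text for one record: key, separator, value, newline.
    static String formatLine(Object key, Object value, String separator) {
        if (key == null && value == null) {
            return "";                         // nothing to write
        }
        StringBuilder line = new StringBuilder();
        if (key != null) {
            line.append(key.toString());       // any type works via toString()
        }
        if (key != null && value != null) {
            line.append(separator);
        }
        if (value != null) {
            line.append(value.toString());
        }
        return line.append('\n').toString();
    }

    public static void main(String[] args) {
        System.out.print(formatLine("apple", 3, DEFAULT_SEPARATOR));
        System.out.print(formatLine("apple", 3, ","));   // custom separator
    }
}
```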

2. SequenceFileOutputFormat
     SequenceFileOutputFormat writes sequence files as its output. It is a good intermediate format between MapReduce jobs because sequence files are compact and readily compressible, and arbitrary data types are serialized to the file quickly. The corresponding SequenceFileInputFormat deserializes the file into the same types and presents the data to the next mapper in the same form in which the previous reducer emitted it. Compression is controlled by static methods on SequenceFileOutputFormat.

3. SequenceFileAsBinaryOutputFormat
     SequenceFileAsBinaryOutputFormat is a variant of SequenceFileOutputFormat. It writes keys and values to a sequence file in raw binary format.

4. MapFileOutputFormat
     MapFileOutputFormat is another form of FileOutputFormat, which writes its output as map files. Keys must be added to a MapFile in order, so we need to ensure that the reducer emits keys in sorted order.
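
     The ordering requirement can be illustrated with a small plain-Java model. SortedAppendModel is a hypothetical class written for this post, not the Hadoop MapFile writer; it assumes that a key smaller than the previous one is rejected while equal keys are accepted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of an append-only file that requires sorted keys; not Hadoop code.
final class SortedAppendModel<K extends Comparable<K>> {

    private final List<K> keys = new ArrayList<>();
    private K lastKey;

    // Appends a key, rejecting any key that sorts before the previous one.
    void append(K key) {
        if (lastKey != null && key.compareTo(lastKey) < 0) {
            throw new IllegalStateException("key out of order: " + key + " after " + lastKey);
        }
        keys.add(key);
        lastKey = key;
    }

    int size() {
        return keys.size();
    }

    public static void main(String[] args) {
        SortedAppendModel<String> file = new SortedAppendModel<>();
        file.append("apple");
        file.append("banana");
        try {
            file.append("avocado");   // sorts before "banana"
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

     Reducer input is already sorted by key within a partition, so a reducer that emits its input keys unchanged satisfies this requirement without extra work.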

5. MultipleOutputs
     MultipleOutputs allows writing data to files whose names are derived from the output keys and values, or in fact from an arbitrary string.

6. LazyOutputFormat
     FileOutputFormat subclasses create output files even when those files are empty. LazyOutputFormat is a wrapper OutputFormat that ensures an output file is created only when the first record is emitted for a given partition.
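
     The idea behind the lazy wrapper can be sketched in plain Java as deferred creation. LazyWriterModel is a hypothetical class for illustration only, and an in-memory list stands in for the output file; it is not how Hadoop implements LazyOutputFormat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical model of lazy output creation; the "file" is an in-memory list.
final class LazyWriterModel {

    private final Supplier<List<String>> fileFactory;
    private List<String> file;   // stays null until the first record arrives

    LazyWriterModel(Supplier<List<String>> fileFactory) {
        this.fileFactory = fileFactory;
    }

    // Creates the underlying "file" on the first write only.
    void write(String record) {
        if (file == null) {
            file = fileFactory.get();
        }
        file.add(record);
    }

    boolean fileCreated() {
        return file != null;
    }

    public static void main(String[] args) {
        LazyWriterModel emptyPartition = new LazyWriterModel(ArrayList::new);
        LazyWriterModel busyPartition = new LazyWriterModel(ArrayList::new);
        busyPartition.write("apple\t3");

        System.out.println("empty partition created a file: " + emptyPartition.fileCreated());
        System.out.println("busy partition created a file: " + busyPartition.fileCreated());
    }
}
```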

7. DBOutputFormat
     DBOutputFormat in Hadoop is an OutputFormat for writing to relational databases. It sends the reduce output to a SQL table. It accepts key-value pairs in which the key has a type implementing DBWritable. The returned RecordWriter writes only the key to the database, using a batch SQL query.
