Hadoop's Map-Reduce Process
Hadoop MapReduce, a powerful framework for processing large data sets, supports multiple file formats to cater to various data types and processing needs. Here's a breakdown of the common file formats used in MapReduce operations.
TextInputFormat (Default)
This is the default input format in Hadoop, treating each line of a text file as a record. Each record is presented to the mapper as a key-value pair: the key is the byte offset of the line within the file and the value is the content of the line. This makes it a natural fit for plain text files.
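A minimal sketch of a mapper under TextInputFormat, assuming the standard Hadoop client libraries; the class name LineMapper is hypothetical, but the input types (LongWritable offset, Text line) are what this format supplies:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: TextInputFormat hands each line to map() as
// (byte offset of the line, content of the line).
public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit the line keyed by its content, keeping the offset as the value.
        context.write(line, offset);
    }
}
```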
KeyValueTextInputFormat
This format splits each line into a key-value pair at a configurable delimiter (a tab character by default). It is useful for text files that already carry an explicit key-value pair per line.
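A minimal sketch of selecting this format and changing the delimiter, assuming the new (mapreduce) API; the job name and separator are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Split each line at the first ',' instead of the default tab.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
        Job job = Job.getInstance(conf, "kv-example"); // illustrative job name
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // The mapper now receives a Text key and Text value for every line.
    }
}
```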
SequenceFileInputFormat
This format reads sequence files, which store data in a binary key-value format optimized for fast I/O and efficient data exchange between MapReduce jobs.
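As a rough sketch of how such a file comes into being (assuming the standard Hadoop client libraries and a hypothetical local path), one can write a sequence file like this and later feed it to a job via SequenceFileInputFormat:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("counts.seq"); // hypothetical output path
        // Records are appended as binary (Text, IntWritable) pairs.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("apple"), new IntWritable(3));
        }
    }
}
```

A downstream job would then call job.setInputFormatClass(SequenceFileInputFormat.class) to read these pairs back with their original Writable types.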
SequenceFileAsTextInputFormat
This format reads sequence files but converts the binary keys and values to text format for processing.
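A minimal sketch of selecting this variant; the only difference from SequenceFileInputFormat is that the mapper sees Text on both sides:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat;

public class SequenceAsTextSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "seq-as-text"); // illustrative name
        // Keys and values arrive in the mapper as Text (via toString()),
        // regardless of the Writable types stored in the sequence file.
        job.setInputFormatClass(SequenceFileAsTextInputFormat.class);
    }
}
```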
Avro
Avro is a row-based serialization format with a JSON schema. It is well integrated with Hadoop and MapReduce. Avro supports schema evolution and efficient compression, making it suitable for large, complex datasets processed by MapReduce.
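A minimal sketch of wiring Avro input into a job, assuming the avro-mapred dependency is on the classpath; the Word record schema is hypothetical:

```java
import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AvroJobSetup {
    public static void main(String[] args) throws Exception {
        // Hypothetical record schema, written in Avro's JSON schema language.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Word\",\"fields\":["
          + "{\"name\":\"text\",\"type\":\"string\"},"
          + "{\"name\":\"count\",\"type\":\"int\"}]}");
        Job job = Job.getInstance(new Configuration(), "avro-example");
        // Each mapper receives AvroKey<GenericRecord> keys conforming to
        // the schema (with NullWritable values).
        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, schema);
    }
}
```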
Besides these, Hadoop's underlying filesystem (commonly HDFS) can store data in virtually any format; the ones above are simply the input formats MapReduce understands directly for processing.
Each file stored in HDFS is broken into smaller parts called input splits. The number of mappers for an input file equals the number of input splits of that file: one mapper runs per split, consuming the key-value pairs produced from it. By default, a MapReduce job runs with a single reducer, although this count is configurable.
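A minimal sketch of the one knob that is directly settable, assuming the mapreduce API; the mapper count, by contrast, falls out of the split count and cannot be fixed explicitly:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParallelismSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "parallelism"); // illustrative name
        // The default is a single reducer; raise it to spread the reduce
        // work (and its output files) across four tasks.
        job.setNumReduceTasks(4);
    }
}
```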
In the Map phase, the Map task processes each input record; the classic illustration of MapReduce is counting the occurrences of each word in a file. In the Shuffle phase, all values associated with an identical key are grouped together and routed to the same reducer. In the Reduce phase, the Reduce task aggregates those grouped values; for word count, it sums them, so the output shows the count of each word in the file. The job generates one output file per reducer, and the final processed output is stored in the specified output directory. Thus, Hadoop MapReduce supports multiple file formats tailored to different data types and processing needs, including plain text, key-value text, binary sequence files, and schema-based serialized formats like Avro.
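For concreteness, a minimal sketch of the word count map and reduce tasks, assuming the standard Hadoop client libraries (class names are illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit (word, 1) for every token on the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the shuffle has grouped all the ones for a word together;
    // summing them yields that word's count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```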