MapReduce's Manager-Worker Architecture

Let's study the architecture of MapReduce and the guarantees it provides.

The manager node is responsible for scheduling tasks for worker nodes and managing their execution, as shown in the following illustration:

Architecture of MapReduce

Apart from the definition of the map and reduce functions, the user can also specify the number M of map tasks, the number R of reduce tasks (which also determines the number of output files), and a partitioning function that defines how the key-value pairs emitted by the map tasks are partitioned before being processed by the reduce tasks. By default, a hash partitioner is used that selects a reduce task using the formula hash(key) mod R.
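The following is a minimal sketch of that default partitioning scheme. The function name and the use of Python's built-in hash() are illustrative assumptions, not the framework's actual API; the point is only that every occurrence of a given key is routed to the same reduce task.

```python
def default_partitioner(key: str, num_reduce_tasks: int) -> int:
    """Return the index of the reduce task responsible for this key.

    A real implementation would use a stable hash function rather than
    Python's per-process hash(), but the routing logic is the same.
    """
    return hash(key) % num_reduce_tasks

# All occurrences of the same key land on the same reduce task:
assert default_partitioner("apple", 4) == default_partitioner("apple", 4)
```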

Steps for the execution of MapReduce

The execution of MapReduce proceeds in the following way:

  • The framework divides the input files into M pieces, called input splits, typically between 16 and 64 MB per split.

  • It then starts an instance of a manager node and multiple worker node instances on an existing cluster of machines.

  • The manager selects idle worker nodes and assigns map tasks to them.

  • A worker node assigned a map task reads the contents of the associated input split, parses key-value pairs, and passes them to the user-defined map function. The entries emitted by the map function are buffered in memory and periodically written to the local disk, partitioned into R regions using the partitioning function. When a worker node completes a map task, it sends the locations of these local files to the manager node.

  • The manager node assigns reduce tasks to worker nodes, providing the locations of the associated files. These worker nodes then perform remote procedure calls (RPCs) to read the data from the local disks of the map workers. The data is first sorted so that all occurrences of the same key are grouped together and then passed to the reduce function. If the data is too large to fit in memory, external sorting is used.

  • When all map and reduce tasks are completed, the manager node returns control to the user program. After successful completion, the output of the MapReduce job is available in the R output files, which can either be merged or passed as input to a separate MapReduce job. A simplified sketch of this end-to-end flow is shown after this list.
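The sketch below simulates the execution flow described above in a single process, using word count as the map and reduce functions. In the real framework, map and reduce tasks run on separate worker nodes and intermediate data lives on local disks; here everything is kept in memory purely for illustration, and all names are illustrative assumptions.

```python
from collections import defaultdict

def map_fn(document: str):
    # Emit (word, 1) for every word in the input split.
    for word in document.split():
        yield word, 1

def reduce_fn(key, values):
    # Sum the counts for a single key.
    return key, sum(values)

def run_job(input_splits, R):
    # Map phase: each split is processed independently, and its output is
    # partitioned into R regions using hash(key) mod R.
    regions = [defaultdict(list) for _ in range(R)]
    for split in input_splits:
        for key, value in map_fn(split):
            regions[hash(key) % R][key].append(value)

    # Reduce phase: each region is sorted by key, then every key and its
    # grouped values are passed to the reduce function. One output "file"
    # (here, a list) is produced per reduce task.
    outputs = []
    for region in regions:
        outputs.append([reduce_fn(k, region[k]) for k in sorted(region)])
    return outputs

print(run_job(["the quick brown fox", "the lazy dog"], R=2))
```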

The manager node communicates with every worker periodically in the background in the form of a heartbeat. If no response is received within a specified amount of time, the manager node considers the worker node failed and reschedules all of its tasks for re-execution.
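One way to picture this failure-detection loop is sketched below. The data structures, function names, and timeout value are assumptions made for illustration, not the framework's actual implementation; the idea is simply to track the last heartbeat from each worker and reschedule the tasks of any worker that goes silent.

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a worker is considered failed

last_heartbeat = {}   # worker_id -> timestamp of the last heartbeat received
assigned_tasks = {}   # worker_id -> tasks currently assigned to that worker
pending_tasks = []    # tasks waiting to be (re)scheduled

def record_heartbeat(worker_id):
    last_heartbeat[worker_id] = time.monotonic()

def detect_failures():
    now = time.monotonic()
    for worker_id, seen in list(last_heartbeat.items()):
        if now - seen > HEARTBEAT_TIMEOUT:
            # Re-schedule all of the failed worker's tasks for re-execution.
            pending_tasks.extend(assigned_tasks.pop(worker_id, []))
            del last_heartbeat[worker_id]
```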

More specifically, completed reduce tasks do not need to be rerun, since their output files are stored in an external file system. However, map tasks are rerun regardless of whether they had completed, since their output is stored on the local disk of the failed worker and is therefore inaccessible to the reduce tasks that need it.

Note that network partitions between the manager node and worker nodes might lead to multiple executions of a single map or reduce task.

Duplicate map task executions are deduplicated at the manager node, which ignores completion messages for already completed map tasks. Reduce tasks write their output to a temporary file, which is atomically renamed when the reduce task completes.

Note that this atomic rename operation provided by the underlying file system guarantees that the output files will contain just the data produced by a single execution of each reduce task. However, if the map or reduce functions defined by the application code have additional side effects (e.g., writing to external datastores), the framework does not provide any guarantees. The application writer must make sure these side effects are atomic and idempotent, since the framework might trigger them more than once as part of a task re-execution.
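The write-to-temporary-file-then-rename pattern can be sketched as follows. The file naming scheme is an illustrative assumption; the key point is that the rename is atomic, so readers observe either no output file or the complete output of exactly one execution of the reduce task.

```python
import os

def write_reduce_output(reduce_task_id: int, records):
    tmp_path = f"part-{reduce_task_id:05d}.tmp"
    final_path = f"part-{reduce_task_id:05d}"

    # Write the full output to a temporary file first.
    with open(tmp_path, "w") as f:
        for key, value in records:
            f.write(f"{key}\t{value}\n")

    # Atomic rename: even if two duplicate executions race, the final file
    # contains the complete output of exactly one of them.
    os.replace(tmp_path, final_path)
```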

Storage

Input and output files are usually stored in a distributed file system, such as HDFS or GFS. MapReduce can take advantage of this to perform several optimizations, such as scheduling map tasks to worker nodes that contain a replica of the corresponding input to minimize network traffic or aligning the size of input splits to the block size of the file system.
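A rough sketch of that locality-aware scheduling decision is shown below. The data structures and function name are assumptions for illustration: the manager prefers an idle worker that already holds a replica of the map task's input split and falls back to any idle worker otherwise.

```python
def pick_worker_for_map_task(split_replicas, idle_workers):
    """split_replicas: set of worker ids holding a replica of the input split.
    idle_workers: list of currently idle worker ids."""
    for worker in idle_workers:
        if worker in split_replicas:
            return worker  # data-local assignment: no network transfer of the split
    # Otherwise, fall back to any idle worker (or none, if all are busy).
    return idle_workers[0] if idle_workers else None

print(pick_worker_for_map_task({"w2", "w5"}, ["w1", "w2", "w3"]))  # -> "w2"
```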

Guarantees provided by MapReduce

The MapReduce framework guarantees that the intermediate key-value pairs are processed in increasing key order within a given partition. This ordering guarantee makes it easy to produce a sorted output file per partition, which is helpful for use cases that need to support efficient random access lookups by key or need sorted data in general.

Furthermore, some use cases would benefit from some form of pre-aggregation at the map level to reduce the amount of data transferred between the map and reduce tasks. This was evident in the example presented above, where a single map task would emit multiple entries for each occurrence of a word instead of a single entry with the total number of occurrences. For this reason, the framework also allows the application code to provide a combine function. This function has the same type as the reduce function and is run as part of the map task in order to pre-aggregate the data locally.
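The sketch below illustrates this for word count. The names are illustrative assumptions; what matters is that the combine function has the same signature as the reduce function and runs on the map side, so a map task emits one ("word", count) entry per distinct word in its split instead of one ("word", 1) entry per occurrence, shrinking the data shuffled to the reduce tasks.

```python
from collections import defaultdict

def combine_fn(key, values):
    # Same logic and type as the reduce function in the word-count example.
    return key, sum(values)

def map_task_with_combiner(split: str):
    buffered = defaultdict(list)
    for word in split.split():
        buffered[word].append(1)
    # Pre-aggregate locally before the intermediate data is written out.
    return [combine_fn(word, counts) for word, counts in buffered.items()]

print(map_task_with_combiner("the quick brown fox jumps over the lazy dog"))
```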
