MapReduce's Manager-Worker Architecture
Let's study the architecture of MapReduce and the guarantees provided by it.
We'll cover the following
The manager node is responsible for scheduling tasks for worker nodes and managing their execution, as shown in the following illustration:
Apart from the definition of the map
and reduce
functions, the user can also specify the number M
of map
tasks, the number R
of reduce
tasks. MapReduce can also specify the number of input or output files, and a partitioning function that defines how key-value pairs from the map
tasks are partitioned before being processed by the reduce
tasks. By default, a hash partitioner is used that selects a reduce
task using the formula hash(key) mod R
.
Steps for the execution of MapReduce
The execution of MapReduce proceeds in the following way:
-
The framework divides the input files into
M
pieces, called input splits, typically between 16 and 64 MB per split. -
It then starts an instance of a manager node and multiple worker node instances on an existing cluster of machines.
-
The manager selects idle worker nodes and assigns
map
tasks to them. -
A worker node assigned a
map
task reads the contents of the associated input split parses key-value pairs and passes them to the user-definedmap
function. The entries emitted by the map function are buffered in memory and periodically written to the local disk, partitioned intoR
regions using the partitioning function. When a worker node completes amap
task, it sends the location of the local file to the manager node. -
The manager node assigns
reduce
tasks to worker nodes providing the location to the associated files. These worker nodes then perform remote procedure calls (RPCs) to read the data from the local disks of the map workers. The data is first sorted so that all occurrences of the same key are grouped together and then passed into the reduce function.
If the size is prohibitively large to fit in memory, external sorting is used.
- When all
map
andreduce
tasks are completed, the manager node returns the control to the user program. After successful completion, the output of the MapReduce job is available in theR
output files that can either be merged or passed as input to a separate MapReduce job.
The manager node communicates with every worker periodically in the background in a form of a heartbeat. If no response is received for a specific time, the manager node considers the worker node as failed and re-schedules all its tasks for re-execution.
More specifically, completed reduce
tasks do not need to be rerun since the output files are stored in an external file system. However, map
tasks are rerun regardless of whether they were completed since their output is stored on the local disk and is therefore inaccessible to the reduce
tasks that need it.
Note that the network partitions between the manager node and worker nodes might lead to multiple executions of a single
map
orreduce
tasks.
Duplicate map
tasks executions are deduplicated at the manager node, which ignores completion messages for already completed map
tasks. The reduce
tasks write their output to a temporary file, atomically renamed when the reduce task completes.
Note that this atomic rename operation provided by the underlying file system guarantees that the output files will contain just the data produced by a single execution of each
reduce
task. However, if themap
orreduce
functions defined by the application code have additional side effects (e.g., writing to external datastores), the framework does not provide any guarantees. The application writer must make sure these side effects are atomic and idempotent since the framework might trigger them more than once as part of a task re-execution.
Storage
Input and output files are usually stored in a distributed file system, such as HDFS or GFS. MapReduce can take advantage of this to perform several optimizations, such as scheduling map
tasks to worker nodes that contain a replica of the corresponding input to minimize network traffic or aligning the size of input splits to the block size of the file system.
Guarantees provided by MapReduce
The MapReduce framework guarantees that the intermediate key-value pairs are processed in increasing key order within a given partition. This ordering guarantee makes it easy to produce a sorted output file per partition, which is helpful for use cases that need to support efficient random access lookups by key or need sorted data in general.
Furthermore, some use-cases would benefit from some form of pre-aggregation at the map level to reduce the amount of data transferred between map
and reduce
tasks. This was evident in the example presented above. A single map would emit multiple entries for each occurrence of a word instead of a single entry with the number of occurrences. For this reason, the framework allows the application code also to provide a combine
function. This method has the same type as the reduce
function and is run as part of the map task in order to pre-aggregate the data locally.
Get hands-on with 1400+ tech skills courses.