Introduction to MapReduce

Let's find the inspiration behind MapReduce. Also, look into the goal and the working of MapReduce.

MapReduce is a framework for batch data processing originally developed internally in Google by Dean et al… It was later incorporated in the wider Apache Hadoop framework.

Inspiration

The framework draws inspiration from the field of functional programming and is based on the following main idea:

Idea

Many real-word computations can be expressed with the use of two main primitive functions, map and reduce.

Map

The map function processes a set of key-value pairs and produces another set of intermediate key-value pairs as output.

Reduce

The reduce function receives all the values for each key and returns a single value, essentially merging all the values according to some logic

An important property of map and reduce functions

The map and reduce functions can easily be parallelized and run across multiple machines for different parts of the dataset. As a result,

  • The application code is responsible for defining these two methods.
  • The framework is responsible for partitioning the data, scheduling the program’s execution across multiple nodes, handling node failures, and managing the required inter-machine communication.

Working of MapReduce

Let’s see a typical example to understand better how this programming model practically works.

Suppose we have a huge collection of documents (e.g., webpages), and we need to count the number of occurrences for each word. To achieve that via MapReduce, we would use the following functions:

// key: the document name
// value: the document contents
map(String key, String value) {
    for(word: value.split(" ")) {
        emit(word, 1)
} }
reduce(String key, Iterator<Integer> values) {
    for(value: values) {
        result += value;
}
    emit(key, result);
}

The map function in the above code would emit a single record for each word with the value 1, while the reduce function would just count all these entries and return the final sum for each word.

Note that the MapReduce framework is also based on a manager-worker architecture, which we will discuss in the next lesson.

Get hands-on with 1400+ tech skills courses.