Introduction to MapReduce
Let's find the inspiration behind MapReduce. Also, look into the goal and the working of MapReduce.
We'll cover the following
MapReduce is a framework for batch data processing originally developed internally in Google by Dean et al… It was later incorporated in the wider Apache Hadoop framework.
Inspiration
The framework draws inspiration from the field of functional programming and is based on the following main idea:
Idea
Many real-word computations can be expressed with the use of two main primitive functions, map and reduce.
Map
The map
function processes a set of key-value pairs and produces another set of intermediate key-value pairs as output.
Reduce
The reduce
function receives all the values for each key and returns a single value, essentially merging all the values according to some logic
An important property of map and reduce functions
The map
and reduce
functions can easily be parallelized and run across multiple machines for different parts of the dataset. As a result,
- The application code is responsible for defining these two methods.
- The framework is responsible for partitioning the data, scheduling the program’s execution across multiple nodes, handling node failures, and managing the required inter-machine communication.
Working of MapReduce
Let’s see a typical example to understand better how this programming model practically works.
Suppose we have a huge collection of documents (e.g., webpages), and we need to count the number of occurrences for each word. To achieve that via MapReduce, we would use the following functions:
// key: the document name
// value: the document contents
map(String key, String value) {
for(word: value.split(" ")) {
emit(word, 1)
} }
reduce(String key, Iterator<Integer> values) {
for(value: values) {
result += value;
}
emit(key, result);
}
The map
function in the above code would emit a single record for each word with the value 1, while the reduce
function would just count all these entries and return the final sum for each word.
Note that the MapReduce framework is also based on a manager-worker architecture, which we will discuss in the next lesson.
Get hands-on with 1400+ tech skills courses.