Exploring MapReduce Runs
Explore how changing the number of reducers influences MapReduce job output. Learn to run jobs with multiple reducers, examine output files, and navigate the Job History Server UI to track job execution on a Hadoop cluster.
In this lesson, we vary the number of reducers and observe how that affects the job's output. We'll increase the number of reducers to three.
Connect to the terminal below and execute the commands. Each command is explained later in the lesson. You can read the explanation first and then execute the commands in the terminal.
Start up the Hadoop cluster by running the command below:

```shell
/DataJek/startHadoop.sh
```

Next, upload the data file to HDFS:

```shell
hdfs dfs -copyFromLocal /DataJek/cars.data /
```
Now run the job with three reducers using the command below. The driver program takes the number of reducers as its third argument.

```shell
hadoop jar JarDependencies/MapReduceJarDependencies/MapReduce-1.0-SNAPSHOT.jar io.datajek.mapreduce.Driver /cars.data /MultipleReducers 3
```
If you run the program multiple times, change the output directory /MultipleReducers each time, because Hadoop will not overwrite an existing HDFS directory.
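Alternatively, if you prefer to reuse the same output path across runs, you can delete the previous run's output first with the standard HDFS removal command (a sketch, assuming the directory exists from an earlier run):

```shell
# Remove the previous run's output so the same output path can be reused.
hdfs dfs -rm -r /MultipleReducers
```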
Next, examine the output directory using the command below:

```shell
hdfs dfs -ls /MultipleReducers
```

You should observe the outcome shown in the screenshot.
The directory contains three files named according to the pattern part-r-0000*. The 'r' in the name indicates that the file is the output of a reduce task; the output of a map task would have an 'm' in place of the 'r'. We find three files because we requested three reducers.
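To see why every record for a given key lands in exactly one of these files, consider how Hadoop's default HashPartitioner assigns keys to reducers: `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The sketch below simulates that formula in plain Java. Note this is an approximation: Hadoop's `Text` type hashes the key's UTF-8 bytes, not `String.hashCode()`, so the actual reducer chosen for a key may differ from this simulation, and the sample keys besides ford are hypothetical.

```java
// Simplified simulation of Hadoop's default HashPartitioner.
// Real Hadoop uses Text.hashCode() (a hash over UTF-8 bytes), so the
// exact assignments may differ, but the mechanism is the same.
public class PartitionDemo {

    // Mirrors the HashPartitioner formula:
    // (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"ford", "toyota", "honda", "bmw"};
        for (String key : keys) {
            // Every record with this key maps to the same partition number,
            // which is why each part-r-0000* file holds complete keys.
            System.out.println(key + " -> part-r-0000" + partition(key, 3));
        }
    }
}
```

Because the partition is a pure function of the key, two mappers emitting the same key always route their records to the same reducer.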
Now execute the following commands one by one in the terminal to examine the contents of the output produced by each reducer:

```shell
hdfs dfs -text /MultipleReducers/part-r-00000
hdfs dfs -text /MultipleReducers/part-r-00001
hdfs dfs -text /MultipleReducers/part-r-00002
```

You should see three files in the result directory, one for each reducer.
The results should be similar to the following screenshots. Note how each reducer receives all the values for the keys assigned to it; a key's values are never split across reducers. In our run, the third reducer processed the key ford.

The first reducer produces:

The second reducer produces:

The third reducer processes a single key:
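The guarantee that a key's values are never split across reducers comes from the shuffle phase: map output is grouped by key, and each key's entire group is sent to the one reducer its partition number selects. The sketch below simulates this in plain Java with hypothetical (key, value) pairs; it is an illustration of the idea, not Hadoop's actual implementation.

```java
import java.util.*;

// Sketch of the shuffle phase: map output is partitioned by key, and all
// values for a key are grouped together for exactly one reducer.
public class ShuffleDemo {

    static Map<Integer, Map<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        Map<Integer, Map<String, List<Integer>>> reducers = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            // Same formula as the default HashPartitioner (simplified).
            int r = (kv.getKey().hashCode() & Integer.MAX_VALUE) % numReducers;
            reducers.computeIfAbsent(r, x -> new TreeMap<>())
                    .computeIfAbsent(kv.getKey(), x -> new ArrayList<>())
                    .add(kv.getValue());
        }
        return reducers;
    }

    public static void main(String[] args) {
        // Hypothetical (key, value) pairs emitted by the map tasks.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("ford", 1), Map.entry("toyota", 1),
                Map.entry("ford", 1), Map.entry("honda", 1));
        shuffle(mapOutput, 3).forEach((r, groups) ->
                System.out.println("reducer " + r + " gets " + groups));
    }
}
```

Running this shows that both ford records end up in the same reducer's group, mirroring what we observed in the part-r-0000* files.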
If you click the button in the widget below, you can interact with the UI of the Job History Server (JHS). Although we don't cover it in depth, the JHS keeps an archive of all the jobs run on the Hadoop cluster. The above job was run only once. The first time you click the widget, it may take some time for the job to complete and appear in the JHS web UI, which runs on port 19888. The job also shows up in the YARN web UI, but here we want to demonstrate the JHS UI.
Note that the UI will not load in the widget below. Click the URL link beside the message "Your app can be found at", or wait for the Firefox message "Open Site in New Window" and click that. The Job History Server UI may be slow to load, so be patient.