Hadoop MapReduce
# Requirements
Hadoop installed with one NameNode and at least three DataNodes.
# MapReduce
Traditionally, most computations operate on a model where the data is brought from an external database to the machine that holds the program. In addition to the parallelization that comes from the MapReduce paradigm, Hadoop also reverses this traditional model by bringing the program to the data.
For example, when a MapReduce job like grep or word count is performed, the program is taken from the NameNode (called the JobTracker in the MapReduce context) to the DataNodes (called TaskTrackers in the MapReduce context) where the relevant data blocks are located. Hadoop is relatively efficient about balancing the workload, but the details aren't too important at this point.
Let's dig a bit deeper into the various stages of a MapReduce job by looking at the standard "Hello World" program of Data Engineering: word count. Even though it's a cliché, we'll return to it several times, since it's very useful to see the same program repeated with different languages and tools. Additionally, it has become a standard question in Data Engineering interviews.
Before we look at the code, let’s review the various stages of MapReduce to see how the data transforms at each stage. For this example, let’s assume that the following text:
So call a big meeting
Get everyone out out
Make every Who holler
Make every Who shout shout
is already copied into HDFS on a cluster with a NameNode and 12 DataNodes. For pedagogical reasons, let’s also assume that the data has been split into 3 blocks:
| Block A | Block B | Block C |
|---|---|---|
| So call a big meeting | Get everyone out out<br>Make every Who holler | Make every Who shout shout |
with each block replicated 3 times and distributed throughout the DataNodes.
When the MapReduce job begins, the JobTracker brings the WordCount.java program to the DataNodes. Only three of the TaskTrackers will receive the program (one per block of data) but the remaining TaskTrackers will be ready to execute the program from the beginning if one of the original TaskTrackers fails in the middle of the job.
Once WordCount.java has been distributed, the following steps begin:
## Map Phase
- Record Reader - This takes the files in the input directory and turns them into records. Specifically, the files are read and turned into key-value pairs where the key is the location of each record and the value is the record content. For example, the block that begins “Get everyone…” will be split into two key-value pairs (one per line by default). The first will be (0, Get everyone out out) since 0 bytes have been read from the block so far and the second will be (21, Make every Who holler) since the second line begins after byte 21.
- Mapper - This is the important "business logic" of the Map Phase, where a user-defined Map function is applied to each record. Specifically, the Mapper takes the key-value pairs from the Record Reader and applies a function to them to produce intermediate key-value pairs. In our example, each line is parsed into words and each word is converted into a pair with the word as the key and 1 as the value (a minimal Mapper sketch appears after this list).
- Combiner - While the computing power and storage capacity of computers have increased exponentially over the years, network speeds have only grown linearly. This means that the worst bottleneck in the MapReduce process comes from moving the various pieces of data within the cluster network. The combiner is an optional but important step that addresses this problem by combining the data before sending it. The combiner acts like a local reducer on each machine and often uses the same code as the Reducer. In our example, any duplicate words on a given machine will be summed up, so (shout, 2) will be sent once instead of (shout, 1) being moved across the network twice. This may not seem worth it on such a small data set, but with many repeated words it can drastically improve performance.
- Partitioner - The partitioner is the hand-off between the Map and Reduce phases. Specifically, it splits the intermediate key-value pairs into separate "shards" so they can be processed by the Reduce phase.
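To make the Mapper concrete, here is a minimal word-count Mapper sketch using the org.apache.hadoop.mapreduce API. The class and field names are illustrative rather than taken from any particular WordCount.java; the input key is the byte offset supplied by the Record Reader and the input value is one line of text.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: input key = byte offset from the Record Reader,
// input value = one line of text; output = (word, 1) for every word found.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit (word, 1) for each token.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```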
After the Map phase is fully completed, the three TaskTrackers that were handling the Map phase begin handling the Reduce phase. In order for the data to be correctly Reduced, all occurrences of a given key across the cluster must be Reduced on the same machine. For example, a (Make, 1) from one TaskTracker and a (Make, 1) from another must be brought together on the same machine for the Reduce phase. At the same time, the most efficient way to process the data is to distribute the workload roughly evenly.
Both of these tasks are accomplished with a clever trick: the MD5 hash function is applied to each key to generate a 32-digit hexadecimal number, which is then taken modulo the number of active TaskTrackers. For example, the MD5 hash of the key 'every' is 83ab982dd08483187289a75163dc50fe (or 175019836766693582402690402259205640446 in decimal), which becomes 1 when modded by 3 (for the 3 active TaskTrackers), so every key-value pair with the key 'every' will be assigned to the TaskTracker marked as 1. Similarly, the modded hash for the key 'Who' is 0, so every 'Who' key-value pair will be assigned to the TaskTracker marked as 0. Since the MD5 algorithm generates an even distribution of hashes, the workload should be roughly even while still ensuring that identical keys are handled by the same machine. Finally, the partitioner writes the intermediate key-value pairs to local disk before the Reduce phase begins.
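As a sketch of this hash-and-mod idea, a custom partitioner along the following lines could be plugged in with job.setPartitionerClass. (Hadoop's stock HashPartitioner uses the key's hashCode() rather than MD5, but the principle of hashing the key and modding by the number of partitions is the same; the class name here is illustrative.)

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative sketch: MD5 the key, interpret the 128-bit digest as a
// non-negative integer, and take it modulo the number of partitions.
public class Md5Partitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.toString().getBytes(StandardCharsets.UTF_8));
            // e.g. MD5("every") mod 3 = 1, as in the walkthrough above.
            return new BigInteger(1, digest)
                    .mod(BigInteger.valueOf(numPartitions))
                    .intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```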
## Reduce Phase
- Shuffler - The Shuffler moves the intermediate key-value pairs to the machine where they will be reduced. Specifically, the pairs written to local disk are moved to the TaskTracker that the Partitioner assigned them to. This is when the cluster's network throughput peaks, and it can be a bottleneck in the MapReduce job.
- Sort - The sort phase simply sorts the key-value pairs on each machine (alphabetically, with upper-case preceding lower-case). This makes the data easier for the Reducer to process, since duplicate keys are placed next to each other.
- Reducer - This is the important "business logic" of the Reduce Phase, where a user-defined Reduce function is applied to the key-value pairs. Specifically, the Reduce function groups, filters, or aggregates the key-value pairs. In our example, any duplicate words on a given machine will be summed up, so (Make, 1) and (Make, 1) will be reduced to just (Make, 2). Note that this is the exact same code as the Combiner, but now all pairs with a given key are guaranteed to be on the same machine thanks to the Partition and Shuffle phases (a matching Reducer sketch follows this list).
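To round out the example, here is a matching word-count Reducer sketch (again with illustrative names). Because summing counts is associative, the same class could also be registered as the Combiner with job.setCombinerClass.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch: after the shuffle and sort, all values for a given
// word arrive together, so the reducer just sums them into a single count.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();   // e.g. (Make, 1) and (Make, 1) -> (Make, 2)
        }
        total.set(sum);
        context.write(word, total);
    }
}
```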
Find out more about the Insight Data Engineering Fellows Program in New York and Silicon Valley, apply today, or sign up for program updates.
You can also read our engineering blog here.