Somewhere between teaching a BI Bootcamp class and wrestling my troop of kids, I promised myself I would get a blog post in this week. Luckily, I've had a few code-heavy posts recently, so we will dial it back slightly as I briefly introduce MapReduce for Hadoop/HDInsight.
Most of the MapReduce posts I've seen to date talk very specifically about how to implement a C# MapReduce job on HDInsight. Before we go there, I think it's a topic that deserves a somewhat more abstract/academic discussion so that we have a better understanding of what MapReduce is (as well as what it is not).
Counting Words: An Introduction
Although the MapReduce technique is not new in the world of parallel programming, the current iteration that most are familiar with was introduced by Google in 2004. MapReduce is a programming model and parallel execution framework that allows you to process extremely large sets of data without all the messy details that would normally be involved. A large part of its power is that it abstracts away the network/disk interfaces, parallelization, load balancing and fault tolerance that would normally need to be handled by the programmer, and instead lets them focus on the problem or question at hand.
There is plenty written on the technical specifics behind MapReduce and topics like data locality, the overall MapReduce architecture and how the JobTracker/TaskTracker works. I don't want to gloss over these because I believe they're important, but I'll point you to the resources section for more info rather than covering them here.
The concept of MapReduce in its simplest form is straightforward, consisting of a Map() and a Reduce() function (typically written in Java). The Map() function is responsible for extracting interesting data from each record, and the Reduce() function aggregates, transforms and/or filters it. To better illustrate this, let's look at a typical word count example (the equivalent of the "Hello World" example).
We will define our imaginary scope as a bunch of web pages that we would like to parse and run a simple html tag count against. To begin, our diligent programmer will write a Map(K,V) function which reads in a record of data in the form of a key/value pair. In this example the key will be the URL of the document and the value will be the html body or contents of that URL. The Map function will shred the html to extract each html tag that it finds, emitting a key/value pair for each tag occurrence: the tag as the key and the number '1' as the value. Note that at this point no aggregation is occurring, so multiple key/value pairs can be emitted by the Map function for the same tag.
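To make the Map step concrete, here is a minimal sketch in plain Java. This is not the actual Hadoop Mapper API — the class and method names are purely illustrative — it just shows the shape of the work: shred an html string and emit one (tag, 1) pair per opening tag, with no aggregation.

```java
import java.util.*;
import java.util.regex.*;

// Illustrative sketch of the Map step (not the Hadoop Mapper API).
public class TagMap {
    // Matches opening tags like <p> or <div>; closing tags (</p>) are skipped.
    static final Pattern TAG = Pattern.compile("<\\s*([a-zA-Z][a-zA-Z0-9]*)");

    // Input record: key = URL of the document, value = its html contents.
    static List<Map.Entry<String, Integer>> map(String url, String html) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            // No aggregation yet: every tag occurrence is its own (tag, 1) pair.
            emitted.add(Map.entry(m.group(1).toLowerCase(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        var pairs = map("http://example.com",
                        "<html><body><p>hi</p><p>bye</p></body></html>");
        System.out.println(pairs); // [html=1, body=1, p=1, p=1]
    }
}
```

Notice the duplicate `p=1` pairs — resolving those is the job of the next two phases.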
After the Map function has run, a Shuffle/Sort process is run by the MapReduce framework prior to the Reduce function. The Shuffle/Sort process groups the key/value pairs emitted by Map by key, generating a list of all values for each key. This new intermediate key/list-of-values pair is the input sent to the Reduce function.
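Conceptually, the framework's Shuffle/Sort phase is doing something like the following sketch (again illustrative plain Java, not Hadoop internals): collect the emitted pairs into a sorted map from key to list of values.

```java
import java.util.*;

// Illustrative sketch of what Shuffle/Sort produces between Map and Reduce:
// (key, value) pairs grouped by key into (key, list-of-values).
public class ShuffleSort {
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>(); // sorted by key
        for (var p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        var pairs = List.of(Map.entry("p", 1), Map.entry("div", 1), Map.entry("p", 1));
        System.out.println(shuffle(pairs)); // {div=[1], p=[1, 1]}
    }
}
```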
Our ever diligent programmer can then use the Reduce function to aggregate, transform or even filter the results. In this example we are simply aggregating all the values to get a count by tag type, so the values for each key are summed up. The output from the Reduce function is then written to HDFS for consumption.
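The Reduce step for this example is just a sum over each key's list of values. A minimal sketch (illustrative names again, not the Hadoop Reducer API):

```java
import java.util.*;

// Illustrative sketch of the Reduce step for the tag count example.
public class TagReduce {
    // Receives one key plus the list of values Shuffle/Sort grouped for it.
    static int reduce(String tag, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total; // in a real job, (tag, total) would be written to HDFS
    }

    public static void main(String[] args) {
        System.out.println(reduce("p", List.of(1, 1, 1))); // 3
    }
}
```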
Behind the Scenes
Once you have your Map() and Reduce() functions complete, they are packaged as a job and submitted to the MapReduce framework, where the job is scheduled. The scheduler will distribute the Maps to the data (or as close to it as it can get). The number of Maps that run depends on the number of chunks of data available, where each chunk is normally 64 or 128 MB in size. The number of Reduces that run is based on configuration settings.
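The map-count arithmetic is simple back-of-the-envelope math. As a rough sketch (the real split calculation has more knobs, so treat this as an approximation):

```java
// Rough sketch: the number of map tasks approximately tracks the number
// of input chunks (splits), each typically 64 or 128 MB.
public class SplitCount {
    static long mapTasks(long inputBytes, long chunkBytes) {
        return (inputBytes + chunkBytes - 1) / chunkBytes; // ceiling division
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        long chunk = 64L * 1024 * 1024; // 64 MB chunks
        System.out.println(mapTasks(oneGiB, chunk)); // 16 maps for 1 GiB of input
    }
}
```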
The mappers run in parallel on the local data, process it and then store the intermediate results to local disk. When the mappers have finished, the Shuffle/Sort occurs, sending the data from each mapper's local intermediate store to the assigned reducer (based on key) to complete the aggregation, which is first written locally again before being sent to HDFS.
Wait! There's More....
The Map() and Reduce() functions are the minimum requirements for a MapReduce job. There are a lot of other possibilities:
- Combiners: An optional efficiency function that can be specified to limit the amount of data sent to the Shuffle/Sort by acting as a map-local reducer. Using the html tag example, each map would aggregate its own html tag counts prior to the Shuffle/Sort, limiting the amount of data moved over the network to the reducer.
- Partitioner: Allows you to take control of the shuffle. The default behavior calculates a hash of the key to determine which subset or partition the intermediate key belongs to.
- Sort: Allows you to control how the data is sorted.
- RecordReaders/Writers: Allow you to customize the slice of data or work that is provided to the mapper, as well as how data is written out by the reducer.
- Input/OutputFormatters: Define the types of files that are consumed or produced.
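The combiner is the easiest of these to picture with our tag count. Here is a map-local sketch of one (illustrative plain Java, not the Hadoop Combiner API): it pre-sums the (tag, 1) pairs a single mapper emits, so fewer pairs cross the network during Shuffle/Sort.

```java
import java.util.*;

// Illustrative sketch of a combiner: a map-local reduce that pre-sums
// one mapper's (tag, 1) pairs before they are shuffled to the reducer.
public class TagCombiner {
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> local = new TreeMap<>();
        for (var p : pairs) {
            local.merge(p.getKey(), p.getValue(), Integer::sum); // sum per tag
        }
        return new ArrayList<>(local.entrySet());
    }

    public static void main(String[] args) {
        var pairs = List.of(Map.entry("p", 1), Map.entry("p", 1), Map.entry("div", 1));
        System.out.println(combine(pairs)); // [div=1, p=2]
    }
}
```

Three pairs in, two out — across millions of pages that difference is real network traffic saved, which is exactly why combiners exist.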
I hope this brief introduction to the world of MapReduce was useful in furthering your understanding of what it is and isn't. I believe it's important to understand the technology and its capabilities before you start tackling the technical/coding details, especially if you are going to be both effective and productive. In future blog posts we will walk through building both a Java MapReduce application and a C# MapReduce streaming application.
Till next time!