Most of us know that Hadoop MapReduce is made up of mappers and reducers. A map task runs on a task tracker. Then all the values for each key are collected from all the mappers and sent to another task tracker for reducing, with all of a given key's values going to the same reduce task. But what slightly fewer of us know about are combiners. Combiners are an optimization that can run after mapping but before the data is shuffled to other machines based on key. Combiners often perform the exact same function as reducers, but only on the subset of the data created by one mapper. This gives the task tracker an opportunity to reduce the size of the intermediate data it must send along to the reducers.
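In the Java MapReduce API, wiring in a combiner takes a single call on the Job. Here's a minimal driver sketch, assuming the TokenizerMapper and IntSumReducer classes from the stock Hadoop WordCount example (the reducer is sketched further down); the line to notice is setCombinerClass, where we simply reuse the reducer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);   // emits (word, 1) pairs
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-reduce on each mapper's output
    job.setReducerClass(IntSumReducer.class);    // the same class computes the final sums
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```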
For instance, take the ubiquitous word count example. Two mappers might produce results like this:
| Mapper A | Mapper B |
| --- | --- |
| X - 1, Y - 1, Z - 1, X - 1 | X - 1, X - 1, Z - 1, Y - 1, Y - 1 |
All those key-value pairs will need to be passed to the reducers to tabulate the values. But suppose the reducer is also used as the combiner (which is quite often the case) and suppose it gets called on both results before they're passed along:
| Mapper A | Mapper B |
| --- | --- |
| X - 2, Y - 1, Z - 1 | X - 2, Z - 1, Y - 2 |
The traffic load has been reduced. Now all that’s left to do is call the reducers on the keys across all map results to produce:
X - 4, Z - 2, Y - 3
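In code, the reducer that doubles as a combiner could look something like this sketch, modeled on the IntSumReducer class from the stock Hadoop WordCount example. It works in both roles because it reads (word, count) pairs and writes (word, count) pairs:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for one word. As a combiner it computes partial sums
// over a single mapper's output; as a reducer it computes the final sums.
public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
```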
An important point to keep in mind is that the combiner is not always called, even when you assign one. The framework will generally only call the combiner if the intermediate data a mapper is producing is getting large, perhaps to the point that it must be spilled to disk before it can be sent, and it may run zero, one, or several times. That's why it's important to make sure that the combiner does not change the inherent form of the data it processes. It must produce the same sort of content that it reads in. In the above example, the combiners read (word, sum) pairs and wrote out (word, sum) pairs.
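As a counter-example, suppose we were computing a mean instead of a sum. A reducer like the hypothetical sketch below reads (word, count) pairs but writes (word, mean) pairs, so it must not be used as a combiner: its output is not the same sort of content as its input, and averaging partial averages doesn't give the overall average anyway.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// HYPOTHETICAL example: fine as a reducer, wrong as a combiner.
// It consumes (word, count) pairs but emits (word, mean) pairs, so a
// downstream reduce expecting IntWritable counts would receive
// DoubleWritable means -- and a mean of partial means is not the
// overall mean in any case.
public class MeanReducer
    extends Reducer<Text, IntWritable, Text, DoubleWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    long count = 0;
    for (IntWritable value : values) {
      sum += value.get();
      count++;
    }
    context.write(key, new DoubleWritable((double) sum / count));
  }
}
```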