I recently learned how to perform joins with MapReduce. Generally you won't have to do this yourself, since tools such as Hive and Pig can do it much more easily, but it's still a cool idea, so I'll discuss the overall concept here. Let's take the following two tables, each of which may span one or more files. They contain information about employees at a company.
Employees (PersonID / Name / Age)

100 Bobby 24
101 Charles 54
102 Jenny 23
103 Oswald 41
104 Cindy 30
Pets (PersonID / Pet type / Pet name)

100 Dog Knuckles
101 Snake Jenny
103 Cat Uncle Naptime
102 Bird Mitzy
102 Bird Bessy
103 Cat Sir Borington
100 Dog Chuckles
We want to join these two tables to associate a person's name with their pets' names, so we need to join on the PersonID. There are generally two ways to perform a join in MapReduce: one uses both a mapper and a reducer, while the other uses only a mapper. Both have their pros and cons.
Mapper and Reducer
So our mappers will read in all the data from both tables and emit each record using the PersonID as the key, regardless of which table they happen to be processing. The combined results from all mappers will look like this:
Key | Value
100 | 100 Bobby 24
101 | 101 Charles 54
102 | 102 Jenny 23
103 | 103 Oswald 41
104 | 104 Cindy 30
100 | 100 Dog Knuckles
101 | 101 Snake Jenny
103 | 103 Cat Uncle Naptime
102 | 102 Bird Mitzy
102 | 102 Bird Bessy
103 | 103 Cat Sir Borington
100 | 100 Dog Chuckles
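As a rough sketch, such a mapper might look like this in Java (the class name is my own, and it assumes whitespace-delimited lines with the PersonID in the first column):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: reads lines from either table and re-keys them
// by PersonID, which both tables store in their first column.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        String personId = record.split("\\s+")[0];
        // Emit the whole record; the reducer will figure out whether it is
        // an employee row or a pet row.
        context.write(new Text(personId), new Text(record));
    }
}
```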
Obviously, you can and should filter out unneeded information in the mapper; I'm just not doing it here. Then all the values for each key go off to separate reducers. If we look at the reducer responsible for key 100, it will receive these values:
100 Bobby 24
100 Dog Knuckles
100 Dog Chuckles
This reducer now has Bobby's record, including his age, and all the other records are his pets. It can output:
Bobby Knuckles
Bobby Chuckles
… or whatever output you want. You can perform any kind of join this way; it's all a matter of how you filter. In the above example, one reducer will get the key 104, which corresponds to Cindy, who doesn't have any pets. So that reducer gets the single value "104 Cindy 30". If we want an inner join, that reducer can emit nothing. If we want an outer join or a left join, it could emit "Cindy null" or some such. You really have a lot of flexibility.
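Here's a minimal sketch of such a reducer, assuming the unfiltered mapper output above. It tells employee rows apart from pet rows by their shape ("id name age", with a numeric age) and emits "name null" for petless employees, making it a left outer join; dropping that branch turns it into an inner join:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical join reducer: receives every record for one PersonID.
public class JoinReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String employeeName = null;
        List<String> petNames = new ArrayList<>();

        for (Text value : values) {
            String[] tokens = value.toString().split("\\s+");
            // Employee rows are exactly "id name age"; pet rows are
            // "id type name..." where the pet's name may contain spaces.
            if (tokens.length == 3 && tokens[2].matches("\\d+")) {
                employeeName = tokens[1];
            } else {
                petNames.add(String.join(" ",
                        Arrays.copyOfRange(tokens, 2, tokens.length)));
            }
        }

        if (employeeName == null) {
            return; // pet rows with no matching employee: dropped
        }
        if (petNames.isEmpty()) {
            // Left outer join: keep petless employees like Cindy.
            // Emit nothing here instead if you want an inner join.
            context.write(new Text(employeeName), new Text("null"));
        }
        for (String petName : petNames) {
            context.write(new Text(employeeName), new Text(petName));
        }
    }
}
```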
Secondary sorting
One downside to the above example is that the line containing the person's name may not be the first value in the list, which forces the reducer to examine every value to determine whether it holds the person's information or more pet information. However, Hadoop supports secondary sorting, which lets you sort composite keys differently than you partition them. So, in the above example, we could have the mappers build a composite key that carries a little more information, like so:
Key | Value
100#Emp | 100 Bobby 24
101#Emp | 101 Charles 54
102#Emp | 102 Jenny 23
103#Emp | 103 Oswald 41
104#Emp | 104 Cindy 30
100#Pet | 100 Dog Knuckles
101#Pet | 101 Snake Jenny
103#Pet | 103 Cat Uncle Naptime
102#Pet | 102 Bird Mitzy
102#Pet | 102 Bird Bessy
103#Pet | 103 Cat Sir Borington
100#Pet | 100 Dog Chuckles
Then the job can be configured to partition the keys based on the PersonID before the '#' but sort on the entire key. This way, each reducer still gets all the values for a given PersonID, but the employee record will always come first, since "#Emp" sorts before "#Pet". Of course, secondary sorting has much, much more potential than this, but like I said, this is an incredibly simple overview of basic MapReduce joining.
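In Hadoop terms, "partition on the PersonID but sort on the whole key" means a custom Partitioner plus a grouping comparator. A rough sketch, assuming Text keys of the form "personId#tag" (class names are made up):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: route keys by the PersonID prefix so every
// record for one person reaches the same reducer. Hadoop's default sort
// still compares the full key, which puts "#Emp" ahead of "#Pet".
public class PersonIdPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String personId = key.toString().split("#")[0];
        return (personId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Hypothetical grouping comparator: treat keys with the same PersonID as
// equal so a single reduce() call sees the employee row first, then pets.
class PersonIdGroupingComparator extends WritableComparator {
    protected PersonIdGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String idA = a.toString().split("#")[0];
        String idB = b.toString().split("#")[0];
        return idA.compareTo(idB);
    }
}
```

The driver would wire these in with `job.setPartitionerClass(PersonIdPartitioner.class)` and `job.setGroupingComparatorClass(PersonIdGroupingComparator.class)`.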
Map Only
One major downside to the Mapper and Reducer method is that it creates a lot of intermediate data. In fact, there is a line of intermediate data for every line read from the input files, which results in considerable network traffic within your Hadoop cluster. So one other way to do a join is with mappers only. The only requirement is that every mapper have a complete copy of one table or the other. This is implausible if both tables are massive, but ideal when one is massive and the other is rather small. In our running example, let's assume that every mapper has a copy of the Employees table. Then, as each mapper runs through its portion of the Pets table, it can check each incoming record for a match in the Employees table and emit any matches. This can be much faster and consumes minimal network traffic, but again, it requires that every mapper hold a full copy of one of the tables.
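Here's a sketch of such a map-side join, assuming the driver ships the small Employees file to every node with `job.addCacheFile(...)` and makes the job map-only with `job.setNumReduceTasks(0)` (file and class names are made up):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map-side join: load the small Employees table into memory
// once per mapper, then stream this mapper's split of the big Pets table
// and emit joined rows directly. No shuffle, no reducers.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> nameById = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumes the driver called job.addCacheFile(new URI(".../employees.txt"));
        // cached files are symlinked into the task's working directory.
        URI[] cacheFiles = context.getCacheFiles();
        String localName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] tokens = line.split("\\s+"); // "id name age"
                nameById.put(tokens[0], tokens[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().split("\\s+"); // "id type name..."
        String employeeName = nameById.get(tokens[0]);
        if (employeeName != null) { // inner join: unmatched pets are dropped
            String petName = String.join(" ",
                    Arrays.copyOfRange(tokens, 2, tokens.length));
            context.write(new Text(employeeName), new Text(petName));
        }
    }
}
```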