I recently learned how to perform joins with MapReduce. Generally you won't have to do this yourself, since tools such as Hive and Pig can do it much more easily, but it's still a cool idea, so I'll discuss the overall concept here. Let's take the following two tables, each of which may span one or more files. They contain information about employees at a company.
Employees (PersonID / Name / Age)

100 Bobby 24
101 Charles 54
102 Jenny 23
103 Oswald 41
104 Cindy 30
Pets (PersonID / Pet type / Pet name)

100 Dog Knuckles
101 Snake Jenny
103 Cat Uncle Naptime
102 Bird Mitzy
102 Bird Bessy
103 Cat Sir Borington
100 Dog Chuckles
We want to join these two tables to associate a person's name with their pets' names, so we need to join on the PersonID. There are generally two ways to perform a join in MapReduce: one uses both a mapper and a reducer, while the other uses only a mapper. Both have their pros and cons.
Mapper and Reducer
So our mappers will read in all the data from both tables and emit each record using the PersonID as the key, regardless of which table they happen to be processing. The combined results from all mappers will look like this:
Key | Value
100 | 100 Bobby 24
101 | 101 Charles 54
102 | 102 Jenny 23
103 | 103 Oswald 41
104 | 104 Cindy 30
100 | 100 Dog Knuckles
101 | 101 Snake Jenny
103 | 103 Cat Uncle Naptime
102 | 102 Bird Mitzy
102 | 102 Bird Bessy
103 | 103 Cat Sir Borington
100 | 100 Dog Chuckles
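As a rough sketch, such a mapper might look like this in Java (the class name is my own, and it assumes whitespace-delimited lines with the PersonID in the first column):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: reads lines from either table and re-keys them
// by PersonID, which both tables store in their first column.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        String personId = record.split("\\s+")[0];
        // Emit the whole record; the reducer will figure out whether it is
        // an employee row or a pet row.
        context.write(new Text(personId), new Text(record));
    }
}
```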
Obviously, you can and should filter out unneeded information in the mapper; I'm just not doing it here. Then all the values for each key go off to separate reducers. If we look at the reducer responsible for key 100, it will receive these values:
100 Bobby 24
100 Dog Knuckles
100 Dog Chuckles
This reducer now has Bobby's record, including his age, and all the other records are his pets. It can output:
Bobby Knuckles
Bobby Chuckles
… or whatever output you want. You can perform any kind of join this way; it's all a matter of how you filter. In the above example, one reducer will get the key 104, which corresponds to Cindy, who doesn't have any pets. So that reducer gets the single value "104 Cindy 30". If we want an inner join, that reducer can emit nothing. If we want an outer join or a left join, it could emit "Cindy null" or some such. You really have a lot of flexibility.
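Here's a minimal sketch of such a reducer, assuming the unfiltered mapper output above. It tells employee rows apart from pet rows by their shape ("id name age", with a numeric age) and emits "name null" for petless employees, making it a left outer join; dropping that branch turns it into an inner join:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical join reducer: receives every record for one PersonID.
public class JoinReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String employeeName = null;
        List<String> petNames = new ArrayList<>();

        for (Text value : values) {
            String[] tokens = value.toString().split("\\s+");
            // Employee rows are exactly "id name age"; pet rows are
            // "id type name..." where the pet's name may contain spaces.
            if (tokens.length == 3 && tokens[2].matches("\\d+")) {
                employeeName = tokens[1];
            } else {
                petNames.add(String.join(" ",
                        Arrays.copyOfRange(tokens, 2, tokens.length)));
            }
        }

        if (employeeName == null) {
            return; // pet rows with no matching employee: dropped
        }
        if (petNames.isEmpty()) {
            // Left outer join: keep petless employees like Cindy.
            // Emit nothing here instead if you want an inner join.
            context.write(new Text(employeeName), new Text("null"));
        }
        for (String petName : petNames) {
            context.write(new Text(employeeName), new Text(petName));
        }
    }
}
```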
Secondary sorting
One downside to the above example is that the line containing the person's name may not be the first value in the list, which forces the reducer to examine every value to determine whether it holds the person's information or more pet information. However, Hadoop supports secondary sorting, which lets you sort composite keys differently than you partition them. So, in the above example, we could have the mappers build a composite key that carries a little more information, like so:
Key | Value
100#Emp | 100 Bobby 24
101#Emp | 101 Charles 54
102#Emp | 102 Jenny 23
103#Emp | 103 Oswald 41
104#Emp | 104 Cindy 30
100#Pet | 100 Dog Knuckles
101#Pet | 101 Snake Jenny
103#Pet | 103 Cat Uncle Naptime
102#Pet | 102 Bird Mitzy
102#Pet | 102 Bird Bessy
103#Pet | 103 Cat Sir Borington
100#Pet | 100 Dog Chuckles
Then the job can be configured to partition the keys based on the PersonID before the '#' but sort on the entire key. This way, each reducer still gets all the values for a given PersonID, but the employee record will always come first, since "#Emp" sorts before "#Pet". Of course, secondary sorting has much, much more potential than this, but like I said, this is an incredibly simple overview of basic MapReduce joining.
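In Hadoop terms, "partition on the PersonID but sort on the whole key" means a custom Partitioner plus a grouping comparator. A rough sketch, assuming Text keys of the form "personId#tag" (class names are made up):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: route keys by the PersonID prefix so every
// record for one person reaches the same reducer. Hadoop's default sort
// still compares the full key, which puts "#Emp" ahead of "#Pet".
public class PersonIdPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String personId = key.toString().split("#")[0];
        return (personId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Hypothetical grouping comparator: treat keys with the same PersonID as
// equal so a single reduce() call sees the employee row first, then pets.
class PersonIdGroupingComparator extends WritableComparator {
    protected PersonIdGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String idA = a.toString().split("#")[0];
        String idB = b.toString().split("#")[0];
        return idA.compareTo(idB);
    }
}
```

The driver would wire these in with `job.setPartitionerClass(PersonIdPartitioner.class)` and `job.setGroupingComparatorClass(PersonIdGroupingComparator.class)`.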
Map Only
One major downside to the Mapper and Reducer method is that it creates a lot of intermediate data. In fact, there is a line of intermediate data for every line read from the input files, which results in considerable network traffic within your Hadoop cluster. So one other way to do a join is with mappers only. The only requirement is that every mapper have a complete copy of one table or the other. This is implausible if both tables are massive, but ideal when one is massive and the other is rather small. In our running example, let's assume that every mapper has a copy of the Employees table. Then, as each mapper runs through its portion of the Pets table, it can check each incoming record for a match in the Employees table and emit any matches. This can be much faster and consumes minimal network traffic, but again, it requires that every mapper hold a full copy of one of the tables.
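Here's a sketch of such a map-side join, assuming the driver ships the small Employees file to every node with `job.addCacheFile(...)` and makes the job map-only with `job.setNumReduceTasks(0)` (file and class names are made up):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map-side join: load the small Employees table into memory
// once per mapper, then stream this mapper's split of the big Pets table
// and emit joined rows directly. No shuffle, no reducers.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> nameById = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumes the driver called job.addCacheFile(new URI(".../employees.txt"));
        // cached files are symlinked into the task's working directory.
        URI[] cacheFiles = context.getCacheFiles();
        String localName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] tokens = line.split("\\s+"); // "id name age"
                nameById.put(tokens[0], tokens[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().split("\\s+"); // "id type name..."
        String employeeName = nameById.get(tokens[0]);
        if (employeeName != null) { // inner join: unmatched pets are dropped
            String petName = String.join(" ",
                    Arrays.copyOfRange(tokens, 2, tokens.length));
            context.write(new Text(employeeName), new Text(petName));
        }
    }
}
```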