Category Archives: Big Data

Manual Joins in Hadoop

I recently learned how to perform joins with MapReduce. You usually won’t have to do this yourself, since tools such as Hive or Pig can do it much more easily, but it’s still a cool idea, so I’ll discuss the overall concept here. Let’s take the following two tables, each of which exists across one or more files. They contain information about employees at a company.

Employees (PersonID / name / Age)

100   Bobby   24
101   Charles 54
102   Jenny   23
103   Oswald  41
104   Cindy   30

Pets (PersonID / Pet type / Pet name)

100   Dog   Knuckles
101   Snake Jenny
103   Cat   Uncle Naptime
102   Bird  Mitzy
102   Bird  Bessy
103   Cat   Sir Borington
100   Dog   Chuckles

We want to join these two tables to associate a person’s name with their pet’s name. So we need to perform a join using the PersonID. There are generally two ways to perform a join in map reduce: one uses both a mapper and a reducer while the other just uses a mapper. Both ways have their pros and cons.

Mapper and Reducer

So our mappers will read in all the data from both tables and spit out results using the PersonID as the key, regardless of which table a given mapper happens to be processing. The combined results from all mappers will look like this:

Key | Value
100 | 100   Bobby   24
101 | 101   Charles 54
102 | 102   Jenny   23
103 | 103   Oswald  41
104 | 104   Cindy   30
100 | 100  Dog   Knuckles
101 | 101  Snake Jenny
103 | 103  Cat   Uncle Naptime
102 | 102  Bird  Mitzy
102 | 102  Bird  Bessy
103 | 103  Cat   Sir Borington
100 | 100  Dog   Chuckles

Obviously, you can and should filter out unneeded information; I’m just not doing it here. All the values for each key then go off to their own reducer. If we look at the reducer responsible for key 100, it will have values like so:

100  Bobby   24
100 Dog   Knuckles
100 Dog   Chuckles

This reducer now has the record of Bobby and his age. And all other records are his pets. It can output:

Bobby Knuckles
Bobby Chuckles

… Or whatever output you want. You can easily perform any kind of join this way; it’s all a matter of filtering it how you want. In the above example, one reducer will get the key 104, which corresponds to Cindy, who doesn’t have any pets, so that reducer gets the single value “104   Cindy   30“. If we want to perform an inner join, that reducer can emit nothing. If we want an outer join or a left join, it could emit “Cindy  null” or some such. You really have a lot of flexibility.
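
To make that concrete, here is a rough sketch of what the mapper and reducer might look like in Java. The class names are my own, I’m assuming both tables are tab-delimited text files, and the reducer tells the two record types apart with a crude check on the third field (a numeric age means an Employees row), which only works for this toy data:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

    // Mapper: key every record (from either file) by its PersonID.
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            ctx.write(new Text(fields[0]), line);   // fields[0] is PersonID in both tables
        }
    }

    // Reducer: receives the employee record and all pet records for one PersonID.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text personId, Iterable<Text> records, Context ctx)
                throws IOException, InterruptedException {
            String employeeName = null;
            List<String> petNames = new ArrayList<>();

            for (Text record : records) {
                String[] fields = record.toString().split("\t");
                // Crude tag check for this toy data: the third field of an
                // Employees row is a numeric age; a Pets row's is the pet's name.
                if (fields[2].matches("\\d+")) {
                    employeeName = fields[1];
                } else {
                    petNames.add(fields[2]);
                }
            }

            if (employeeName != null) {             // inner join: skip unmatched records
                for (String pet : petNames) {
                    ctx.write(new Text(employeeName), new Text(pet));
                }
            }
        }
    }
}

The composite keys in the next section (“100#Emp” vs. “100#Pet”) are the cleaner way to tell the two record types apart.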

Secondary sorting

One downside to the above example is that the line containing the person’s name may not be the first value in the list, which forces the reducer to examine each value to determine whether it contains the person’s information or more pet information. However, Hadoop supports secondary sorting, which means you can sort composite keys differently than you partition and group them. So, for instance, in the above example, we could have the mappers create a composite key that includes a little bit more information, like so:

Key | Value
100#Emp | 100   Bobby   24
101#Emp | 101   Charles 54
102#Emp | 102   Jenny   23
103#Emp | 103   Oswald  41
104#Emp | 104   Cindy   30
100#Pet | 100  Dog   Knuckles
101#Pet | 101  Snake Jenny
103#Pet | 103  Cat   Uncle Naptime
102#Pet | 102  Bird  Mitzy
102#Pet | 102  Bird  Bessy
103#Pet | 103  Cat   Sir Borington
100#Pet | 100  Dog   Chuckles

The job can then be configured to partition and group the keys based on the number before the ‘#’ but sort on the entire key. This way, each reducer still gets all the values for a given PersonID, but the employee record will always come first, since “#Emp” sorts before “#Pet”. Of course, secondary sorting has much, much more potential than this. But like I said, this is an incredibly simple coverage of basic MapReduce joining.
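
As a rough sketch of the plumbing involved (the class names here are hypothetical, and the keys are just the “100#Emp”-style Text keys from the table above), the custom partitioner and grouping comparator might look something like this:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the PersonID portion only, so "100#Emp" and "100#Pet"
// land on the same reducer.
public class PersonIdPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String personId = key.toString().split("#")[0];
        return (personId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the PersonID portion as well, so one reduce() call sees both the
// employee record and the pet records. The full key still controls sort order,
// which is what puts "#Emp" ahead of "#Pet".
class PersonIdGroupingComparator extends WritableComparator {
    protected PersonIdGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String idA = a.toString().split("#")[0];
        String idB = b.toString().split("#")[0];
        return idA.compareTo(idB);
    }
}

In the driver, you would wire these in with job.setPartitionerClass(PersonIdPartitioner.class) and job.setGroupingComparatorClass(PersonIdGroupingComparator.class); the default sort on the full key then takes care of ordering “#Emp” before “#Pet”.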

Map Only

One major downside to the mapper-and-reducer method is that it creates a lot of intermediate data; in fact, there is a line of intermediate data for every line read from the files. This results in considerable network traffic within your Hadoop cluster. So one other way to do a join is with mappers only. The only requirement is that every mapper have a complete copy of one table or the other. That’s a tall order if both tables are massive, but ideal if one is massive and the other is fairly small. In our running example, let’s assume that every mapper has a copy of the Employees table. Then, as each mapper runs through its portion of the Pets table, it can check each incoming entry for a match in the Employees table and emit any matches. This can be much faster and consumes minimal network traffic, but again, it requires that every mapper hold a full copy of one of the tables.
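
Here is a rough sketch of that idea, assuming the small Employees file has been shipped to every node (for example via the distributed cache) under the hypothetical name employees.txt:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: every mapper loads the small Employees table into memory,
// then streams through its split of the Pets table and joins as it goes.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> nameById = new HashMap<>();

    @Override
    protected void setup(Context ctx) throws IOException {
        // "employees.txt" is assumed to have been distributed to every task's
        // working directory (e.g. via the job's cache files).
        try (BufferedReader reader = new BufferedReader(new FileReader("employees.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");   // PersonID, name, age
                nameById.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text petLine, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = petLine.toString().split("\t"); // PersonID, pet type, pet name
        String name = nameById.get(fields[0]);
        if (name != null) {                               // inner join; no reducer needed
            ctx.write(new Text(name), new Text(fields[2]));
        }
    }
}

In the driver, you would call job.addCacheFile(...) to ship the Employees file and job.setNumReduceTasks(0) to make it a map-only job.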

Combiners: The Optional Step to MapReduce

Most of us know that Hadoop MapReduce is made up of mappers and reducers. A map task runs on a task tracker, and then all the values for each key are collected from the mappers and sent to another task tracker for reducing, with all of a given key’s values going to the same reduce task. What slightly fewer of us know about are combiners. A combiner is an optimization that can run after mapping but before the data is partitioned out to other machines by key. Combiners often perform the exact same function as the reducer, but only on the subset of data produced by one mapper. This gives the task tracker an opportunity to shrink the intermediate data it must send along to the reducers.

For instance, take the ubiquitous word count example. Two mappers may produce results like this:

Mapper A      Mapper B
X - 1         X - 1
Y - 1         X - 1
Z - 1         Z - 1
X - 1         Y - 1
              Y - 1

All those key-value pairs will need to be passed to the reducers to tabulate the values. But suppose the reducer is also used as a combiner (which is quite often the case), and suppose it gets called on both sets of results before they’re passed along:

Mapper A      Mapper B
X - 2         X - 2
Y - 1         Z - 1
Z - 1         Y - 2

The traffic load has been reduced. Now all that’s left to do is call the reducers on the keys across all map results to produce:

X - 4
Z - 2
Y - 3

An important point to keep in mind is that the combiner is not always called, even when you assign one. The mapper will generally only call the combiner if the intermediate data it’s producing is getting large, perhaps to the point that it must be written to disk before it can be sent. That’s why it’s important to make sure that the combiner does not change the inherent form of the data it processes: it must produce the same sort of content that it reads in. In the above example, the combiners read (word – sum) pairs and wrote out (word – sum) pairs.
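
In code, the combiner is literally one extra line in the job setup. Here is a minimal word count job, essentially Hadoop’s canonical example, with the reducer doing double duty as the combiner:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                ctx.write(new Text(tokens.nextToken()), ONE);   // emit (word, 1)
            }
        }
    }

    // Sums counts per word. Its output (word, sum) has the same shape as its
    // input (word, count), which is what makes it safe to reuse as a combiner.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            ctx.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // the optional step: reducer doubles as combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}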

Speculative Execution: Proceed with Caution (or Not at All)

When a job tracker receives a MapReduce job, it divvies out tasks to several task trackers in order to complete the job. If any of those tasks fails for whatever reason (perhaps it threw an exception), then it’s up to the job tracker to restart the task on another slave. This process can occur up to three times before the job tracker gives up. But what happens if a task doesn’t fail, but doesn’t succeed either? What if it just hangs? Perhaps that map task received an extra large or extra tough block to work with. Maybe some other application is running on that task tracker and hogging the entire CPU. Maybe the task tracker has entered an infinite loop. In any case, the task tracker continues to check in from time to time, which prevents it from being killed outright, but it just isn’t finishing. The job tracker can’t possibly know why this task tracker is taking so long, nor can it know when or if it will finish. What does the job tracker do?

Speculative Execution!

Without shutting down the first attempt, the job tracker goes to another task tracker and hands it the same task. Then it’s a race: whoever finishes first gets to submit its results, and the other is killed (a most cutthroat race). That’s it.

Speculative execution isn’t always appropriate. In fact, some people recommend that you disable it for reduce jobs entirely. Why? Continue reading Speculative Execution: Proceed with Caution (or Not at All)
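
If you do decide to disable it, it only takes a couple of configuration properties. This sketch uses the newer mapreduce.* property names (older releases used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution), and the class around them is just scaffolding:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationJob {
    public static Job create() throws IOException {
        Configuration conf = new Configuration();
        // Turn speculation off for this job; the same keys can also be set
        // cluster-wide in mapred-site.xml.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        return Job.getInstance(conf, "no-speculation job");
    }
}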

Traversing Graphs with MapReduce

Hadoop can be used to perform breadth-first searches through graphs. One way is with a series of MapReduce jobs, where each job expands another layer of the breadth-first search. Here is a very high-level explanation of what I mean. Suppose we have the simple graph:

 E <-- C <-- F
 ^     ^     ^
 |     |     |
 A --> B --> D

This data would likely be represented in our Hadoop cluster as a list of connections, like: Continue reading Traversing Graphs with MapReduce
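
To give a flavor of what one of those per-layer jobs can look like, here is a rough sketch of my own (not the original post’s code). It assumes each line holds a node, its comma-separated neighbors, and its current distance from the start node, with Integer.MAX_VALUE meaning “not reached yet”; running the job repeatedly pushes the frontier out one more hop per pass:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BfsLayer {

    // Input line format (assumed): node \t comma-separated-neighbors \t distance
    public static class FrontierMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            String node = parts[0], neighbors = parts[1];
            int distance = Integer.parseInt(parts[2]);

            // Pass the node's own record through so its adjacency list survives.
            ctx.write(new Text(node), new Text("NODE\t" + neighbors + "\t" + distance));

            // If this node has been reached, offer distance + 1 to each neighbor.
            if (distance != Integer.MAX_VALUE) {
                for (String neighbor : neighbors.split(",")) {
                    ctx.write(new Text(neighbor), new Text("DIST\t" + (distance + 1)));
                }
            }
        }
    }

    // Reducer: keep the adjacency list and the smallest distance offered.
    public static class FrontierReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text node, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String neighbors = "";
            int best = Integer.MAX_VALUE;
            for (Text value : values) {
                String[] parts = value.toString().split("\t");
                if (parts[0].equals("NODE")) {
                    neighbors = parts[1];
                    best = Math.min(best, Integer.parseInt(parts[2]));
                } else {
                    best = Math.min(best, Integer.parseInt(parts[1]));
                }
            }
            ctx.write(node, new Text(neighbors + "\t" + best));
        }
    }
}

The reducer writes out the same node / neighbors / distance layout it reads, so the output of one pass can be fed straight into the next.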

For the Data Scientists: 5 Upcoming Big Data Conferences You Shouldn’t Miss

Big data is a big deal right now, and it’s only going to become a bigger deal in the future, so it makes sense to learn about as many of its aspects as you can, as quickly as you can. Or pick one and learn it very well. Or don’t pick any, if you are a staunch believer in the shelf-life of traditional data warehouses. From a machine learning deep-dive to an open-source buffet, the following five conferences provide educational and networking opportunities for both the specialists and renaissance persons among you. Attending a cool one I’ve missed? Let me know in the comments!

Boston SharePoint Salon: Go Big or Go Home

Big as in data, home as in “out of business.” Because there’s only going to be more data, and people are finally realizing that not only can it be sliced and diced and visualized in formats comprehensible to the business analyst—it needs to be. The questions are: how should it be stored and queried and where should the visible representations of these queries be displayed?

Hadoop, Apache’s open-source distributed computing and storage framework based on Google’s MapReduce model, is one answer to the first question. Or you could buy a supercomputer, but those are kind of expensive! And less fun to say! As for the second question, of course the answer depends on the type of data. As this is a SharePoint-focused Salon, though, I’m going to nominate SharePoint as one potential answer. Why? Well, Microsoft’s new Big Data Solution will put enterprise Hadoop solutions on Azure and Windows Server, including the now-available SQL Server Connector, which lets you transfer data between Hadoop and SQL Server. So, if you plan on upgrading to SQL Server 2012, you’ll be able to access data stored in Hadoop from SharePoint, and do all your slicing and dicing and displaying in PowerPivot and Power View. Presumably.

Interesting, no? We think so. If you agree, please join us at Tico (Berklee Street) this Thursday, from 7 to about 9:30 pm. You can RSVP here, or email me! And if you can’t make it, but know someone whom you think should attend, please spread the word!

30 Hadoop and Big Data Spelunkers Worth Following

Understanding the basic purpose of Hadoop is easy: it offers a way to quickly store, process and extract deliverable meaning(s) from vast datasets. It does this by breaking the datasets up into commodity-server-sized chunks, replicating those chunks to guard against failure, and sending them out to a connected web (cluster) of commodity servers (nodes). Understanding how it integrates with the current big data landscape, and how it may integrate with the future one, is a little harder; for that, I’ve turned to the experts. Luckily for me, and for you if you’re in my boat, many of them maintain active Twitter and blogging presences. Even more luckily, the quality and clarity of the writing is really, really high. The following list is by no means exhaustive, but poking into the thoughts of even a few can elucidate everything from machine learning to data modeling and distributed systems.

1.) Hilary Mason

Continue reading 30 Hadoop and Big Data Spelunkers Worth Following