Traversing Graphs with MapReduce

Hadoop can be used to perform breadth-first searches through graphs. One such way is done through a series of mapreduce jobs where each mapreduce is another layer of the breadth first search. Here is a very high-level explanation of what I mean. Suppose we have the simple graph:

 E <-- C <-- F
 ^     ^     ^
 |     |     |
 A --> B --> D

This data would likely be represented in our Hadoop cluster as list of connections, like:

Node	Connections	Distance
A	B, E	0
B	C, D	x
C	E	x
D	F	x
E		x
F	C	x

The only distance we are sure of in the beginning is that the distance between ‘A’ and ‘A’ is zero. All other points we’ll put as x, for unknown. So in the first pass, we feed the table into our mappers and we produce a key-value pair for each node and connection. So for instance, when a mapper reads in the line for Node ‘A’, it will emit three pairs:

B, E

becomes

A	B, E	0
B	_	1
E	_	1

The first effectively reiterates what is already known about ‘A.’ The second two are new information about ‘B’ and ‘E.’ Since they are connected to ‘A,’ their distance becomes the distance to ‘A’ plus 1, which makes them 1 (0 + 1 = 1). That single line we’re parsing does not contain any information about what ‘B’ and ‘E’ are connected to, so we leave those fields empty. Once the mappers process all the lines, they’ll produce an output like this (but not necessarily in this order):

Mapper Output

A	B, E	0
B	_	1
E	_	1
B	C, D	x
C	_	x
D	_	x
C	E	x
E	_	x
D	F	x
F	_	x
E		x
F	C	x
C	_	x

Whenever we process a node we don’t have the distance to, we won’t know the distance to any of its connected nodes either, so we’ll leave them blank for now. Now we separate out that data by node and send it to the reducers. The reducers will strip out all redundant information and reduce all values for a node back down to a single entry using the smallest known value for “distance” known. So the node that processes ‘B’ will get:

B	_	1
B	C, D	x

which will then be reduced to:

C, D

Once all this data has been reduced and recombined, the resulting table will look like this:

Reducer Output

Node	Connections	Distance
A	B, E	0
B	C, D	1
C	E	x
D	F	x
E		1
F	C	x

We have now completed a single iteration. We know the distance for all nodes within one jump from ‘A.’ Not too impressive, but every iteration we do will increase our knowledge to the next depth of the tree. We can keep doing this up until the graph stops changing. Obviously there is massive room for improvement, such as not needing to re-expand a node if we’ve already expanded it once and its distance has not changed. Obviously as you start to do more complicated calculations than just distance in an unweighted graph, you’ll need to consider possible scenarios where this strategy needs to be adjusted. But this example gets the idea across.

Shotgun SharePoint Episode 38: Elephants on Parade

For the Data Scientists: 5 Upcoming Big Data Conferences You Shouldn't Miss

Mo' Data, Mo' Problems, E03: DynamoDB vs. HBase

SoftArtisans

Traversing Graphs with MapReduce

Related posts:

Blogged