Introduction To MapReduce

Sometimes datasets are a little larger than we can comfortably process on a laptop. In these cases it’s often helpful to harness the power of many machines to process the data.

MapReduce

MapReduce is a programming paradigm invented by Google to make it easier to write distributed software using common programming constructs.

In Python it’s not unusual to ‘map’ a function across a list (note that in Python 3, map returns a lazy iterator, so we wrap it in list() to see the results):

>>> some_list = ['Hi my', 'name', 'is', 'Stephen Holiday']
>>> list(map(lambda x: x.swapcase(), some_list))
['hI MY', 'NAME', 'IS', 'sTEPHEN hOLIDAY']

Similarly you can reduce a list by folding each element into an accumulator until a single value remains (in Python 3, reduce has moved into the functools module):

>>> from functools import reduce
>>> reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])
15

It turns out that these two concepts make it very easy for distributed systems to break up tasks. Since each map call requires only the current item in the list, the above map operation could have happened on four different machines.
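To make that concrete, here’s a minimal local sketch of that idea. The split_into_chunks helper and the four-way split are made up for illustration; they stand in for what a real framework does when it ships portions of the list off to separate machines.

from itertools import chain

def split_into_chunks(items, n):
    # Deal the items round-robin into n chunks, one per hypothetical machine.
    chunks = [[] for _ in range(n)]
    for i, item in enumerate(items):
        chunks[i % n].append(item)
    return chunks

some_list = ['Hi my', 'name', 'is', 'Stephen Holiday']
chunks = split_into_chunks(some_list, 4)

# Each chunk could be mapped on a different machine; here we just loop.
mapped_chunks = [list(map(lambda x: x.swapcase(), chunk)) for chunk in chunks]
result = list(chain.from_iterable(mapped_chunks))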

Contrived Example

Let’s say we have a list of a million numbers that we want to sum together. Unfortunately they aren’t stored as integers but as strings. Let’s also say that converting these strings to numbers is a computationally intensive task.

Thankfully we have 4 machines that can do our work for us. In order to harness the power of the machines, we need to somehow split up the task.

Converting a string to an integer does not require knowledge of the other strings that we wish to convert; it’s an isolated operation. This means we can split up the list.

We can also note that summing the integers does not actually require knowing about all of the integers at once. Since addition is commutative and associative, we can apply the summing operation in any order and grouping we want and still get the right result.

  1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10
= (1+2) + (3+(4+5)) + 6 + ((7+8) + (9+10))
etc...
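You can sanity-check that regrouping in Python; both sides come out to 55:

assert sum(range(1, 11)) == (1+2) + (3+(4+5)) + 6 + ((7+8) + (9+10))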

We could use a worker queue or something similar, but that really requires a lot of customization. Using a MapReduce infrastructure, we can easily deploy new jobs to the cluster.

Map

So our map function could be:

def map_convert_to_int(input_string):
    # The (supposedly expensive) per-item work: parse one string into an int.
    return int(input_string)

Now, let’s say we had some distributed MapReduce framework (cough Hadoop); it could break up the list and send a portion of the list to each machine.

For simplicity, we’re only going to use the numbers 1-10, but this can easily be expanded.

input_list = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
machine_1_list = ['2', '1', '7']
machine_2_list = ['10', '6']
machine_3_list = ['3', '5']
machine_4_list = ['9', '4', '8']

So as you can see, each machine gets some subset of the list. The machine doesn’t care which items it gets, in what order, or how many, since each map operation is independent.

Once each machine gets its list (or as soon as it receives its first item), it can start running the map function. Each machine now has a list of integers.
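Simulating that step locally, with each line standing in for work a separate machine would do in parallel:

machine_1_ints = [map_convert_to_int(s) for s in machine_1_list]  # [2, 1, 7]
machine_2_ints = [map_convert_to_int(s) for s in machine_2_list]  # [10, 6]
machine_3_ints = [map_convert_to_int(s) for s in machine_3_list]  # [3, 5]
machine_4_ints = [map_convert_to_int(s) for s in machine_4_list]  # [9, 4, 8]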

Reduce

Now we need to sum up the numbers.

Our reducer could look like this:

def reduce_sum(accumulator, new_value):
    # Fold one more value into the running total.
    return accumulator + new_value

This would be run across the list, carrying an accumulator value from call to call. Since addition is commutative and associative, it doesn’t matter over which subsets, or in what order, we run the reducer.
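For example, on a single machine’s shard (machine_1_ints comes from the map step above; remember that in Python 3, reduce lives in functools):

from functools import reduce

machine_1_sum = reduce(reduce_sum, machine_1_ints)  # 2 + 1 + 7 == 10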

So, each machine could run the reducer on its own list currently stored in memory and return the single sum. Then the master machine that started the job could just run the reducer on the list of results it got back.
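Putting the whole contrived job together as one local sketch, each partial sum standing in for a result sent back to the master:

partial_sums = [
    reduce(reduce_sum, machine_1_ints),  # 10
    reduce(reduce_sum, machine_2_ints),  # 16
    reduce(reduce_sum, machine_3_ints),  # 8
    reduce(reduce_sum, machine_4_ints),  # 21
]
total = reduce(reduce_sum, partial_sums)  # 55, the master's final reduce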

Tada! Simple distributed programming.

Conclusion

This article covers the basics of MapReduce. Different implementations have different additional features, but the basics are still there.

This article is just an introduction; later I will write more articles on practical uses of MapReduce.

I hope this was interesting to you; let me know what you think.