MapReduce: Simplified Data Processing on Large Clusters - Google, 2004

[Paper] [Mirror]
[Slides]

Since I’ve used Hadoop for a while, I’m concentrating on the interesting points from Google’s implementation.

“MapReduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault-tolerance”

I believe this is emblematic of a lot of Google worker. Restricting the API to give greater guarantees.

Intermediate keys and values must be the same type as the output keys and values. This is not the case with the current Hadoop implementation.
The authors also mention that MapReduce could be applied to a shared memory system or NUMA multi-processor system
Notes on Execution
- Intermediate values are written to local disk periodically, partitioned by key
  - It sends these locations to the master
- Reducer learns about the locations from the master and reads the appropriate partitions via RPC directly from the Mapper
- Once all of the data is received, the reducer sorts the data, possibly with an external sort
  - I wonder if some speculative partial reduction work could be done or some sort of heap style sort that could start immediately
  - However, it’s probably better use of the machine to have it running other work instead
- Reducer output goes to GFS
  - Once the reducer finishes, it atomically renames the output file
Master does not currently checkpoint, the user must rerun the job
- The authors say it would be easy
At the time, network bandwidth was scarce
- They had the cluster scheduler assign map tasks on the same machines that held the data in GFS or very close to the same machine
- Now that Google has a better network fabric, I wonder how it changes things
They have the ability to skip bad records
- When an error occurs (i.e. segfault), they have a handler that sends a quick UDP packet to the master with the offset
- If the master receives this message from two workers, when it reschedules the task, it will notify the worker to skip the record
- This is super cool!
  - I didn’t know this existed, but it does!
  - Interestingly, the JavaDoc steals directly from the paper in the description
They recognize the benefit of having a local version where the user can debug with GDB
Counter values are piggybacked on the heartbeat response

MapReduce: Simplified Data Processing on Large Clusters - Google, 2004

Other notes about distributed computation