MapReduce: Simplified Data Processing on Large Clusters - Google, 2004

Since I’ve used Hadoop for a while, I’m concentrating on the interesting points from Google’s implementation.

“MapReduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault-tolerance”

I believe this is emblematic of a lot of Google's work: restricting the API in exchange for stronger guarantees.

  • Intermediate keys and values must have the same types as the output keys and values. This is not the case in the current Hadoop implementation.
  • The authors also mention that MapReduce could be applied to a shared memory system or NUMA multi-processor system
  • Notes on Execution
    • Intermediate key/value pairs are written to local disk periodically, partitioned by key
      • The mapper sends the locations of these files to the master
    • Reducer learns about the locations from the master and reads the appropriate partitions via RPC directly from the Mapper
    • Once all of the data is received, the reducer sorts the data, possibly with an external sort
      • I wonder if some speculative partial reduction could be done, or a heap-style merge that starts as soon as data begins arriving
      • However, it’s probably better use of the machine to have it running other work instead
    • Reducer output goes to GFS
      • Once the reducer finishes, it atomically renames the output file
  • The master does not currently checkpoint its state; if it dies, the user must rerun the job
    • The authors say it would be easy
  • At the time, network bandwidth was scarce
    • They had the cluster scheduler assign map tasks on the same machines that held the data in GFS or very close to the same machine
    • Now that Google has a better network fabric, I wonder how it changes things
  • They have the ability to skip bad records
    • When an error occurs (e.g., a segfault), a signal handler sends a quick UDP packet to the master with the record's offset
    • If the master receives this message for the same record from two workers, it notifies the worker to skip that record when it reschedules the task
    • This is super cool!
      • I didn’t know this existed in Hadoop, but it does!
      • Interestingly, the JavaDoc steals directly from the paper in the description
  • They recognize the benefit of having a local, sequential execution mode where the user can debug with GDB
  • Counter values are piggybacked on the heartbeat response
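The restricted programming model boils down to two user-supplied functions, map(k1, v1) → list((k2, v2)) and reduce(k2, list(v2)) → list(v2); the framework handles everything in between. Here's a minimal in-process word-count sketch of that contract (the function names and driver are mine, not from the paper):

```python
from collections import defaultdict

def map_fn(_key, line):
    # map(k1, v1) -> list((k2, v2)): emit each word with a count of 1
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # reduce(k2, list(v2)) -> v2: note the intermediate and output types match
    return sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):  # map phase
            groups[k2].append(v2)          # shuffle: group by intermediate key
    # reduce phase, over keys in sorted order as the paper guarantees
    return {k2: reduce_fn(k2, v2s) for k2, v2s in sorted(groups.items())}

result = mapreduce([(0, "to be or not to be")], map_fn, reduce_fn)
# result == {"be": 2, "not": 1, "or": 1, "to": 2}
```

Everything outside map_fn and reduce_fn is the framework's job, which is exactly what makes the automatic parallelization and fault tolerance possible.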
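The execution notes above can be sketched too: each mapper partitions its intermediate pairs into R regions by a hash of the key, and each reducer merges the sorted runs it fetches from every mapper. This is a toy in-memory model (lists stand in for local files and RPC, heapq.merge stands in for the external sort; crc32 is my deterministic stand-in for the partitioning hash):

```python
import heapq
import zlib
from itertools import groupby

R = 2  # number of reduce partitions

def partition(pairs, R):
    # Mapper side: split intermediate (key, value) pairs into R sorted runs,
    # as if each run were spilled to a local file for one reducer.
    runs = [[] for _ in range(R)]
    for key, value in pairs:
        runs[zlib.crc32(key.encode()) % R].append((key, value))
    return [sorted(run) for run in runs]

def reduce_partition(runs, reduce_fn):
    # Reducer side: merge the sorted runs fetched from each mapper,
    # then group by key and apply the user's reduce function.
    merged = heapq.merge(*runs)
    return {k: reduce_fn(k, [v for _, v in grp])
            for k, grp in groupby(merged, key=lambda kv: kv[0])}

# Two mappers each produce intermediate pairs; reducer r merges run r from both.
m1 = partition([("a", 1), ("b", 1)], R)
m2 = partition([("a", 1), ("c", 1)], R)
out = {}
for r in range(R):
    out.update(reduce_partition([m1[r], m2[r]], lambda k, vs: sum(vs)))
# out == {"a": 2, "b": 1, "c": 1}
```

Because every run arrives sorted, the reducer only needs a streaming merge rather than a full re-sort, which is what makes the heap-style "start immediately" idea above plausible in the first place.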
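The atomic-rename trick at the end of a reduce task works the same way on a local filesystem: write to a private temporary name, then rename into place, so a half-written output from a failed or duplicate (backup) task is never visible. A sketch, with os.replace standing in for GFS's atomic rename:

```python
import os
import tempfile

def write_reduce_output(final_path, lines):
    # Write to a private temporary file in the same directory...
    dirname = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "w") as f:
        for line in lines:
            f.write(line + "\n")
    # ...then atomically rename it into place. If two executions of the same
    # task race here, the rename guarantees exactly one complete file wins.
    os.replace(tmp_path, final_path)

# Demo: a reducer writing its final partition file
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "part-00000")
write_reduce_output(demo_path, ["apple\t2", "banana\t1"])
with open(demo_path) as f:
    contents = f.read()
# contents == "apple\t2\nbanana\t1\n"
```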
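The bad-record protocol described above can be simulated in a few lines: workers report the offset of the record they crashed on, and once the master has seen two reports for the same record it adds it to a skip set for the next re-execution. A toy in-process model (the real signal-handler and UDP mechanics are elided, and the class/method names are mine):

```python
from collections import defaultdict

class Master:
    # Tracks crash reports per (task, record offset); after two reports,
    # the record is skipped on the next re-execution of the task.
    def __init__(self):
        self.reports = defaultdict(int)
        self.skip = set()

    def report_bad_record(self, task_id, offset):
        # In the real system this arrives as a "last gasp" UDP packet
        # sent from the dying worker's signal handler.
        self.reports[(task_id, offset)] += 1
        if self.reports[(task_id, offset)] >= 2:
            self.skip.add((task_id, offset))

    def should_skip(self, task_id, offset):
        return (task_id, offset) in self.skip

master = Master()
master.report_bad_record("map-7", 1024)       # first worker crashes here
assert not master.should_skip("map-7", 1024)  # one failure: just retry
master.report_bad_record("map-7", 1024)       # a second worker crashes too
assert master.should_skip("map-7", 1024)      # now skip the record
```

Requiring two independent failures is what distinguishes a genuinely bad record from a transient machine problem.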
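The counter mechanism is also easy to sketch: each worker attaches its current counter values to its heartbeat reply, and the master keeps the latest snapshot per task so that repeated heartbeats (or duplicate task executions) don't double count. This is my sketch of the idea, not the paper's actual data structures:

```python
from collections import Counter

class CounterMaster:
    # Aggregates per-task counter values carried on heartbeat (ping) replies.
    def __init__(self):
        self.per_task = {}

    def heartbeat(self, task_id, counters):
        # Latest snapshot per task wins, so re-reporting is harmless.
        self.per_task[task_id] = dict(counters)

    def totals(self):
        total = Counter()
        for counters in self.per_task.values():
            total.update(counters)
        return dict(total)

m = CounterMaster()
m.heartbeat("map-0", {"words": 10})
m.heartbeat("map-0", {"words": 25})  # later heartbeat supersedes the first
m.heartbeat("map-1", {"words": 7})
# m.totals() == {"words": 32}
```

Piggybacking on an existing periodic message means counters cost essentially nothing in extra network traffic, which mattered given how scarce bandwidth was at the time.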