GFS: Evolution on Fast-forward - Google, 2009

This is a discussion with Google engineer Sean Quinlan on GFS.

  • Single Master
    • Simplify the design problem
    • Short time to deliver the system
    • Metadata grew linearly with the amount of storage
      • From “a few hundred terabytes up to petabytes, and then up to tens of petabytes”
      • Scanning the chunks for GC took more time
    • Metadata had to remain in memory
    • Every client has to talk to the master to open a file
    • Ended up with multiple cells per data center (DC)
      • Each with their own master
      • Run multiple chunkservers on each machine with each going to a different master
      • Use “Name Spaces” to differentiate. I’m assuming Borg’s NS.
    • They put a lot of work into making the master more efficient

      “It’s atypical of Google to put a lot of work into tuning any one particular binary.”
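
The linear growth of metadata noted above can be made concrete with back-of-envelope arithmetic. The sketch below assumes roughly 64 bytes of master metadata per 64 MB chunk, which is the upper bound quoted in the original GFS paper rather than anything stated in this interview, and it ignores per-file namespace metadata. The function name and the sample cell sizes are illustrative only.

```python
# Rough estimate of master memory needed for chunk metadata.
# Assumption: ~64 bytes of metadata per 64 MB chunk (upper bound from the
# original GFS paper); per-file namespace metadata is ignored.

CHUNK_SIZE = 64 * 2**20        # 64 MB chunk
BYTES_PER_CHUNK_META = 64      # assumed bound per chunk


def master_metadata_bytes(stored_bytes: int) -> int:
    """Chunk metadata size for a cell holding `stored_bytes` of data."""
    chunks = -(-stored_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * BYTES_PER_CHUNK_META


TB = 2**40
PB = 2**50
for label, size in [("300 TB", 300 * TB), ("1 PB", PB), ("10 PB", 10 * PB)]:
    mib = master_metadata_bytes(size) / 2**20
    print(f"{label:>7}: ~{mib:,.0f} MiB of chunk metadata")
```

Under these assumptions chunk metadata is about one millionth of the stored data, so a tenfold increase in storage means a tenfold increase in what the single master must hold in memory and scan.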

  • 64 MB Chunk Size
    • Some users wanted to use GFS for small files (< 1 MB)
    • This incurs significant overhead in the system
    • They put quotas on the number of files and the size of storage

      “The limit that people have ended up running into most has been, by far, the file-count quota.”

    • Smaller files also mean more seeking
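
To illustrate why the file-count quota tends to be hit first with small files, here is a minimal sketch of a two-limit quota. The `Quota` class, its fields, and the limits used are hypothetical and are not taken from GFS; they only show how the two limits interact.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hypothetical per-user quota with a file-count and a byte limit."""
    max_files: int
    max_bytes: int
    files: int = 0
    bytes_used: int = 0

    def try_create(self, size: int) -> str:
        """Return 'ok', or the name of the limit that would be exceeded."""
        if self.files + 1 > self.max_files:
            return "file-count"
        if self.bytes_used + size > self.max_bytes:
            return "bytes"
        self.files += 1
        self.bytes_used += size
        return "ok"


def first_limit_hit(quota: Quota, file_size: int) -> str:
    """Create files of `file_size` bytes until one of the limits is hit."""
    while True:
        result = quota.try_create(file_size)
        if result != "ok":
            return result


MIB = 2**20
GIB = 2**30
small = first_limit_hit(Quota(max_files=1000, max_bytes=32 * GIB), 1 * MIB)
large = first_limit_hit(Quota(max_files=1000, max_bytes=32 * GIB), 64 * MIB)
print(small, large)
```

With these made-up limits, a user writing 1 MB files runs out of file count after using about 1 GB of the 32 GB allowance, while a user writing chunk-sized files runs out of bytes first.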
  • Throughput vs. Latency
    • GFS was designed for high throughput; high latency was acceptable
    • BigTable, built on top of GFS, keeps its commit log on GFS
    • To mitigate intermittent delays when writing to the log, BigTable keeps two commit logs open and switches if one is slow
    • Gmail uses a multihomed approach across DCs
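
The two-log technique described for BigTable can be sketched as follows. This is a simplified simulation, not BigTable's implementation: `SimLog`, `DualLogWriter`, and the latency values are hypothetical, the slow write is assumed to complete before the switch (a real writer would more likely time out and retry on the other log), and merging the two logs by sequence number on replay is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SimLog:
    """Stand-in for a commit log file; `latency_ms` simulates write delay."""
    name: str
    latency_ms: Callable[[int], float]
    records: List[Tuple[int, str]] = field(default_factory=list)

    def append(self, seq: int, data: str) -> float:
        self.records.append((seq, data))
        return self.latency_ms(seq)   # simulated delay, no real sleeping


class DualLogWriter:
    """Writes to one of two open logs and switches when a write is slow."""

    def __init__(self, first: SimLog, second: SimLog, slow_ms: float):
        self.logs = [first, second]
        self.active = 0
        self.slow_ms = slow_ms
        self.seq = 0

    def write(self, data: str) -> None:
        self.seq += 1
        took = self.logs[self.active].append(self.seq, data)
        if took > self.slow_ms:
            self.active = 1 - self.active   # use the other log from now on


def replay(first: SimLog, second: SimLog) -> List[str]:
    """Merge both logs by sequence number (assumed recovery strategy)."""
    return [data for _, data in sorted(first.records + second.records)]


# Log "a" stalls on the third write; log "b" is always fast.
log_a = SimLog("a", lambda seq: 500.0 if seq == 3 else 5.0)
log_b = SimLog("b", lambda seq: 5.0)
writer = DualLogWriter(log_a, log_b, slow_ms=100.0)
for i in range(1, 6):
    writer.write(f"mutation-{i}")

print([seq for seq, _ in log_a.records])
print([seq for seq, _ in log_b.records])
print(replay(log_a, log_b))
```

The sequence number carried by each record is what lets a reader reconstruct one ordered history from two files, which is the cost of hiding GFS latency spikes from the writer.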
  • Consistency
    • GFS does not guarantee that all of the replicas of a chunk are byte-wise identical
    • Duplicate records or half-written records can appear
      • GFS deals with half-written records
      • Application has to deal with duplicates
    • A read is not guaranteed to return the latest data
    • Users did not expect this behavior
    • Quinlan believes the right approach is to just have one writer per file
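
Since handling duplicates is left to the application, a reader typically needs some way to recognize a record it has already seen. A minimal sketch, assuming each writer attaches a unique identifier to every record; the `(record_id, payload)` format and the `dedupe` helper are hypothetical:

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, bytes]   # (record_id, payload); hypothetical format


def dedupe(records: Iterable[Record]) -> Iterator[Record]:
    """Yield each record once, keeping the first occurrence of every ID."""
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue          # duplicate produced by a retried append
        seen.add(record_id)
        yield record_id, payload


raw = [("r1", b"a"), ("r2", b"b"), ("r2", b"b"), ("r3", b"c"), ("r1", b"a")]
unique = list(dedupe(raw))
print([record_id for record_id, _ in unique])   # ['r1', 'r2', 'r3']
```

This keeps every ID seen so far in memory, which is fine for a sketch but would need bounding (for example, per-file or per-window) for large inputs.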
  • Snapshot
    • They worked hard on a system to do great snapshots (really clones)
    • Quinlan notes that the feature is not used that often, despite it being really hard to build