The Google File System - Google, 2003

This paper is one of my favorites and is pretty easy to read.

  • [Paper] [Mirror]
  • Lecture slides from UTexas CS439
  • Observed:
    1. “Failures are the norm rather than the exception”
    2. Most files are huge
      • GBs or TBs, not KBs
      • Even when a dataset consists of many KB-sized objects, users prefer to bundle them into a single large file
    3. Most files are mutated by appending new data rather than by overwriting existing data
  • Architecture
    • master
      • contains metadata for the files including:
        • namespace
        • permissions
        • file -> chunk mapping
        • locations of chunks
      • in charge of activities such as:
        • garbage collection of orphaned chunks
        • chunk lease management
        • chunk migration between chunkservers
      • single master to simplify design
      • all metadata is kept in memory
    • chunkservers
    • chunks
      • files are divided into fixed-size chunks (64 MB in the paper)
      • each chunk is identified by a globally unique 64-bit handle and replicated on multiple chunkservers (three by default)
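    • A minimal sketch of the metadata the master keeps in memory, using illustrative Python dataclasses (the field names are mine, not the paper's):

      from dataclasses import dataclass, field

      @dataclass
      class ChunkInfo:
          handle: int                                     # globally unique 64-bit chunk handle
          version: int = 1                                # bumped whenever a new lease is granted
          locations: list = field(default_factory=list)   # chunkserver addresses; rebuilt at startup, never persisted

      @dataclass
      class FileMeta:
          permissions: int
          chunks: list = field(default_factory=list)      # ordered chunk handles for this file

      class MasterState:
          def __init__(self):
              self.namespace = {}   # full pathname -> FileMeta (namespace and permissions)
              self.chunks = {}      # chunk handle -> ChunkInfo
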
  • Read Example:
    1. Given a filename and offset, the client calculates the chunk index within the file.

      This is possible because the chunk size is fixed (see the sketch after this list).

    2. Client sends master request with filename and chunk index.
    3. Master replies with the chunk handle and locations of replicas
    4. Client caches this chunk handle and location information
    5. Client sends a read request to a chunkserver with the chunk handle and the byte range required.
    • The client doesn’t need to talk with the master until the cache expires
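    • A sketch of the client-side read path above; client.master.find_chunk and client.rpc are hypothetical stand-ins, not the paper's API:

      CHUNK_SIZE = 64 * 1024 * 1024   # chunks are fixed-size (64 MB in the paper)

      def read(client, filename, offset, length):
          # 1. compute the chunk index from the byte offset; possible because chunks are fixed size
          chunk_index = offset // CHUNK_SIZE

          # 2.-4. ask the master for the chunk handle and replica locations, unless cached
          key = (filename, chunk_index)
          if key not in client.cache:
              client.cache[key] = client.master.find_chunk(filename, chunk_index)
          handle, replicas = client.cache[key]

          # 5. read the byte range from one replica (assumes the range stays inside one chunk)
          start = offset % CHUNK_SIZE
          return client.rpc(replicas[0], "read_chunk", handle, start, length)
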
  • Operation Log
    • the namespace and the file-to-chunk mapping are persisted to the operation log
    • chunk locations are not persisted; they are rebuilt by asking the chunkservers at startup

      “Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master.”

    • the master checkpoints its state so the log can be truncated and kept small; recovery loads the latest checkpoint and replays only the records after it
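    • A rough sketch of the log-plus-checkpoint idea, assuming a made-up JSON record format (not GFS's actual on-disk format):

      import json

      def log_mutation(log_file, record):
          # every namespace / file->chunk mutation is appended and flushed to the
          # operation log before it takes effect and becomes visible to clients
          log_file.write(json.dumps(record) + "\n")
          log_file.flush()

      def apply_mutation(state, record):
          # hypothetical helper: apply one logged mutation to the in-memory metadata
          state[record["path"]] = record["meta"]

      def recover(checkpoint_path, log_path):
          # load the latest checkpoint, then replay only the records logged after it
          with open(checkpoint_path) as f:
              state = json.load(f)
          with open(log_path) as f:
              for line in f:
                  apply_mutation(state, json.loads(line))
          return state
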
  • Mutations (two kinds)
    1. write
      • write to a specified location in the file
    2. record append
      • append a record “atomically at least once” wherever GFS wants
      • GFS can insert duplicates or padding
        • padding is filtered by client library
        • duplicates are filtered by the application
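    • A sketch of reader-side filtering for record append output; the record dict and its id field are made up (the paper only says readers can use checksums to detect padding and unique identifiers to discard duplicates):

      def iter_records(raw_records, is_padding):
          # skip padding inserted by GFS and drop duplicates using an
          # application-chosen unique id carried inside each record
          seen = set()
          for rec in raw_records:
              if is_padding(rec) or rec["id"] in seen:
                  continue
              seen.add(rec["id"])
              yield rec
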
  • Chunk consistency guaranteed by:
    1. applying mutations to a chunk in the same order on all replicas
      • The master grants a lease on the chunk to one of the replicas, which becomes the primary and chooses the mutation order.
    2. chunk version numbers to differentiate between stale replicas
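    • A sketch of stale-replica detection with version numbers, building on the MasterState sketch above (set_version is a hypothetical RPC):

      def grant_lease(master, handle):
          # the master bumps the chunk version whenever it grants a new lease and tells
          # the up-to-date replicas; a replica that misses the bump (e.g. it was down)
          # keeps the old version and is later detected as stale and garbage collected
          chunk = master.chunks[handle]
          chunk.version += 1
          for server in chunk.locations:
              server.set_version(handle, chunk.version)

      def is_stale(master, handle, reported_version):
          # chunkservers report their chunks and versions, e.g. when they start up
          return reported_version < master.chunks[handle].version
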
  • Write Example
    1. Client asks the master which chunkserver holds the lease for the chunk (the primary). If no one does, the master grants a lease to one of the replicas.
    2. Master replies with the identity of the primary and the locations of the other (secondary) replicas. The client need not contact the master again until the lease expires or the primary becomes unreachable.
    3. Client sends data to all replicas in any order. Chunkservers cache data to separate data flow from control flow.
    4. Replicas acknowledge receipt to client.
    5. Client sends write request to primary once all acknowledgments are in.
    6. Primary assigns consecutive serial numbers to all mutations received.
    7. Primary forwards write request to all replicas.
    8. Replicas apply mutations in the same serial order.
    9. Replicas tell the primary that they have completed the mutation.
    10. Primary replies to the client.
      • Note: The paper notes that if an application requests a write that spans two chunks, the client library breaks the write into smaller writes. However, other clients may write in between the smaller writes leaving a consistent yet undefined state. See atomic append for a solution.
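    • A sketch of the control flow on the primary for steps 6-10; apply and secondaries are invented names, and the data itself was already pushed and cached in step 3:

      def primary_handle_write(primary, handle, mutation):
          # 6. assign a serial number so every replica applies mutations in the same order
          primary.next_serial += 1
          serial = primary.next_serial

          # apply locally, then 7.-9. forward the request to the secondaries and collect replies
          primary.apply(handle, serial, mutation)
          failures = [sec for sec in primary.secondaries
                      if not sec.apply(handle, serial, mutation)]

          # 10. reply to the client; any replica error leaves the region inconsistent
          # and the client is expected to retry the mutation
          return "ok" if not failures else "error"
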
  • Data Flow Pipelining
    • This idea was especially interesting to me!
    • They wanted to utilize both the inbound and outbound bandwidth of their machines
    • Data is pushed linearly along a chain of chunkservers instead of some tree-like topology
    • Each chunkserver in the chain forwards the data to the closest chunkserver that has not received it yet
    • Data is forwarded as soon as the first pieces arrive, rather than after the whole transfer has been received
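    • A sketch of the receive-and-forward loop on a chunkserver; conn is assumed to be a socket-like object and the piece size is made up:

      def receive_and_forward(conn, rest_of_chain, buffer, piece_size=64 * 1024):
          # buffer the incoming data for the later write request, and forward each
          # piece to the next-closest chunkserver as soon as it arrives instead of
          # waiting for the whole transfer, using both inbound and outbound bandwidth
          while True:
              piece = conn.recv(piece_size)
              if not piece:
                  break
              buffer.append(piece)
              if rest_of_chain:
                  rest_of_chain[0].send(piece)   # the next hop repeats this for rest_of_chain[1:]
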
  • Atomic Append
    • Similar to the write example:
      1. Client sends data to replicas of the last chunk of the file.
      2. Sends append request to primary.
      3. Primary determines if appending would exceed the chunk size.
        • If it would, it pads the chunk out to the chunk boundary, tells the replicas to do the same, and then notifies the client to retry on the next chunk.
        • Otherwise, the primary appends the data and tells the replicas to write the data to the same offset.
    • Note: If the append fails on any replica, the client retries and the record is written again at a higher offset, so some replicas may hold leftover data from the failed attempt. This means replicas of a chunk are not bytewise identical, and the application is left to deal with the inconsistent region (padding and duplicate records).
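    • A sketch of the primary's decision in step 3; chunk_length, pad_to_boundary, write_at, and secondaries are invented helper names:

      CHUNK_SIZE = 64 * 1024 * 1024

      def record_append(primary, handle, record):
          used = primary.chunk_length(handle)
          if used + len(record) > CHUNK_SIZE:
              # pad the current chunk to its boundary on all replicas and make the
              # client retry the append on the next chunk of the file
              primary.pad_to_boundary(handle)
              for sec in primary.secondaries:
                  sec.pad_to_boundary(handle)
              return "retry on next chunk"
          # otherwise append locally and tell the secondaries to write at the same offset
          primary.write_at(handle, used, record)
          for sec in primary.secondaries:
              sec.write_at(handle, used, record)
          return used   # the offset at which the record was appended
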
  • Snapshots
    • Snapshots complete almost instantly
    • Uses copy-on-write
    • Master revokes outstanding leases on chunks for affected files.
      • Any subsequent write must ask the master for a new lease first, which gives the master the chance to create a new copy of the chunk.
    • Master duplicates metadata for file or directory tree
      • This copy still points to the old chunks
    • Example write to snapshotted chunk
      1. Client asks master for primary
      2. Master notices that the reference count for the chunk is > 1 and picks a new chunk handle for the copy.
      3. Master asks each replica of the existing chunk to copy the data to a new chunk with the given handle.
        • This is done on the same machine to reduce network usage
      4. Request is now handled as before.
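    • A sketch of the copy-on-write check on the master; the helper names (new_handle, clone_chunk, register_chunk, grant_lease) and the ref_count field are invented:

      def handle_lease_request(master, file_meta, chunk_index):
          handle = file_meta.chunks[chunk_index]
          chunk = master.chunks[handle]
          if chunk.ref_count > 1:
              # the chunk is shared with a snapshot: have every chunkserver holding it
              # clone it locally (no network copy) under a fresh handle, then point the
              # writing file at the copy and serve the lease from there
              new_handle = master.new_handle()
              for server in chunk.locations:
                  server.clone_chunk(src=handle, dst=new_handle)
              chunk.ref_count -= 1
              master.register_chunk(new_handle, chunk.locations)
              file_meta.chunks[chunk_index] = new_handle
              handle = new_handle
          return master.grant_lease(handle)
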
  • Data Integrity
    • Since chunk replicas are not guaranteed to be bytewise identical, each chunkserver is responsible for verifying its own chunks.
    • Each 64 KB block is given a checksum.
    • On a read the chunkserver verifies the checksum of the required blocks.
      • The chunkserver will not propagate unverified data.
      • If the checksum does not match, the server notifies the master.
      • The master will clone the chunk from another replica
    • Checksum computation is optimized for appends at the end of a chunk
      • i.e., the checksum of the last, partial block can be updated incrementally with just the new data
    • Chunkservers periodically verify rarely read chunks.
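    • A sketch of per-block checksumming, using zlib.crc32 as a stand-in for the paper's unspecified checksum; checksums is a dict of block index -> CRC:

      import zlib

      BLOCK = 64 * 1024   # checksum granularity: one CRC per 64 KB block

      def verify_range(chunk_data, checksums, start, length):
          # verify only the blocks overlapping the requested byte range before serving it
          first, last = start // BLOCK, (start + length - 1) // BLOCK
          for i in range(first, last + 1):
              block = chunk_data[i * BLOCK:(i + 1) * BLOCK]
              if zlib.crc32(block) != checksums.get(i):
                  raise IOError(f"corrupt block {i}: report to master, read from another replica")

      def append_record(chunk_data, checksums, record):
          # an append only extends the last, partial block, so its checksum can be updated
          # incrementally from the old value, since crc32(a + b) == crc32(b, crc32(a));
          # for simplicity this assumes the record does not cross a block boundary
          last = len(chunk_data) // BLOCK
          checksums[last] = zlib.crc32(record, checksums.get(last, 0))
          return chunk_data + record
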
  • MISC:
    • They don’t have file data caches on chunkservers or clients, as most workloads stream through files.
    • Their lazy GC is simple and uniformly handles cases that would be tricky to clean up eagerly.
    • Did not use erasure codes / parity codes at the time, but they figure it would not be hard to add given that the load is dominated by appends and reads.
    • Shadow masters can provide read-only access to the files while the primary master is down. These shadow masters also poll chunkservers for their available chunks.