This paper is one of my favorites and is pretty easy to read.
- [Paper] [Mirror]
- Lecture slides from UTexas CS439
- Observed:
- “Failures are the norm rather than the exception”
- Most files are huge
- GBs or TBs, not KBs
- Even if a dataset is composed of KB-sized objects, humans prefer to
  combine them into one large file
- Most files are mutated by appending rather than by random writes
- Architecture
- master
- contains metadata for the files, including (sketched below):
- namespace
- permissions
- file -> chunk mapping
- locations of chunks
- in charge of activities such as:
- garbage collection (GC)
- chunk lease management
- chunk migration
- single master to simplify design
- all metadata is kept in memory
- chunkservers
- chunks
- files are divided into fixed-size chunks
- identified by a globally unique 64-bit handle
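- To make the master's state concrete, here is a minimal Python sketch of its in-memory metadata (all names are my own, not from the paper). Note that chunk locations are not persisted; the master rebuilds them by polling chunkservers at startup.

```python
from dataclasses import dataclass, field

ChunkHandle = int  # globally unique, 64-bit in real GFS

@dataclass
class ChunkInfo:
    version: int = 0  # used later to detect stale replicas
    locations: list[str] = field(default_factory=list)  # chunkserver addresses

@dataclass
class FileMeta:
    permissions: int = 0o644
    chunks: list[ChunkHandle] = field(default_factory=list)  # i-th entry = i-th chunk

class MasterState:
    """Everything lives in RAM; the namespace and file -> chunk mapping are
    additionally persisted via the operation log, chunk locations are not."""
    def __init__(self) -> None:
        self.namespace: dict[str, FileMeta] = {}  # full pathname -> metadata
        self.chunk_table: dict[ChunkHandle, ChunkInfo] = {}
```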
- Read Example:
- Given a filename and offset, the client calculates the chunk index within
  the file. This is possible because the chunk size is fixed (see the
  sketch after this example).
- Client sends master request with filename and chunk index.
- Master replies with the chunk handle and locations of replicas
- Client caches this chunk handle and location information
- Client sends a read request to a chunkserver with the chunk handle and
the byte range required.
- The client doesn’t need to talk to the master again until the cached
  info expires
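- The index arithmetic in the first step is trivial precisely because chunks are fixed size; a sketch (chunk size from the paper, function names my own):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the paper's chunk size

def chunk_location_request(filename: str, offset: int) -> tuple[str, int]:
    # Which chunk of the file holds this byte? Integer division suffices
    # because every chunk (except possibly the last) is exactly CHUNK_SIZE.
    return filename, offset // CHUNK_SIZE

def offset_within_chunk(offset: int) -> int:
    # The byte range sent to the chunkserver is relative to the chunk start.
    return offset % CHUNK_SIZE
```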
- Operation Log
  - the persistent record of critical metadata changes; it also defines
    the serialized order of concurrent operations
  - replicated to remote machines; the master recovers by replaying it
    (periodic checkpoints keep replay time short)
- Mutations (two kinds)
- write
- write to a specified location in the file
- record append
- append a record “atomically at least once” wherever GFS wants
- GFS can insert duplicates or padding
- padding is filtered by client library
- duplicates are filtered by the application (see the framing sketch below)
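- GFS itself doesn't define a record format; a hypothetical framing that lets readers skip padding/garbage and lets applications drop duplicates might look like:

```python
import struct
import zlib

MAGIC = 0xC0DE  # hypothetical marker; padding won't start with it

def frame(payload: bytes, record_id: int) -> bytes:
    # Header: magic, payload length, app-chosen unique id, payload CRC.
    # The CRC lets a reader reject garbage and padding; the id lets the
    # application drop the duplicates that retried appends can create.
    return struct.pack(">IIQI", MAGIC, len(payload), record_id,
                       zlib.crc32(payload)) + payload
```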
- Chunk consistency guaranteed by:
- applying mutations to a chunk in the same order on all replicas
- The master grants a chunk lease to one replica, deemed the primary,
  which picks the serial order of mutations.
- chunk version numbers to detect stale replicas (sketch below)
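- A rough sketch of the version-number mechanism (my own simplification of the paper's stale-replica detection; names are mine):

```python
class VersionAuthority:
    """Toy model of the master's chunk version bookkeeping."""
    def __init__(self) -> None:
        self.version: dict[int, int] = {}  # chunk handle -> latest version

    def grant_lease(self, handle: int) -> int:
        # Bump the version whenever a new lease is granted; all live
        # replicas are told the new number before mutations proceed.
        self.version[handle] = self.version.get(handle, 0) + 1
        return self.version[handle]

    def is_stale(self, handle: int, reported: int) -> bool:
        # A replica that was down during a lease grant reports an old
        # version; the master garbage-collects it and re-replicates.
        return reported < self.version.get(handle, 0)
```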
- Write Example
- Client asks the master for the primary replica.
  The master grants a lease to a replica if there is no current primary.
- Master replies with the identity of the primary and the other replicas.
  The client need not contact the master again until the lease expires or
  the primary becomes unreachable.
- Client sends data to all replicas in any order.
Chunkservers cache data to separate data flow from control flow.
- Replicas acknowledge receipt to client.
- Client sends write request to primary once all acknowledgments are in.
- Primary assigns consecutive serial numbers to all mutations received.
- Primary forwards write request to all replicas.
- Replicas apply mutations in the same serial order.
- Replicas tell the primary that they have completed the mutation.
- Primary replies to the client.
- Note: The paper notes that if an application requests a write that
  spans two chunks, the client library breaks it into smaller writes.
  However, other clients may write in between those smaller writes,
  leaving the region consistent but undefined. See atomic append for a
  solution.
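- A minimal sketch of the primary's role in the steps above (the data-push step is modeled as a pre-filled buffer; all names are mine):

```python
import itertools

class Replica:
    """Toy chunkserver: data pushed in the earlier step sits in `buffered`,
    keyed by a client-chosen data id, until it is committed."""
    def __init__(self) -> None:
        self.buffered: dict[int, bytes] = {}
        self.chunk = bytearray()

    def apply(self, offset: int, data_id: int, serial: int) -> bool:
        # `serial` is the primary-assigned order; a real chunkserver would
        # hold a mutation back until all lower serials have been applied.
        data = self.buffered.pop(data_id)
        end = offset + len(data)
        if len(self.chunk) < end:
            self.chunk.extend(b"\0" * (end - len(self.chunk)))
        self.chunk[offset:end] = data
        return True

class Primary(Replica):
    def __init__(self, secondaries: list[Replica]) -> None:
        super().__init__()
        self.secondaries = secondaries
        self.serials = itertools.count(1)

    def write(self, offset: int, data_id: int) -> bool:
        n = next(self.serials)               # assign a serial number
        ok = self.apply(offset, data_id, n)  # apply locally
        for r in self.secondaries:           # forward in serial order
            ok &= r.apply(offset, data_id, n)
        return ok                            # reply to the client
```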
- Data Flow Pipelining
- This idea was especially interesting to me!
- They wanted to fully utilize each machine's outbound bandwidth instead
  of splitting it among multiple recipients
- Data is pushed along a chain of chunkservers instead of some tree like
topology
- Each chunkserver in the chain forwards the data to the closest
  chunkserver that has not yet received it
- The data is forwarded as soon as the first packets arrive, not once the
  whole payload has been received
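- The paper's back-of-envelope estimate: pushing B bytes through a chain of R replicas takes roughly B/T + R·L, where T is per-link throughput and L is per-hop latency. A quick check of their numbers:

```python
def pipeline_time(bytes_: int, throughput_bps: float, replicas: int,
                  latency_s: float) -> float:
    # B/T + R*L: the chain uses each machine's full outbound link, so the
    # replica count only adds per-hop latency, not extra transfer time.
    return bytes_ * 8 / throughput_bps + replicas * latency_s

# 1 MB over 100 Mbps links with <1 ms hops: ~80 ms, matching the paper
print(pipeline_time(1_000_000, 100e6, 3, 1e-3))  # ≈ 0.083 s
```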
- Atomic Append
- Similar to the write example:
- Client sends data to all replicas of the last chunk of the file.
- Client then sends the append request to the primary.
- Primary determines if appending would exceed the maximum chunk size.
  - If it would, it pads the chunk out to the chunk size, telling the
    replicas to do the same, then notifies the client to retry on the
    next chunk.
  - Otherwise, the primary appends the data and tells the replicas to
    write the data at the same offset.
- Note: If there is a failure, the worst case is that some replica has
  written extra data. The next append therefore goes at an offset past
  that hypothetical data to protect it. This means replicas of a chunk
  are not bytewise identical, and the client is left to deal with the
  inconsistent region. A primary-side sketch follows.
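- A sketch of the primary-side decision (the chunk size and the quarter-chunk record cap are from the paper; everything else is my own illustration):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB
MAX_RECORD = CHUNK_SIZE // 4   # the paper caps records to bound padding waste

def record_append(chunk: bytearray, data: bytes):
    """Return the offset the record landed at, or None if the client must
    retry on a fresh chunk. Replicas are told to mirror whichever action
    the primary takes."""
    assert len(data) <= MAX_RECORD
    if len(chunk) + len(data) > CHUNK_SIZE:
        chunk.extend(b"\0" * (CHUNK_SIZE - len(chunk)))  # pad to full size
        return None                                      # retry on next chunk
    offset = len(chunk)   # primary picks the offset; replicas use the same one
    chunk.extend(data)
    return offset
```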
- Snapshots
- Completes almost instantly
- Uses copy-on-write
- Master revokes outstanding leases on chunks of the affected files.
  - Any subsequent write must first contact the master, which can then
    create a new copy of the chunk.
- Master duplicates metadata for file or directory tree
- This copy still points to the old chunks
- Example write to snapshotted chunk
- Client asks master for primary
- Master notices that the reference count for the chunk is > 1.
  It picks a new chunk handle for a copy of the existing chunk.
- Master asks each replica of the existing chunk to copy its data into a
  new chunk with the given handle.
  - The copy is made locally on each chunkserver to reduce network usage
- The request is then handled as before (see the copy-on-write sketch below).
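- A toy version of the copy-on-write bookkeeping described above (names and structure are mine):

```python
class SnapshotMaster:
    def __init__(self) -> None:
        self.files: dict[str, list[int]] = {}  # filename -> chunk handles
        self.refcount: dict[int, int] = {}     # chunk handle -> references
        self._next = 0

    def snapshot(self, src: str, dst: str) -> None:
        # After revoking leases: duplicate only the metadata. Both files
        # now point at the same chunks, so bump their reference counts.
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refcount[h] = self.refcount.get(h, 1) + 1

    def chunk_for_write(self, filename: str, index: int) -> int:
        h = self.files[filename][index]
        if self.refcount.get(h, 1) > 1:  # shared with a snapshot: copy first
            self._next += 1
            new_h = self._next
            # ...ask each replica of h to clone its data locally into new_h...
            self.refcount[h] -= 1
            self.refcount[new_h] = 1
            self.files[filename][index] = new_h
            h = new_h
        return h                         # grant lease / proceed as usual
```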
- Data Integrity
- Since chunk replicas are not guaranteed to be bytewise identical, each
  chunkserver is responsible for verifying its own chunks.
- Each 64 KB block is given a checksum.
- On a read the chunkserver verifies the checksum of the required blocks.
- The chunkserver will not propagate unverified data.
- If the checksum does not match, the server notifies the master.
- The master will clone the chunk from another replica
- Checksum computation is optimized for appends at the end of a chunk
  - i.e., the checksum of the last partial block can be updated
    incrementally with just the new data
- Chunkservers periodically verify rarely read chunks.
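- A sketch of the per-block scheme (CRC32 is my stand-in; the paper doesn't specify the checksum function):

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity from the paper

def checksums(chunk: bytes) -> list[int]:
    # One checksum per 64 KB block, kept in memory and persisted
    # separately from the user data.
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verified_read(chunk: bytes, sums: list[int], offset: int, length: int) -> bytes:
    # Verify every block the requested range overlaps before returning
    # any data, so corruption never propagates to clients or replicas.
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != sums[b]:
            raise IOError(f"block {b} corrupt; report to master")
    return chunk[offset:offset + length]
```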
- MISC:
- They don’t have file data caches on chunkservers or clients, as most
  use cases stream through a file.
- Their lazy GC is simple and uniformly handles complex failure cases
- They did not use erasure or parity codes at the time, but figured they
  would not be hard to add given that the load is mostly appends and reads.
- Shadow masters can provide read-only access to the files while the primary
master is down. These shadow masters also poll chunkservers for their
available chunks.