Cassandra: A Decentralized Structured Storage System - Facebook, 2009

If you haven’t read the Dynamo paper, I suggest you read it before this paper.

  • Originally for their Inbox Search service
    • High write throughput
  • Data Model
    • Operations are atomic under a row key
    • Columns are grouped in column families like Bigtable
    • They also include a new concept of super column families
      • A column family within a column family (the nesting is sketched after this section)
    • The application can dictate the sort order of columns in a super column family or column family
      • In Inbox Search, they sort columns by time so recent results appear first
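To make the nesting concrete, here is a minimal sketch of the data model as nested sorted maps, with a comparator standing in for the per-family sort order (by name, or newest-first by time as in Inbox Search). The class, keyspace, and column names below are my own illustrations, not Cassandra's actual code.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Minimal sketch of the data model as nested sorted maps. The names here are
// illustrative; they are not Cassandra's real classes.
public class DataModelSketch {
    // A column is the smallest unit: a value plus a timestamp.
    record Column(byte[] value, long timestamp) {}

    // A column family maps column name -> Column, sorted by a per-family
    // comparator (by name, or newest-first as Inbox Search does by time).
    static TreeMap<String, Column> newColumnFamily(Comparator<String> order) {
        return new TreeMap<>(order);
    }

    public static void main(String[] args) {
        // Standard column family: row key -> column family name -> columns.
        // In the paper, operations under a single row key are atomic; this
        // sketch only shows the shape of the data, not that guarantee.
        TreeMap<String, TreeMap<String, TreeMap<String, Column>>> keyspace = new TreeMap<>();
        keyspace
            .computeIfAbsent("user:42", k -> new TreeMap<>())
            .computeIfAbsent("InboxSearch", k -> newColumnFamily(Comparator.reverseOrder()))
            .put("2009-05-01T10:00", new Column("msg-id-123".getBytes(), System.currentTimeMillis()));

        // A super column family adds one more level of nesting:
        // row key -> super column family -> super column -> columns.
        TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<String, Column>>>> superCf = new TreeMap<>();
        superCf
            .computeIfAbsent("user:42", k -> new TreeMap<>())
            .computeIfAbsent("MailThreads", k -> new TreeMap<>())
            .computeIfAbsent("thread:99", k -> newColumnFamily(Comparator.reverseOrder()))
            .put("2009-05-02T08:30", new Column("msg-id-456".getBytes(), System.currentTimeMillis()));

        System.out.println(keyspace.firstKey() + " has column families "
            + keyspace.get("user:42").keySet() + " and super column families "
            + superCf.get("user:42").keySet());
    }
}
```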
  • Partitioning and Replication
    • Want incremental scalability (like Dynamo)
    • They use consistent hashing (like Dynamo); see the ring sketch after this section
      • Random positioning leads to non-uniform load
        • They note that it also ignores the heterogeneity of servers
      • They didn’t use the Dynamo technique of v-nodes
      • They instead analyze load information and move nodes around the ring
        • They argue for ease of implementation and that it allows them to “make very deterministic choices about the load balancing”
          • The first reason struck me as a little odd
    • They have a coordinator node (a la Dynamo)
    • They use read repair
    • Multiple replication policies
      • I found this cool when I first used Cassandra in 2010
      • Policies:
        1. Rack Unaware
        2. Rack Aware
        3. Datacenter Aware
      • Now the open source version has pluggable policies for other environments like EC2
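A rough sketch of the ring mechanics under the simplest policy: hash the key onto the ring, the first node clockwise is the coordinator, and the next distinct nodes fill out the replica set (essentially “Rack Unaware”; the Rack Aware and Datacenter Aware policies instead skip candidates until replicas land in other racks or datacenters). The code is my own illustration, not Cassandra's partitioner.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

// Sketch of consistent hashing with "Rack Unaware" replica placement:
// the coordinator is the first node clockwise from the key's hash, and
// the remaining replicas are the next distinct nodes on the ring.
public class RingSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // token -> node

    void addNode(String node, long token) { ring.put(token, node); }

    // Hash a key onto the ring (first 8 bytes of MD5, as a signed long).
    long token(String key) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
        long t = 0;
        for (int i = 0; i < 8; i++) t = (t << 8) | (d[i] & 0xff);
        return t;
    }

    // Walk clockwise from the key's token: the first node is the coordinator,
    // the next distinct nodes complete the replica set.
    List<String> replicasFor(String key, int replicationFactor) throws Exception {
        int wanted = Math.min(replicationFactor, ring.size());
        List<String> replicas = new ArrayList<>();
        Iterator<String> it = ring.tailMap(token(key), true).values().iterator();
        while (replicas.size() < wanted) {
            if (!it.hasNext()) it = ring.values().iterator(); // wrap around the ring
            String node = it.next();
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }

    public static void main(String[] args) throws Exception {
        RingSketch ring = new RingSketch();
        // The paper assigns one token per node and moves lightly loaded nodes
        // around the ring based on load, rather than using Dynamo-style v-nodes.
        ring.addNode("nodeA", Long.MIN_VALUE / 2);
        ring.addNode("nodeB", 0L);
        ring.addNode("nodeC", Long.MAX_VALUE / 2);
        System.out.println(ring.replicasFor("user:42", 3));
    }
}
```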
  • Bootstrapping
    • Uses seeds like Dynamo
    • Chooses a random position in the ring
    • Nodes are configured with a Cluster Name
      • A node will only talk to other nodes with the same Cluster Name (see the sketch after this section)
      • This saved my bacon many times when I was running multiple clusters on Amazon for a company I worked for
    • Explicit addition and removal of nodes to the cluster (like Dynamo)
    • At the time of the paper they were planning on having multiple replicas take part in the transfer to speed up the initial bootstrap
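A sketch of the bootstrap behaviour noted above, assuming a random token, a seed list, and a Cluster Name guard on gossip. The names and structure here are hypothetical; the real system does all of this through its gossip layer.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the bootstrap path: a new node picks a random token, announces
// itself to the configured seeds, and only accepts gossip from peers that
// share its Cluster Name. Names are illustrative, not Cassandra's API.
public class BootstrapSketch {
    record NodeState(String clusterName, long token) {}

    private final String clusterName;
    private final Map<String, NodeState> peers = new ConcurrentHashMap<>();

    BootstrapSketch(String clusterName) { this.clusterName = clusterName; }

    NodeState join(List<String> seeds) {
        // Choose a random position on the ring (the paper's initial scheme).
        NodeState me = new NodeState(clusterName, ThreadLocalRandom.current().nextLong());
        for (String seed : seeds) {
            System.out.println("announcing token " + me.token() + " to seed " + seed);
        }
        return me;
    }

    // Gossip guard: a node talks only to peers with the same Cluster Name,
    // which keeps separate clusters on shared infrastructure from merging.
    boolean acceptGossip(String fromHost, NodeState state) {
        if (!clusterName.equals(state.clusterName())) return false;
        peers.put(fromHost, state);
        return true;
    }

    public static void main(String[] args) {
        BootstrapSketch node = new BootstrapSketch("prod-cluster");
        node.join(List.of("10.0.0.1", "10.0.0.2"));
        System.out.println(node.acceptGossip("10.0.0.3", new NodeState("prod-cluster", 42L)));    // true
        System.out.println(node.acceptGossip("10.0.0.9", new NodeState("staging-cluster", 7L)));  // false
    }
}
```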
  • Membership
    • They use Zookeeper to elect a leader in charge of membership and ring management
    • Anti-entropy Gossip protocol (based on Scuttlebutt)
      • Also used to transmit other system state
    • They use a modified version of the Φ Accrual Failure Detector (a toy version is sketched after this section)
      • Essentially, the model gives a real number (Φ) for the level of suspicion that a node has failed
      • Adapts to current network conditions
      • I thought this was a cool idea!
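For intuition, here is a toy accrual failure detector. The paper's modification approximates gossip inter-arrival times with an exponential distribution, under which Φ has a simple closed form; this sketch follows that assumption and is not the production implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy accrual failure detector. Each heartbeat's inter-arrival time feeds a
// sliding window; suspicion Phi grows the longer we go without a heartbeat.
// Assuming exponentially distributed inter-arrival times with mean m,
// P(no heartbeat for t) = exp(-t/m), so Phi = -log10(P) = t / (m * ln 10).
public class PhiAccrualSketch {
    private final Deque<Double> intervals = new ArrayDeque<>();
    private final int windowSize = 1000;
    private double lastHeartbeatMillis = -1;

    void heartbeat(double nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > windowSize) intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    double phi(double nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToDouble(Double::doubleValue).average().orElse(1.0);
        double sinceLast = nowMillis - lastHeartbeatMillis;
        return sinceLast / (mean * Math.log(10)); // -log10(exp(-t/mean))
    }

    public static void main(String[] args) {
        PhiAccrualSketch detector = new PhiAccrualSketch();
        for (int i = 0; i <= 10; i++) detector.heartbeat(i * 1000.0); // one heartbeat per second
        System.out.printf("phi after  2s silence: %.2f%n", detector.phi(12_000));
        System.out.printf("phi after 10s silence: %.2f%n", detector.phi(20_000));
        // A node is marked suspect once Phi crosses a configurable threshold;
        // because the window tracks observed intervals, the detector adapts
        // to current network conditions automatically.
    }
}
```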
  • Local Persistence
    • Commit log on the node (on a dedicated disk for performance)
    • In-memory data structure that is persisted periodically (the write path is sketched after this section)
      • Per column family
      • Since the persisted files are immutable they do not need read locks
        • Though I’d imagine you’d have to take care during compaction with an atomic swap while ensuring there are no readers of the old files
    • They use bloom filters
    • An index for the columns per key so they can jump to the right spot on disk
      • A new index entry is generated at every 256K chunk boundary
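Putting the local-persistence pieces together, a rough sketch of the write and read paths: append to the commit log, apply to the in-memory table, flush to an immutable sorted file with a bloom filter over its keys, and read newest-first while skipping files the bloom filter rules out. This is my own simplification and omits, among other things, the per-key column indices mentioned above.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.TreeMap;

// Rough sketch of the local write path: append to a commit log, apply to an
// in-memory per-column-family table, and flush to an immutable sorted file
// with a bloom filter over its keys. Simplified for illustration.
public class WritePathSketch {
    private final List<String> commitLog = new ArrayList<>();         // stands in for the on-disk log
    private final TreeMap<String, String> memtable = new TreeMap<>(); // one column family, key -> value
    private final List<SSTable> sstables = new ArrayList<>();
    private static final int FLUSH_THRESHOLD = 4;

    // An immutable flushed file: sorted entries plus a bloom filter on keys,
    // so reads can skip files that cannot contain the key, without any locks.
    record SSTable(TreeMap<String, String> data, BitSet bloom) {
        boolean mightContain(String key) {
            return bloom.get(Math.floorMod(key.hashCode(), 1024))
                && bloom.get(Math.floorMod(key.hashCode() * 31 + 17, 1024));
        }
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value);   // 1. log the write (durability in the real system)
        memtable.put(key, value);           // 2. then apply to the in-memory structure
        if (memtable.size() >= FLUSH_THRESHOLD) flush();
    }

    private void flush() {
        BitSet bloom = new BitSet(1024);
        for (String key : memtable.keySet()) {
            bloom.set(Math.floorMod(key.hashCode(), 1024));
            bloom.set(Math.floorMod(key.hashCode() * 31 + 17, 1024));
        }
        sstables.add(new SSTable(new TreeMap<>(memtable), bloom));
        memtable.clear();
        commitLog.clear(); // the logged writes are no longer needed once flushed
    }

    String read(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        // Newest file first; the bloom filter avoids touching files that can't match.
        for (int i = sstables.size() - 1; i >= 0; i--) {
            SSTable t = sstables.get(i);
            if (t.mightContain(key) && t.data().containsKey(key)) return t.data().get(key);
        }
        return null;
    }

    public static void main(String[] args) {
        WritePathSketch store = new WritePathSketch();
        for (int i = 0; i < 6; i++) store.write("key" + i, "value" + i);
        System.out.println(store.read("key1")); // served from a flushed file
        System.out.println(store.read("key5")); // still in the memtable
    }
}
```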
  • Misc.
    • They don’t discuss any kind of range scans. The current open source version does have range scans.
    • They refer to a state machine for requests (sounds similar to Dynamo)
    • The different modules of the server “rely on an event driven substrate where the message processing pipeline and the task pipeline are split into multiple stages” like SEDA (a toy staged pipeline is sketched below)
    • System control messages are sent over UDP
    • Compaction looks for files that are close to each other in size (see the sketch below)
    • In the Inbox Search case, they send a signal to the cluster when a user clicks into the search box to prime the cache for the user
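The “files close to each other in size” heuristic is roughly what the open source size-tiered compaction strategy later formalized; here is a small sketch of the bucketing idea (my own illustration, with a made-up “within a factor” parameter):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the "compact files of similar size" heuristic: sort data files by
// size and group neighbours whose sizes are within a factor of each other;
// any bucket with enough files is a compaction candidate.
public class CompactionBucketsSketch {
    static List<List<Long>> buckets(List<Long> fileSizes, double withinFactor, int minFiles) {
        List<Long> sorted = new ArrayList<>(fileSizes);
        sorted.sort(Comparator.naturalOrder());

        List<List<Long>> candidates = new ArrayList<>();
        List<Long> bucket = new ArrayList<>();
        for (long size : sorted) {
            // Start a new bucket when this file is much larger than the bucket's first file.
            if (!bucket.isEmpty() && size > bucket.get(0) * withinFactor) {
                if (bucket.size() >= minFiles) candidates.add(bucket);
                bucket = new ArrayList<>();
            }
            bucket.add(size);
        }
        if (bucket.size() >= minFiles) candidates.add(bucket);
        return candidates;
    }

    public static void main(String[] args) {
        // File sizes in MB: three small files of similar size, one much larger.
        List<Long> sizes = List.of(10L, 12L, 11L, 400L);
        // Group files within 1.5x of each other, at least 2 per compaction.
        System.out.println(buckets(sizes, 1.5, 2)); // [[10, 11, 12]]
    }
}
```

And a toy illustration of the SEDA-style staged design quoted above: each stage owns its own queue and thread pool, and a request is handed from stage to stage as events rather than carried end to end by one thread. This is a generic sketch, not Cassandra's actual stages.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Tiny staged, event-driven pipeline in the spirit of SEDA: each stage has its
// own thread pool and queue, and work moves between stages as submitted tasks.
public class StagedPipelineSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService parseStage = Executors.newFixedThreadPool(2);
        ExecutorService storageStage = Executors.newFixedThreadPool(4);
        ExecutorService replyStage = Executors.newSingleThreadExecutor();

        for (int i = 0; i < 3; i++) {
            final int requestId = i;
            parseStage.submit(() -> {
                String mutation = "mutation-" + requestId;        // stage 1: decode the message
                storageStage.submit(() -> {
                    String result = mutation + " applied";        // stage 2: apply to storage
                    replyStage.submit(() ->                       // stage 3: send the response
                        System.out.println("reply: " + result));
                });
            });
        }

        parseStage.shutdown();
        parseStage.awaitTermination(1, TimeUnit.SECONDS);
        storageStage.shutdown();
        storageStage.awaitTermination(1, TimeUnit.SECONDS);
        replyStage.shutdown();
        replyStage.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```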