Cassandra: A Decentralized Structured Storage System - Facebook, 2009

If you haven’t read the Dynamo paper, I suggest you read it before this paper.

  • Originally for their Inbox Search service
    • High write throughput
  • Data Model
    • Operations are atomic under a row key
    • Columns are grouped in column families like Bigtable
    • They also include a new concept of super column families
      • A column family within a column family (the nesting is sketched after this section)
    • The application can dictate the sort order of columns in a super column family or column family
      • In Inbox Search, they sort columns by time so recent results appear first
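To make the nesting concrete, here is a minimal sketch of the data model as nested sorted maps, with a comparator standing in for the per-family sort order (by name, or newest-first by time as in Inbox Search). The class, keyspace, and column names below are my own illustrations, not Cassandra's actual code.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Minimal sketch of the data model as nested sorted maps. The names here are
// illustrative; they are not Cassandra's real classes.
public class DataModelSketch {
    // A column is the smallest unit: a value plus a timestamp.
    record Column(byte[] value, long timestamp) {}

    // A column family maps column name -> Column, sorted by a per-family
    // comparator (by name, or newest-first as Inbox Search does by time).
    static TreeMap<String, Column> newColumnFamily(Comparator<String> order) {
        return new TreeMap<>(order);
    }

    public static void main(String[] args) {
        // Standard column family: row key -> column family name -> columns.
        // In the paper, operations under a single row key are atomic; this
        // sketch only shows the shape of the data, not that guarantee.
        TreeMap<String, TreeMap<String, TreeMap<String, Column>>> keyspace = new TreeMap<>();
        keyspace
            .computeIfAbsent("user:42", k -> new TreeMap<>())
            .computeIfAbsent("InboxSearch", k -> newColumnFamily(Comparator.reverseOrder()))
            .put("2009-05-01T10:00", new Column("msg-id-123".getBytes(), System.currentTimeMillis()));

        // A super column family adds one more level of nesting:
        // row key -> super column family -> super column -> columns.
        TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<String, Column>>>> superCf = new TreeMap<>();
        superCf
            .computeIfAbsent("user:42", k -> new TreeMap<>())
            .computeIfAbsent("MailThreads", k -> new TreeMap<>())
            .computeIfAbsent("thread:99", k -> newColumnFamily(Comparator.reverseOrder()))
            .put("2009-05-02T08:30", new Column("msg-id-456".getBytes(), System.currentTimeMillis()));

        System.out.println(keyspace.firstKey() + " has column families "
            + keyspace.get("user:42").keySet() + " and super column families "
            + superCf.get("user:42").keySet());
    }
}
```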
  • Partitioning and Replication
    • Want incremental scalability (like Dynamo)
    • They use consistent hashing (like Dynamo); see the ring sketch after this section
      • Random positioning leads to non-uniform load
        • They note that it also ignores the heterogeneity of servers
      • They didn’t use the Dynamo technique of v-nodes
      • They instead analyze load information and move nodes around the ring
        • They argue for ease of implementation and that it allows them to “make very deterministic choices about the load balancing”
          • The first reason struck me as a little odd
    • They have a coordinator node (a la Dynamo)
    • They use read repair
    • Multiple replication policies
      • I found this cool when I first used Cassandra in 2010
      • Policies:
        1. Rack Unaware
        2. Rack Aware
        3. Datacenter Aware
      • Now the open source version has pluggable policies for other environments like EC2
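A rough sketch of the ring mechanics under the simplest policy: hash the key onto the ring, the first node clockwise is the coordinator, and the next distinct nodes fill out the replica set (essentially “Rack Unaware”; the Rack Aware and Datacenter Aware policies instead skip candidates until replicas land in other racks or datacenters). The code is my own illustration, not Cassandra's partitioner.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

// Sketch of consistent hashing with "Rack Unaware" replica placement:
// the coordinator is the first node clockwise from the key's hash, and
// the remaining replicas are the next distinct nodes on the ring.
public class RingSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // token -> node

    void addNode(String node, long token) { ring.put(token, node); }

    // Hash a key onto the ring (first 8 bytes of MD5, as a signed long).
    long token(String key) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
        long t = 0;
        for (int i = 0; i < 8; i++) t = (t << 8) | (d[i] & 0xff);
        return t;
    }

    // Walk clockwise from the key's token: the first node is the coordinator,
    // the next distinct nodes complete the replica set.
    List<String> replicasFor(String key, int replicationFactor) throws Exception {
        int wanted = Math.min(replicationFactor, ring.size());
        List<String> replicas = new ArrayList<>();
        Iterator<String> it = ring.tailMap(token(key), true).values().iterator();
        while (replicas.size() < wanted) {
            if (!it.hasNext()) it = ring.values().iterator(); // wrap around the ring
            String node = it.next();
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }

    public static void main(String[] args) throws Exception {
        RingSketch ring = new RingSketch();
        // The paper assigns one token per node and moves lightly loaded nodes
        // around the ring based on load, rather than using Dynamo-style v-nodes.
        ring.addNode("nodeA", Long.MIN_VALUE / 2);
        ring.addNode("nodeB", 0L);
        ring.addNode("nodeC", Long.MAX_VALUE / 2);
        System.out.println(ring.replicasFor("user:42", 3));
    }
}
```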
  • Bootstrapping
    • Uses seeds like Dynamo
    • Chooses a random position in the ring
    • Nodes are configured with a Cluster Name
      • A node will only talk to other nodes with the same Cluster Name (see the sketch after this section)
      • This saved my bacon many times when I was running multiple clusters on Amazon for a company I worked for
    • Explicit addition and removal of nodes to the cluster (like Dynamo)
    • At the time of the paper they were planning on having multiple replicas take part in the transfer to speed up the initial bootstrap
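A sketch of the bootstrap behaviour noted above, assuming a random token, a seed list, and a Cluster Name guard on gossip. The names and structure here are hypothetical; the real system does all of this through its gossip layer.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the bootstrap path: a new node picks a random token, announces
// itself to the configured seeds, and only accepts gossip from peers that
// share its Cluster Name. Names are illustrative, not Cassandra's API.
public class BootstrapSketch {
    record NodeState(String clusterName, long token) {}

    private final String clusterName;
    private final Map<String, NodeState> peers = new ConcurrentHashMap<>();

    BootstrapSketch(String clusterName) { this.clusterName = clusterName; }

    NodeState join(List<String> seeds) {
        // Choose a random position on the ring (the paper's initial scheme).
        NodeState me = new NodeState(clusterName, ThreadLocalRandom.current().nextLong());
        for (String seed : seeds) {
            System.out.println("announcing token " + me.token() + " to seed " + seed);
        }
        return me;
    }

    // Gossip guard: a node talks only to peers with the same Cluster Name,
    // which keeps separate clusters on shared infrastructure from merging.
    boolean acceptGossip(String fromHost, NodeState state) {
        if (!clusterName.equals(state.clusterName())) return false;
        peers.put(fromHost, state);
        return true;
    }

    public static void main(String[] args) {
        BootstrapSketch node = new BootstrapSketch("prod-cluster");
        node.join(List.of("10.0.0.1", "10.0.0.2"));
        System.out.println(node.acceptGossip("10.0.0.3", new NodeState("prod-cluster", 42L)));    // true
        System.out.println(node.acceptGossip("10.0.0.9", new NodeState("staging-cluster", 7L)));  // false
    }
}
```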
  • Membership
    • They use Zookeeper to elect a leader in charge of membership and ring management
    • Anti-entropy Gossip protocol (based on Scuttlebutt)
      • Also used to transmit other system state
    • They use a modified version of the Φ Accrual Failure Detector (a toy version is sketched after this section)
      • Essentially, the model gives a real number (Φ) for the level of suspicion that a node has failed
      • Adapts to current network conditions
      • I thought this was a cool idea!
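For intuition, here is a toy accrual failure detector. The paper's modification approximates gossip inter-arrival times with an exponential distribution, under which Φ has a simple closed form; this sketch follows that assumption and is not the production implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy accrual failure detector. Each heartbeat's inter-arrival time feeds a
// sliding window; suspicion Phi grows the longer we go without a heartbeat.
// Assuming exponentially distributed inter-arrival times with mean m,
// P(no heartbeat for t) = exp(-t/m), so Phi = -log10(P) = t / (m * ln 10).
public class PhiAccrualSketch {
    private final Deque<Double> intervals = new ArrayDeque<>();
    private final int windowSize = 1000;
    private double lastHeartbeatMillis = -1;

    void heartbeat(double nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > windowSize) intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    double phi(double nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToDouble(Double::doubleValue).average().orElse(1.0);
        double sinceLast = nowMillis - lastHeartbeatMillis;
        return sinceLast / (mean * Math.log(10)); // -log10(exp(-t/mean))
    }

    public static void main(String[] args) {
        PhiAccrualSketch detector = new PhiAccrualSketch();
        for (int i = 0; i <= 10; i++) detector.heartbeat(i * 1000.0); // one heartbeat per second
        System.out.printf("phi after  2s silence: %.2f%n", detector.phi(12_000));
        System.out.printf("phi after 10s silence: %.2f%n", detector.phi(20_000));
        // A node is marked suspect once Phi crosses a configurable threshold;
        // because the window tracks observed intervals, the detector adapts
        // to current network conditions automatically.
    }
}
```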
  • Local Persistence
    • Commit log on the node (on a dedicated disk for performance)
    • In-memory data structure that is persisted periodically (the write path is sketched after this section)
      • Per column family
      • Since the persisted files are immutable they do not need read locks
        • Though I’d imagine you’d have to take care during compaction with an atomic swap while ensuring there are no readers of the old files
    • They use bloom filters
    • An index for the columns per key so they can jump to the right spot on disk
      • A new index entry is generated at every 256K chunk boundary
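Putting the local-persistence pieces together, a rough sketch of the write and read paths: append to the commit log, apply to the in-memory table, flush to an immutable sorted file with a bloom filter over its keys, and read newest-first while skipping files the bloom filter rules out. This is my own simplification and omits, among other things, the per-key column indices mentioned above.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.TreeMap;

// Rough sketch of the local write path: append to a commit log, apply to an
// in-memory per-column-family table, and flush to an immutable sorted file
// with a bloom filter over its keys. Simplified for illustration.
public class WritePathSketch {
    private final List<String> commitLog = new ArrayList<>();         // stands in for the on-disk log
    private final TreeMap<String, String> memtable = new TreeMap<>(); // one column family, key -> value
    private final List<SSTable> sstables = new ArrayList<>();
    private static final int FLUSH_THRESHOLD = 4;

    // An immutable flushed file: sorted entries plus a bloom filter on keys,
    // so reads can skip files that cannot contain the key, without any locks.
    record SSTable(TreeMap<String, String> data, BitSet bloom) {
        boolean mightContain(String key) {
            return bloom.get(Math.floorMod(key.hashCode(), 1024))
                && bloom.get(Math.floorMod(key.hashCode() * 31 + 17, 1024));
        }
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value);   // 1. log the write (durability in the real system)
        memtable.put(key, value);           // 2. then apply to the in-memory structure
        if (memtable.size() >= FLUSH_THRESHOLD) flush();
    }

    private void flush() {
        BitSet bloom = new BitSet(1024);
        for (String key : memtable.keySet()) {
            bloom.set(Math.floorMod(key.hashCode(), 1024));
            bloom.set(Math.floorMod(key.hashCode() * 31 + 17, 1024));
        }
        sstables.add(new SSTable(new TreeMap<>(memtable), bloom));
        memtable.clear();
        commitLog.clear(); // the logged writes are no longer needed once flushed
    }

    String read(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        // Newest file first; the bloom filter avoids touching files that can't match.
        for (int i = sstables.size() - 1; i >= 0; i--) {
            SSTable t = sstables.get(i);
            if (t.mightContain(key) && t.data().containsKey(key)) return t.data().get(key);
        }
        return null;
    }

    public static void main(String[] args) {
        WritePathSketch store = new WritePathSketch();
        for (int i = 0; i < 6; i++) store.write("key" + i, "value" + i);
        System.out.println(store.read("key1")); // served from a flushed file
        System.out.println(store.read("key5")); // still in the memtable
    }
}
```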
  • Misc.
    • They don’t discuss any kind of range scans. The current open source version does have range scans.
    • They refer to a state machine for requests (sounds similar to Dynamo)
    • The different modules of the server “rely on an event driven substrate where the message processing pipeline and the task pipeline are split into multiple stages” like SEDA (a toy staged pipeline is sketched below)
    • System control messages are sent over UDP
    • Compaction looks for files that are close to each other in size (see the sketch below)
    • In the Inbox Search case, they send a signal to the cluster when a user clicks into the search box to prime the cache for the user
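The “files close to each other in size” heuristic is roughly what the open source size-tiered compaction strategy later formalized; here is a small sketch of the bucketing idea (my own illustration, with a made-up “within a factor” parameter):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the "compact files of similar size" heuristic: sort data files by
// size and group neighbours whose sizes are within a factor of each other;
// any bucket with enough files is a compaction candidate.
public class CompactionBucketsSketch {
    static List<List<Long>> buckets(List<Long> fileSizes, double withinFactor, int minFiles) {
        List<Long> sorted = new ArrayList<>(fileSizes);
        sorted.sort(Comparator.naturalOrder());

        List<List<Long>> candidates = new ArrayList<>();
        List<Long> bucket = new ArrayList<>();
        for (long size : sorted) {
            // Start a new bucket when this file is much larger than the bucket's first file.
            if (!bucket.isEmpty() && size > bucket.get(0) * withinFactor) {
                if (bucket.size() >= minFiles) candidates.add(bucket);
                bucket = new ArrayList<>();
            }
            bucket.add(size);
        }
        if (bucket.size() >= minFiles) candidates.add(bucket);
        return candidates;
    }

    public static void main(String[] args) {
        // File sizes in MB: three small files of similar size, one much larger.
        List<Long> sizes = List.of(10L, 12L, 11L, 400L);
        // Group files within 1.5x of each other, at least 2 per compaction.
        System.out.println(buckets(sizes, 1.5, 2)); // [[10, 11, 12]]
    }
}
```

And a toy illustration of the SEDA-style staged design quoted above: each stage owns its own queue and thread pool, and a request is handed from stage to stage as events rather than carried end to end by one thread. This is a generic sketch, not Cassandra's actual stages.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Tiny staged, event-driven pipeline in the spirit of SEDA: each stage has its
// own thread pool and queue, and work moves between stages as submitted tasks.
public class StagedPipelineSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService parseStage = Executors.newFixedThreadPool(2);
        ExecutorService storageStage = Executors.newFixedThreadPool(4);
        ExecutorService replyStage = Executors.newSingleThreadExecutor();

        for (int i = 0; i < 3; i++) {
            final int requestId = i;
            parseStage.submit(() -> {
                String mutation = "mutation-" + requestId;        // stage 1: decode the message
                storageStage.submit(() -> {
                    String result = mutation + " applied";        // stage 2: apply to storage
                    replyStage.submit(() ->                       // stage 3: send the response
                        System.out.println("reply: " + result));
                });
            });
        }

        parseStage.shutdown();
        parseStage.awaitTermination(1, TimeUnit.SECONDS);
        storageStage.shutdown();
        storageStage.awaitTermination(1, TimeUnit.SECONDS);
        replyStage.shutdown();
        replyStage.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```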