If you haven’t read the Dynamo paper, I suggest you read it before this paper.
- Originally built for Facebook's Inbox Search service
- Data Model
- Operations under a single row key are atomic per replica
- Columns are grouped in column families like Bigtable
- They also include a new concept of super column families
- A column family within a column family
- The application can dictate the sort order of columns within a super column family or column family
- In Inbox Search, they sort columns by time so recent results appear first
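To make the hierarchy concrete, here is a rough Python sketch of how I picture a row in a super column family, with columns kept sorted by time the way Inbox Search does. The names and structure below are made up for illustration; this is not Cassandra's actual API.

```python
import bisect

# Illustrative only (invented names): a row key maps to a super column family,
# each super column holds columns, and columns are kept sorted by timestamp so
# the most recent entries are easy to read back.
row = {
    "row_key": "user:42",
    "terms": {                       # a super column family for search terms
        "hello": {                   # one super column per term
            "columns": [             # (timestamp, message_id) pairs
                (1280000000, "msg_003"),
                (1280000100, "msg_007"),
            ],
        },
    },
}

def insert_column(super_column, timestamp, value):
    """Keep columns ordered by timestamp so a time-sorted read is a slice."""
    bisect.insort(super_column["columns"], (timestamp, value))

def most_recent(super_column, n=10):
    """The n newest columns, newest first."""
    return list(reversed(super_column["columns"][-n:]))

insert_column(row["terms"]["hello"], 1280000200, "msg_011")
print(most_recent(row["terms"]["hello"], 2))   # msg_011 first, then msg_007
```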
- Partitioning and Replication
- Want incremental scalability (like Dynamo)
- They use consistent hashing (like Dynamo)
- Random positioning leads to non-uniform load
- They note that it also ignores the heterogeneity of servers
- They didn’t use the Dynamo technique of v-nodes
- They instead analyze load information and move nodes around the ring
- They argue for ease of implementation and that it allows them to “make
very deterministic choices about the load balancing”
- The first reason struck me as a little odd
- They have a coordinator node (a la Dynamo)
- They use read repair
- Multiple replication policies
- I found this cool when I first used Cassandra in 2010
- Policies:
- Rack Unaware
- Rack Aware
- Datacenter Aware
- Now the open source version has pluggable policies for other environments
like S3
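For intuition about the partitioning and the simplest placement policy, here is a bare-bones consistent-hashing ring with Rack Unaware replication: one token per node (no virtual nodes, as in the paper), a key's coordinator is the first node clockwise from its hash, and the remaining replicas are the next nodes around the ring. This is a generic sketch of the technique, not Cassandra's code; Rack and Datacenter Aware policies add constraints on which of those nodes qualify.

```python
import bisect
import hashlib

def _hash(s: str) -> int:
    # Any stable hash works for the sketch; md5 is just convenient here.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # One position (token) per node, sorted around the ring.
        self.positions = sorted((_hash(n), n) for n in nodes)

    def replicas(self, key: str):
        """Coordinator plus the next rf-1 distinct nodes walking clockwise."""
        tokens = [t for t, _ in self.positions]
        i = bisect.bisect_right(tokens, _hash(key)) % len(self.positions)
        return [self.positions[(i + j) % len(self.positions)][1]
                for j in range(min(self.rf, len(self.positions)))]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("user:42"))   # e.g. ['node-c', 'node-d', 'node-a']
```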
- Bootstrapping
- Uses seeds like Dynamo
- Chooses a random position in the ring
- Nodes are configured with a Cluster Name
- A node will only talk to other nodes with the same Cluster Name
- This saved my bacon many times when I was running multiple clusters on
Amazon for a company I worked for
- Explicit addition and removal of nodes to the cluster (like Dynamo)
- At the time of the paper they were planning to let a joining node transfer
data from multiple replicas to speed up the initial bootstrap
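A toy version of the two join-time details above, the random ring position and the Cluster Name guard (names invented; this is not the real bootstrap protocol):

```python
import random

RING_SIZE = 2**127

class Node:
    def __init__(self, name, cluster_name, seeds):
        self.name = name
        self.cluster_name = cluster_name
        self.seeds = seeds                        # seed nodes, as in Dynamo
        self.token = random.randrange(RING_SIZE)  # random ring position

    def should_gossip_with(self, peer) -> bool:
        # The cluster-name guard that keeps separate clusters from merging.
        return peer.cluster_name == self.cluster_name
```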
- Membership
- They use Zookeeper to elect a leader in charge of membership and ring
management
- Anti-entropy Gossip protocol (based on Scuttlebutt)
- Also used to transmit other system state
- They use a modified version of the Φ Accrual Failure Detector
- Essentially, the model outputs a real number (Φ) representing the level of
suspicion that a node has failed, rather than a boolean up/down
- Adapts to current network conditions
- I thought this was a cool idea!
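Here is a simplified accrual detector in that spirit: it tracks recent heartbeat inter-arrival times and converts "time since the last heartbeat" into a suspicion level, assuming exponentially distributed arrivals as the paper does. This is my own simplification, not Cassandra's modified implementation.

```python
import math
import time

class AccrualDetector:
    def __init__(self, window=1000):
        self.intervals = []          # recent heartbeat inter-arrival times
        self.last = None
        self.window = window

    def heartbeat(self, now=None):
        now = time.time() if now is None else now
        if self.last is not None:
            self.intervals.append(now - self.last)
            self.intervals = self.intervals[-self.window:]
        self.last = now

    def phi(self, now=None) -> float:
        """Suspicion level; a threshold (say phi > 8) decides 'failed'."""
        now = time.time() if now is None else now
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last
        # -log10 of the probability of seeing a gap this long under an
        # exponential model of inter-arrival times.
        return (elapsed / mean) * math.log10(math.e)
```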
- Local Persistence
- Commit log on the node (on a dedicated disk for performance)
- In-memory data structure that is flushed to disk when it crosses a size threshold
- One per column family
- Since the persisted files are immutable they do not need read locks
- Though I’d imagine you’d have to take care during compaction, with an
atomic swap while ensuring there are no readers of the old files
- They use bloom filters
- An index for the columns per key so they can jump to the right spot on disk
- A new index entry at every 256K chunk boundary
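A toy version of that write path: flush an in-memory table to an immutable sorted file, building a bloom filter over its keys and a sparse index with an entry roughly every 256K of output. The file format, hash choices, and sizes are invented for illustration; this is not Cassandra's on-disk format.

```python
import hashlib

INDEX_INTERVAL = 256 * 1024   # one sparse index entry per ~256K of file

class BloomFilter:
    def __init__(self, bits=8192, hashes=3):
        self.bits, self.hashes, self.bitset = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.bitset |= 1 << p

    def might_contain(self, key):
        return all(self.bitset & (1 << p) for p in self._positions(key))

def flush(memtable: dict, path: str):
    """Write sorted key/value lines, a bloom filter, and a sparse index."""
    bloom, index, offset, next_index_at = BloomFilter(), [], 0, 0
    with open(path, "w") as f:
        for key in sorted(memtable):
            if offset >= next_index_at:
                index.append((key, offset))
                next_index_at += INDEX_INTERVAL
            line = f"{key}\t{memtable[key]}\n"
            f.write(line)
            offset += len(line)
            bloom.add(key)
    return bloom, index

bloom, index = flush({"user:42": "hello", "user:7": "world"}, "/tmp/sstable-1.txt")
print(bloom.might_contain("user:42"), index)   # True, [('user:42', 0)]
```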
- Misc.
- They don’t discuss any kind of range scans. The current open source version
does have range scans.
- They refer to a state machine for requests (sounds similar to Dynamo)
- The different modules of the server “rely on an event driven substrate where
the message processing pipeline and the task pipeline are split into
multiple stages” like SEDA
- System control messages are sent over UDP
- Compaction looks for files that are close to each other in size (sketched
at the end of these notes)
- In the Inbox Search case, they send a signal to the cluster when a user
clicks into the search box to prime the cache for the user
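To round out the compaction point above: a toy version of grouping files "close to each other in size" into buckets, where each bucket keeps files within a factor of two of its running average. The factor and bucketing rule are invented; they just illustrate the idea of only compacting similarly sized files together.

```python
def buckets_by_size(file_sizes, factor=2.0):
    """Group file sizes into buckets of roughly similar size."""
    buckets = []   # each bucket: [running_average, [sizes]]
    for size in sorted(file_sizes):
        for b in buckets:
            if b[0] / factor <= size <= b[0] * factor:
                b[1].append(size)
                b[0] = sum(b[1]) / len(b[1])
                break
        else:
            buckets.append([size, [size]])
    return [sizes for _, sizes in buckets]

print(buckets_by_size([10, 12, 11, 100, 110, 4000]))
# -> [[10, 11, 12], [100, 110], [4000]]
```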