Dynamo: Amazon's Highly Available Key-value Store - Amazon, 2007

  • [Paper] [Mirror]

  • Noticed:
    • Most applications only needed primary key access to data
    • They needed availability, not consistency
    • Originally developed in service of the shopping cart
    • No operations spanning keys
  • High Level Features
    • Decentralized
      • They say this was to make the system simpler; interestingly, the Bigtable authors chose a single-master approach for simplicity. Not exactly a fair comparison, though.
    • Optimized for small objects (less than 1 MB)
    • Nodes can be added or removed without manual partitioning
      • Though nodes are supposed to be added with a tool
    • Incremental scalability
      • Can add one node at a time, not some multiple of N
    • Symmetry
      • No distinguished nodes
      • Any node can handle a get or put
        • Forwards it to the responsible node
        • Some clients opt to be intelligent and talk to the responsible node directly
  • Dynamo is a “zero-hop DHT”
    • Consistent hashing
    • They use v-nodes
      • Load from a failed node is evenly dispersed
      • The number of v-nodes assigned to a machine can be set based on its capacity, which handles heterogeneous hardware
      • Preference List: List of nodes responsible for storing a key
        • Not stored, obviously; it is calculated on the fly
        • Contains more than N nodes for availability
        • The list skips v-nodes whose physical node is already on it, so replicas land on distinct machines
        • If the first N nodes are healthy, only the first N nodes will receive a write or a read
        • Any of the first N nodes can act as Coordinator
          • The coordinator is responsible for replicating to the next N-1 nodes on the ring (a toy ring and preference-list sketch follows this section)
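    • A minimal sketch (mine, not the paper's exact scheme) of consistent hashing with v-nodes and preference-list calculation; the MD5 placement of tokens, node names, and token counts are illustrative:

```python
import hashlib
from bisect import bisect_right

def hash_key(key: str) -> int:
    # The paper uses MD5 to map a key to a 128-bit position on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, tokens_per_node: dict[str, int]):
        # More tokens for beefier machines, fewer for smaller ones.
        self._ring = sorted(
            (hash_key(f"{node}#v{i}"), node)
            for node, count in tokens_per_node.items()
            for i in range(count)
        )
        self._positions = [pos for pos, _ in self._ring]

    def preference_list(self, key: str, n: int) -> list[str]:
        """Walk clockwise from the key's position, skipping v-nodes whose
        physical node already appears, until n distinct nodes are collected."""
        start = bisect_right(self._positions, hash_key(key))
        nodes: list[str] = []
        for i in range(len(self._ring)):
            _, node = self._ring[(start + i) % len(self._ring)]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == n:
                break
        return nodes

ring = Ring({"node-a": 8, "node-b": 8, "node-c": 4})  # node-c: a smaller machine
print(ring.preference_list("cart:12345", n=3))
```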
  • Data Versioning
    • They use vector clocks
      • I love vector clocks. My first experience with them was when I used Riak (a Dynamo based KV-store).
    • On a read, a client may receive multiple conflicting versions and the associated vector clocks (inside of a version context).
      • The client can decide how to reconcile the conflict
        • In practice, I’ve found that this means creating a Model or DAO to ensure consistent conflict resolution across a code base.
      • The client sends the new version along with the context the server gave it
        • This is to make clear which version the client is updating
    • Since usually only the first N nodes in the preference list write data, the vector clock will not be that long. Though it will still grow as the cluster changes over time (and with failures)
      • Each (node, counter) entry in the clock also stores a wall-clock timestamp of that node’s last update to the item (a toy clock sketch follows this section)
      • When the vector clock reaches a threshold size, they remove the oldest vector entry
      • I found it interesting that they wouldn’t remove based on time. My concern is that if the cluster as a whole becomes unstable (say power failure or core switch failure) the vector clock could grow rapidly but old versions would be difficult to merge later.

        Perhaps they do not remove the oldest entry if it’s recent.
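    • A rough sketch of the core vector clock operations (the “descends” check, the coordinator-side increment, and size truncation); the threshold value and the data layout are my guesses:

```python
import time

THRESHOLD = 10  # the paper truncates past some threshold; this value is a guess

def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen everything clock `b` has (a >= b)."""
    return all(a.get(node, (0, 0))[0] >= counter for node, (counter, _) in b.items())

def increment(clock: dict, node: str) -> dict:
    """Coordinator bumps its own entry; each entry is (counter, wall-clock timestamp)."""
    counter, _ = clock.get(node, (0, 0.0))
    new = dict(clock, **{node: (counter + 1, time.time())})
    if len(new) > THRESHOLD:
        # Drop the entry with the oldest timestamp, as the paper describes.
        oldest = min(new, key=lambda n: new[n][1])
        del new[oldest]
    return new

v1 = increment({}, "node-a")        # {node-a: (1, t)}
v2 = increment(v1, "node-b")        # descends from v1
forked = increment(v1, "node-c")    # concurrent with v2: a conflict to reconcile
print(descends(v2, v1), descends(v2, forked), descends(forked, v2))  # True False False
```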

  • Consistency
    • Standard quorum based (R, W, N)
      • I say standard but that is partially from my experience with Cassandra
    • R and W are client tunable
    • Some applications use R=1, W=N for an authoritative persistent cache
    • They use a sloppy quorum
      • The first N healthy nodes are used, not the first N nodes in the list
    • Hinted Handoff
      • A client can set W to 1 and as long as 1 node confirms the write it will be considered a success
      • When more nodes come back online, hinted handoff is used to deliver the data to the responsible node
      • The hinted replica is kept in a separate local DB along with metadata naming the node it was intended for
      • This works especially well across data centers, where a link failure can temporarily cut off a large portion of the cluster (a small write-path sketch follows this section)
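    • A minimal sketch, under my own assumptions about the data structures, of a sloppy-quorum write plus hinted handoff; N, W, the node names, and the in-memory stores are illustrative:

```python
from collections import defaultdict

N, W = 3, 2  # replication factor and write quorum, chosen for illustration

local_store = defaultdict(dict)   # node -> {key: value}
hint_store = defaultdict(list)    # node -> [(intended_node, key, value), ...]

def sloppy_write(key, value, preference_list, healthy):
    """Write to the first N *healthy* nodes walking the preference list.
    A stand-in node (outside the key's top N) records a hint naming the
    intended node so it can hand the data off later."""
    intended = preference_list[:N]
    targets = [node for node in preference_list if node in healthy][:N]
    missing = [node for node in intended if node not in targets]  # down nodes being covered for
    acks = 0
    for node in targets:
        if node in intended:
            local_store[node][key] = value
        else:
            hint_store[node].append((missing.pop(0), key, value))
        acks += 1
    return acks >= W  # success once W replicas confirm

def handoff(node, healthy):
    """When an intended node is reachable again, deliver hinted data and drop the hint."""
    remaining = []
    for intended_node, key, value in hint_store[node]:
        if intended_node in healthy:
            local_store[intended_node][key] = value
        else:
            remaining.append((intended_node, key, value))
    hint_store[node] = remaining

pref = ["node-a", "node-b", "node-c", "node-d"]
print(sloppy_write("cart:1", {"item": 42}, pref, healthy={"node-a", "node-c", "node-d"}))
handoff("node-d", healthy={"node-a", "node-b", "node-c", "node-d"})
print(local_store["node-b"])  # node-b now has the value it missed
```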
  • Replica Synchronization
    • Use a Merkle tree to minimize the traffic required to compare v-nodes (a toy tree-diff sketch follows this section)
    • Care must be taken to not change the range of keys serviced by a v-node too much as that requires recomputing many Merkle trees
    • Question: Could the Merkle tree be constructed in such a way that only a smaller portion of the tree needs to be recalculated?
    • Read repair
      • When the coordinator sends a read for a key to the other nodes, it waits a bit to receive late replies
        • Even if it has received enough replies to satisfy the clients R
      • It compares the versions returned and updates stale replicas
      • This takes load off the anti-entropy protocol
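    • A toy sketch of the idea: build a fixed-depth Merkle tree over a v-node’s keys and descend only into subtrees whose hashes differ; the bucketing scheme and depth are my choices, not the paper’s:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(items: dict, depth: int = 3) -> list:
    """Hash keys into 2**depth buckets, hash each bucket into a leaf, then
    hash pairs of children upward; the root summarizes the whole key range."""
    buckets = [[] for _ in range(2 ** depth)]
    for key in sorted(items):
        idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** depth)
        buckets[idx].append(h(key.encode() + items[key]))
    levels = [[h(b"".join(bucket)) for bucket in buckets]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels  # levels[-1][0] is the root hash

def diff(a, b, level=None, idx=0):
    """Descend from the roots, exploring only subtrees whose hashes differ;
    returns the leaf buckets whose keys actually need synchronizing."""
    if level is None:
        level = len(a) - 1
    if a[level][idx] == b[level][idx]:
        return []
    if level == 0:
        return [idx]
    return diff(a, b, level - 1, 2 * idx) + diff(a, b, level - 1, 2 * idx + 1)

replica1 = {"cart:1": b"v1", "cart:2": b"v2"}
replica2 = {"cart:1": b"v1", "cart:2": b"stale"}
print(diff(build_tree(replica1), build_tree(replica2)))  # only the divergent bucket
```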
  • Membership
    • Noticed: Not being able to reach a node doesn’t usually mean it will never come back.
    • Have explicit membership addition and removal
    • Gossip-based protocol for membership history propagation
      • History, not a list; otherwise how do you know who is right? (small merge sketch after this section)
      • Each node contacts a random node every second to compare membership histories
    • When a node is added, other nodes will notice the change in the ring and offer to send the appropriate data to the new node
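    • A tiny sketch of why a history gossips better than a bare membership list: two nodes can union their change logs and deterministically agree on the current view (the log format here is my own, not the paper’s):

```python
def merge_histories(a: list, b: list) -> list:
    """Union two membership-change logs (entries are (seq, node, op) tuples)
    so that both gossip partners end up with the same history."""
    return sorted(set(a) | set(b))

def current_view(history: list) -> set:
    """Replay the log to derive the membership view it implies."""
    view = set()
    for _, node, op in sorted(history):
        if op == "join":
            view.add(node)
        else:
            view.discard(node)
    return view

h1 = [(1, "node-a", "join"), (2, "node-b", "join")]
h2 = [(1, "node-a", "join"), (3, "node-b", "leave")]
merged = merge_histories(h1, h2)
print(current_view(merged))  # {'node-a'}: both peers agree after exchanging histories
```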
  • Failure Detection
    • Local notion of failure is fine
      • If a node can’t reach a given peer it doesn’t matter to it if the rest of the cluster can
    • Without client requests nodes don’t really care which nodes are up or down
      • This assumes the data is stored durably somewhere
  • Partitioning and Placement strategies
    • They wanted to separate partitioning from placement
    1. T random tokens per node and partition by token value
      • When a node joins, many nodes have to scan their local persistent stores to find the key ranges to hand off to it
      • Merkle trees must be recalculated for many nodes
      • Snapshots are hard
      • I hadn’t thought of this aspect of managing the cluster, even though I’ve snapshotted Cassandra clusters
    2. T random tokens per node and equal sized partitions
      • “A partition is placed on the first N unique nodes that are encountered while walking … [the] ring clockwise from the end of the partition”
    3. Q/S tokens per node, equal sized partitions
    • Strategy 3 was chosen for faster bootstrapping / recovery and for easy snapshots (a toy placement sketch follows this section)
    • However, coordination is required for changes of node membership
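    • A minimal sketch of the Strategy 3 idea: Q fixed, equal-sized partitions with placement decided per partition rather than per token; Q, N, and the round-robin assignment are illustrative, not the paper’s actual coordination mechanism:

```python
import hashlib

Q = 16  # number of fixed, equal-sized partitions (illustrative)
N = 3   # replication factor

def partition_of(key: str) -> int:
    """With equal-sized partitions, placement is just a bucket lookup:
    partition boundaries never move, so Merkle trees and snapshots stay stable."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)  # MD5 gives a 128-bit value
    return h * Q // 2 ** 128

def assign_partitions(nodes: list) -> dict:
    """Toy placement: spread the Q partitions over S nodes so each owns roughly
    Q/S of them, replicating each partition on N distinct nodes. Real Dynamo
    coordinates this assignment when membership changes."""
    return {
        p: [nodes[(p + i) % len(nodes)] for i in range(min(N, len(nodes)))]
        for p in range(Q)
    }

nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = assign_partitions(nodes)
p = partition_of("cart:12345")
print(p, placement[p])
```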
  • Misc.
    • Each read or write request creates its own state machine on the node handling it (one per request)
      • I thought this was a cool way to organize a potentially complex flow of operations (once you factor in the preference list and retries). Sure beats a ton of flags about state. (Tiny sketch below.)
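      • A tiny sketch of the idea as I understand it; the states and the reply format are made up, just to show one state machine instance per read:

```python
from enum import Enum, auto

class ReadState(Enum):
    SEND = auto()
    WAIT_FOR_R = auto()
    RESPOND = auto()
    READ_REPAIR = auto()
    DONE = auto()

def read_coordinator(replies, r):
    """One state machine instance per read: walk explicit states instead of
    juggling flags. `replies` yields (node, version) responses as they arrive."""
    state, received = ReadState.SEND, []
    while state is not ReadState.DONE:
        if state is ReadState.SEND:
            state = ReadState.WAIT_FOR_R          # requests fanned out to replicas here
        elif state is ReadState.WAIT_FOR_R:
            received.append(next(replies))        # block until the next reply
            if len(received) >= r:
                state = ReadState.RESPOND
        elif state is ReadState.RESPOND:
            print("reply to client with", received)
            state = ReadState.READ_REPAIR
        elif state is ReadState.READ_REPAIR:
            received.extend(replies)              # absorb late replies, then repair stale nodes
            state = ReadState.DONE

replies = iter([("node-a", "v2"), ("node-b", "v2"), ("node-c", "v1")])
read_coordinator(replies, r=2)
```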