TAO: Facebook’s Distributed Data Store for the Social Graph - Facebook, 2013

  • [Paper] [Mirror]

  • Read-optimized store for the social graph
  • Uses MySQL for the backing store, but manages an in-memory cache that is aware of the graph structure
  • Issues with Memcache
    • Inefficient for large edge lists (you have to retrieve the whole list)
      • I wonder if memcache could be modified to return a certain number of bytes at a time. Then use a streaming serialization protocol (or framing)
    • Clients do not communicate with each other. This can lead to the infamous thundering herd problem
    • Read-after-write consistency is expensive
  • Data Model
    • objects: Typed nodes
      • Unique 64-bit ID
    • associations: Typed directed edges
      • Has a 32-bit time field
    • (It’s unclear why they chose these names)
    • Both objects and associations can have key/value pairs (determined by schema)
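    • A minimal sketch of this data model in Python (class and field names are mine, not TAO’s):

      ```python
      from dataclasses import dataclass, field
      from typing import Dict

      @dataclass
      class TaoObject:
          """Typed node; the 64-bit id also encodes which shard it lives on."""
          id: int
          otype: str                                          # e.g. "USER", "CHECKIN"
          data: Dict[str, str] = field(default_factory=dict)  # schema-determined key/value pairs

      @dataclass
      class TaoAssoc:
          """Typed directed edge from id1 to id2, identified by (id1, atype, id2)."""
          id1: int
          atype: str                                          # e.g. "FRIEND", "AUTHORED"
          id2: int
          time: int                                           # 32-bit time field; lists are kept newest-first
          data: Dict[str, str] = field(default_factory=dict)
      ```
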
  • API
    • No compare-and-set for objects
    • Associations can be kept in sync with their inverses automatically
    • The authors note that most data in a social graph is old and most queries are for recent data
      • Thus, associations are stored so that the most recent associations are easy to retrieve
    • I opted to not list all of the API here
    • The TAO API is mapped to a small set of SQL queries
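    • As a hedged illustration of that mapping: one way an assoc_range-style call could translate to SQL over a time-ordered association table (the table, column names, and query text are my guesses, not the paper’s schema):

      ```python
      # Illustrative only: association lists are stored so the newest associations
      # are cheapest to fetch, hence the ORDER BY time DESC.
      ASSOC_RANGE_SQL = """
          SELECT id2, time, data
          FROM assocs
          WHERE id1 = %s AND atype = %s
          ORDER BY time DESC
          LIMIT %s OFFSET %s
      """

      def assoc_range(cursor, id1, atype, pos, limit):
          """Return up to `limit` associations starting at offset `pos`, newest first."""
          cursor.execute(ASSOC_RANGE_SQL, (id1, atype, limit, pos))
          return cursor.fetchall()
      ```
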
  • Architecture
    • Storage Layer
      • MySQL is the backing store
      • They might be using RocksDB now
      • Many more shards than servers to allow for load balancing
      • The object ID includes the shard ID
      • Associations are stored on the same shard as their source object
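      • A sketch of the ID-based routing; the bit layout is an assumption, since the paper only says the shard ID is embedded in the object ID:

        ```python
        SHARD_BITS = 16          # assumption: how many low-order bits hold the shard ID

        def shard_of_object(object_id: int) -> int:
            """The object ID embeds its shard, so routing needs no lookup table."""
            return object_id & ((1 << SHARD_BITS) - 1)

        def shard_of_assoc(id1: int, atype: str, id2: int) -> int:
            """Associations live on the shard of their source object, so an entire
            out-edge list can be served from a single shard."""
            return shard_of_object(id1)
        ```
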
    • Caching Layer
      • Organized in tiers like Unicorn
        • Each tier can service any request
      • Clients determine which cache server to query
      • If there is a miss or a write, the cache server communicates with others
      • Cache is filled on demand and uses LRU
      • The cache understands the graph semantics and can answer a query even if that exact query has not been issued before
      • Write ops may involve two shards (inverse relationships)
        • There is no atomicity between the two writes
        • They have a job that repairs these relationships
      • Load is balanced among cloned shards; updates are sent to all clones
      • If an object is requested a lot, the cache server responds with a version
        • The client caches the data and version
        • In subsequent requests, the client sends the version and the cache server only returns data if the object has changed
      • Some objects have a large number of out edges
        • If a client is testing for the existence of an edge and the edge list does not fit in the cache, the request will need to go to MySQL
        • A solution is to query in the other direction if the destination vertex has fewer out-edges
        • Since an edge can only be created after its destination object exists, the search through the time-ordered edge list can stop once it passes the destination object’s known creation time
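        • A sketch of that early-termination check, assuming a newest-first (id2, time) edge list and a known creation time for the destination object:

          ```python
          from typing import Iterable, Tuple

          def edge_exists(out_edges_newest_first: Iterable[Tuple[int, int]],
                          dst_id: int, dst_created_at: int) -> bool:
              """Scan an (id2, time) list ordered newest-first.

              An edge to dst_id can only have been created after dst_id existed, so
              once we see edges older than dst_created_at we can stop early instead
              of walking the whole (possibly huge) list.
              """
              for id2, time in out_edges_newest_first:
                  if time < dst_created_at:
                      return False      # everything from here on predates dst_id
                  if id2 == dst_id:
                      return True
              return False
          ```
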
    • Client
      • Multiplexed connections to avoid head-of-line blocking
    • Leaders and Followers
      • Followers forward all writes and cache misses to the leader tier
      • Leader sends async cache maintenance messages to follower tier
        • Eventually Consistent
      • If a follower issues a write, the follower’s cache is updated synchronously
      • Each update message has a version number
      • Leader serializes writes
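      • A rough in-memory sketch of this write path (real followers and leaders are separate cache tiers talking over the network; the class and method names are mine):

        ```python
        class Follower:
            def __init__(self):
                self.cache = {}                    # key -> (value, version)

            def apply(self, key, value, version):
                # Version numbers let a follower drop stale or out-of-order messages.
                current = self.cache.get(key)
                if current is None or version > current[1]:
                    self.cache[key] = (value, version)

            def enqueue_async(self, key, value, version):
                # Sketch only: applied immediately here; a real follower applies these
                # later, which is why followers are only eventually consistent.
                self.apply(key, value, version)

            def write(self, leader, key, value):
                # Followers forward all writes (and cache misses) to the leader tier.
                leader.write(key, value, origin_follower=self)

        class Leader:
            """Serializes writes for its shards and fans out versioned maintenance messages."""
            def __init__(self, followers):
                self.followers = followers
                self.version = 0

            def write(self, key, value, origin_follower):
                self.version += 1                  # single writer => a per-shard total order
                # (the write would also be applied to the MySQL shard here)
                origin_follower.apply(key, value, self.version)    # synchronous for the writer
                for f in self.followers:
                    if f is not origin_follower:
                        f.enqueue_async(key, value, self.version)  # async for everyone else
        ```
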
    • Multi-DC
      • Read misses are more common than writes in the follower tier
      • They developed a scheme to service the read misses locally
      • They cannot afford a full replica in every DC
        • Instead they cluster DCs into regions that are close geographically
        • Shards of the dataset are arranged in a master/slave relationship
        • Followers send read misses and write requests to the local leader
          • Local leaders service read misses locally
          • Slave leaders forward writes to the master shard
            • The slave leader will update its cache ahead of the async updates to the persistent store
      • The same region serves many shards, and is thus the master for some and a slave for others
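      • A sketch of the per-shard master/slave roles (the class and its methods are mine; TAO’s leaders are cache tiers, not single objects):

        ```python
        class RegionLeader:
            """A region is the master for some shards and a slave for the rest."""
            def __init__(self, name, master_shards):
                self.name = name
                self.master_shards = set(master_shards)
                self.cache = {}       # (shard, key) -> value
                self.local_db = {}    # local replica, filled asynchronously from the master

            def read_miss(self, shard, key):
                # Read misses are serviced from the local region, even for slave shards.
                return self.local_db.get((shard, key))

            def write(self, shard, key, value, master_leader):
                if shard in self.master_shards:
                    self.local_db[(shard, key)] = value    # master applies the write
                else:
                    # A slave leader forwards the write to the master region, but updates
                    # its own cache right away, ahead of async replication into local_db.
                    master_leader.write(shard, key, value, master_leader)
                self.cache[(shard, key)] = value
        ```
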
  • Cache Servers
    • Memory is partitioned into arenas by association type
      • This mitigates the issues of poorly behaved association types
      • They can also change the lifetime for important associations
    • Small items have a lot of pointer overhead
      • They use a direct-mapped, 8-way associative cache with no per-entry pointers
      • Used for association counts
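      • A toy version of a pointer-free, set-associative cache for association counts (the sizes and eviction choice are assumptions):

        ```python
        NUM_SETS = 1024    # assumption: real sizes are not in my notes
        WAYS = 8

        # Flat arrays instead of per-entry heap objects: no pointer overhead per item.
        keys   = [[None] * WAYS for _ in range(NUM_SETS)]
        counts = [[0]    * WAYS for _ in range(NUM_SETS)]

        def _set_index(id1, atype):
            return hash((id1, atype)) % NUM_SETS

        def get_count(id1, atype):
            s = _set_index(id1, atype)
            for w in range(WAYS):
                if keys[s][w] == (id1, atype):
                    return counts[s][w]
            return None                # miss

        def put_count(id1, atype, count):
            s = _set_index(id1, atype)
            for w in range(WAYS):
                if keys[s][w] in (None, (id1, atype)):   # free slot or existing entry
                    keys[s][w], counts[s][w] = (id1, atype), count
                    return
            victim = 0                 # assumption: evict way 0; the real policy is not in my notes
            keys[s][victim], counts[s][victim] = (id1, atype), count
        ```
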
  • MySQL
    • Most objects are stored in the same table with a serialized data column
    • Some objects are given their own table with a more efficient schema
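    • A sketch of the generic object table and its (de)serialization; the DDL and the use of JSON are illustrative, not the paper’s actual schema or format:

      ```python
      import json

      # One table for most object types: everything not needed for lookup lives
      # in a single serialized blob column.
      OBJECTS_DDL = """
          CREATE TABLE objects (
              id    BIGINT UNSIGNED NOT NULL PRIMARY KEY,
              otype INT NOT NULL,
              data  MEDIUMBLOB       -- serialized key/value pairs
          )
      """

      def store_object(cursor, obj_id, otype, kv):
          blob = json.dumps(kv).encode()   # stand-in for whatever serialization TAO uses
          cursor.execute(
              "REPLACE INTO objects (id, otype, data) VALUES (%s, %s, %s)",
              (obj_id, otype, blob),
          )

      def load_object(cursor, obj_id):
          cursor.execute("SELECT otype, data FROM objects WHERE id = %s", (obj_id,))
          row = cursor.fetchone()
          if row is None:
              return None
          otype, blob = row
          return otype, json.loads(blob)
      ```
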
  • Consistency
    • Read-after-write within a tier
    • If a read is marked critical, it is sent to the master region
  • Failure Detection
    • “Aggressive network timeouts”
      • Per-destination timeouts; hosts that are down are remembered so the server avoids repeatedly trying to connect to them
      • Periodically check to see if the node is back up
    • When a slave store is down, a binlog tailer is added that is used to refill the data once the slave comes back up
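    • A sketch of the per-destination timeout bookkeeping described above (the constants are made up):

      ```python
      import time

      TIMEOUT_S   = 0.05    # assumption: aggressive per-request network timeout
      RETRY_AFTER = 10.0    # assumption: how long to treat a host as down before re-probing

      marked_down_at = {}   # host -> time of the last observed failure

      def should_skip(host):
          """Skip hosts recently marked down instead of attempting to connect,
          but let a request through periodically to check if they recovered."""
          t = marked_down_at.get(host)
          return t is not None and (time.monotonic() - t) < RETRY_AFTER

      def record_failure(host):
          marked_down_at[host] = time.monotonic()

      def record_success(host):
          marked_down_at.pop(host, None)
      ```
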
  • In reference to Dynamo:

    “TAO accepts lower write availability than Dynamo in exchange for avoiding the programming complexities that arise from multi-master conflict resolution.”