TAO: Facebook’s Distributed Data Store for the Social Graph - Facebook, 2013

  • [Paper] [Mirror]

  • Read-optimized store for the social graph
  • Uses MySQL for the backing store, but manages an in-memory cache that is aware of the graph structure
  • Issues with Memcache
    • Inefficient for large edge lists (you have to retrieve the whole list)
      • I wonder if memcache could be modified to return a certain number of bytes at a time. Then use a streaming serialization protocol (or framing)
    • Clients do not communicate with each other. This can lead to the infamous thundering herd problem
    • Read-after-write consistency is expensive
  • Data Model
    • objects: Typed nodes
      • Unique 64-bit ID
    • associations: Typed directed edges
      • Has a 32-bit time field
    • (It’s unclear why they chose these names)
    • Both objects and associations can have key/value pairs (determined by schema)
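    • A minimal sketch of this data model in Python (class and field names are mine, not TAO’s):

      ```python
      from dataclasses import dataclass, field
      from typing import Dict

      @dataclass
      class TaoObject:
          """Typed node; the 64-bit id also encodes which shard it lives on."""
          id: int
          otype: str                                          # e.g. "USER", "CHECKIN"
          data: Dict[str, str] = field(default_factory=dict)  # schema-determined key/value pairs

      @dataclass
      class TaoAssoc:
          """Typed directed edge from id1 to id2, identified by (id1, atype, id2)."""
          id1: int
          atype: str                                          # e.g. "FRIEND", "AUTHORED"
          id2: int
          time: int                                           # 32-bit time field; lists are kept newest-first
          data: Dict[str, str] = field(default_factory=dict)
      ```
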
  • API
    • No compare-and-set for objects
    • Associations can be kept in sync with their inverses automatically
    • The authors note that most data in a social graph is old and most queries are for recent data
      • Thus, associations are stored so that the most recent associations are easy to retrieve
    • I opted to not list all of the API here
    • The TAO API is mapped to a small set of SQL queries
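    • As a hedged illustration of that mapping: one way an assoc_range-style call could translate to SQL over a time-ordered association table (the table, column names, and query text are my guesses, not the paper’s schema):

      ```python
      # Illustrative only: association lists are stored so the newest associations
      # are cheapest to fetch, hence the ORDER BY time DESC.
      ASSOC_RANGE_SQL = """
          SELECT id2, time, data
          FROM assocs
          WHERE id1 = %s AND atype = %s
          ORDER BY time DESC
          LIMIT %s OFFSET %s
      """

      def assoc_range(cursor, id1, atype, pos, limit):
          """Return up to `limit` associations starting at offset `pos`, newest first."""
          cursor.execute(ASSOC_RANGE_SQL, (id1, atype, limit, pos))
          return cursor.fetchall()
      ```
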
  • Architecture
    • Storage Layer
      • MySQL is the backing store
      • They might be using RocksDB now
      • Many more shards than servers to allow for load balancing
      • The object ID includes the shard ID
      • Associations are stored on the same shard as their source object
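      • A sketch of the ID-based routing; the bit layout is an assumption, since the paper only says the shard ID is embedded in the object ID:

        ```python
        SHARD_BITS = 16          # assumption: how many low-order bits hold the shard ID

        def shard_of_object(object_id: int) -> int:
            """The object ID embeds its shard, so routing needs no lookup table."""
            return object_id & ((1 << SHARD_BITS) - 1)

        def shard_of_assoc(id1: int, atype: str, id2: int) -> int:
            """Associations live on the shard of their source object, so an entire
            out-edge list can be served from a single shard."""
            return shard_of_object(id1)
        ```
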
    • Caching Layer
      • Organized in tiers like Unicorn
        • Each tier can service any request
      • Clients determine which cache server to query
      • If there is a miss or a write, the cache server communicates with others
      • Cache is filled on demand and uses LRU
      • The cache understands the graph semantics and can answer a query even if that exact query has not been issued before
      • Write ops may involve two shards (inverse relationships)
        • There is no atomicity between the two writes
        • They have a job that repairs these relationships
      • Load is balanced among cloned shards; updates are sent to all clones
      • If an object is requested a lot, the cache server responds with a version
        • The client caches the data and version
        • In subsequent requests, the client sends the version and the cache server only returns data if the object has changed
      • Some objects have a large number of out edges
        • If a client is testing for the existence of an edge and the edge list does not fit in the cache, the request will need to go to MySQL
        • A solution is to query in the other direction if the destination vertex has fewer out-edges
        • Since an edge can only be created after its destination object exists, the search through the time-ordered edge list can stop once it passes the destination object’s known creation time
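        • A sketch of that early-termination check, assuming a newest-first (id2, time) edge list and a known creation time for the destination object:

          ```python
          from typing import Iterable, Tuple

          def edge_exists(out_edges_newest_first: Iterable[Tuple[int, int]],
                          dst_id: int, dst_created_at: int) -> bool:
              """Scan an (id2, time) list ordered newest-first.

              An edge to dst_id can only have been created after dst_id existed, so
              once we see edges older than dst_created_at we can stop early instead
              of walking the whole (possibly huge) list.
              """
              for id2, time in out_edges_newest_first:
                  if time < dst_created_at:
                      return False      # everything from here on predates dst_id
                  if id2 == dst_id:
                      return True
              return False
          ```
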
    • Client
      • Multiplexed connections to avoid head-of-line blocking
    • Leaders and Followers
      • Followers forward all writes and cache misses to the leader tier
      • Leader sends async cache maintenance messages to follower tier
        • Eventually Consistent
      • If a follower issues a write, the follower’s cache is updated synchronously
      • Each update message has a version number
      • Leader serializes writes
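      • A rough in-memory sketch of this write path (real followers and leaders are separate cache tiers talking over the network; the class and method names are mine):

        ```python
        class Follower:
            def __init__(self):
                self.cache = {}                    # key -> (value, version)

            def apply(self, key, value, version):
                # Version numbers let a follower drop stale or out-of-order messages.
                current = self.cache.get(key)
                if current is None or version > current[1]:
                    self.cache[key] = (value, version)

            def enqueue_async(self, key, value, version):
                # Sketch only: applied immediately here; a real follower applies these
                # later, which is why followers are only eventually consistent.
                self.apply(key, value, version)

            def write(self, leader, key, value):
                # Followers forward all writes (and cache misses) to the leader tier.
                leader.write(key, value, origin_follower=self)

        class Leader:
            """Serializes writes for its shards and fans out versioned maintenance messages."""
            def __init__(self, followers):
                self.followers = followers
                self.version = 0

            def write(self, key, value, origin_follower):
                self.version += 1                  # single writer => a per-shard total order
                # (the write would also be applied to the MySQL shard here)
                origin_follower.apply(key, value, self.version)    # synchronous for the writer
                for f in self.followers:
                    if f is not origin_follower:
                        f.enqueue_async(key, value, self.version)  # async for everyone else
        ```
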
    • Multi-DC
      • Read misses are more common than writes in the follower tier
      • They developed a scheme to service the read misses locally
      • They cannot afford a full replica in every DC
        • Instead they cluster DCs into regions that are close geographically
        • Shards of the dataset are arranged in a master/slave relationship
        • Followers send read misses and write requests to the local leader
          • Local leaders service read misses locally
          • Slave leaders forward writes to the master shard
            • The slave leader will update its cache ahead of the async updates to the persistent store
      • The same region serves many shards, and is thus the master for some and a slave for others
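      • A sketch of the per-shard master/slave roles (the class and its methods are mine; TAO’s leaders are cache tiers, not single objects):

        ```python
        class RegionLeader:
            """A region is the master for some shards and a slave for the rest."""
            def __init__(self, name, master_shards):
                self.name = name
                self.master_shards = set(master_shards)
                self.cache = {}       # (shard, key) -> value
                self.local_db = {}    # local replica, filled asynchronously from the master

            def read_miss(self, shard, key):
                # Read misses are serviced from the local region, even for slave shards.
                return self.local_db.get((shard, key))

            def write(self, shard, key, value, master_leader):
                if shard in self.master_shards:
                    self.local_db[(shard, key)] = value    # master applies the write
                else:
                    # A slave leader forwards the write to the master region, but updates
                    # its own cache right away, ahead of async replication into local_db.
                    master_leader.write(shard, key, value, master_leader)
                self.cache[(shard, key)] = value
        ```
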
  • Cache Servers
    • Memory is partitioned into arenas by association type
      • This mitigates the issues of poorly behaved association types
      • They can also change the lifetime for important associations
    • Small items have a lot of pointer overhead
      • They use a direct-mapped, 8-way associative cache with no per-entry pointers
      • Used for association counts
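      • A toy version of a pointer-free, set-associative cache for association counts (the sizes and eviction choice are assumptions):

        ```python
        NUM_SETS = 1024    # assumption: real sizes are not in my notes
        WAYS = 8

        # Flat arrays instead of per-entry heap objects: no pointer overhead per item.
        keys   = [[None] * WAYS for _ in range(NUM_SETS)]
        counts = [[0]    * WAYS for _ in range(NUM_SETS)]

        def _set_index(id1, atype):
            return hash((id1, atype)) % NUM_SETS

        def get_count(id1, atype):
            s = _set_index(id1, atype)
            for w in range(WAYS):
                if keys[s][w] == (id1, atype):
                    return counts[s][w]
            return None                # miss

        def put_count(id1, atype, count):
            s = _set_index(id1, atype)
            for w in range(WAYS):
                if keys[s][w] in (None, (id1, atype)):   # free slot or existing entry
                    keys[s][w], counts[s][w] = (id1, atype), count
                    return
            victim = 0                 # assumption: evict way 0; the real policy is not in my notes
            keys[s][victim], counts[s][victim] = (id1, atype), count
        ```
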
  • MySQL
    • Most objects are stored in the same table with a serialized data column
    • Some objects are given their own table with a more efficient schema
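    • A sketch of the generic object table and its (de)serialization; the DDL and the use of JSON are illustrative, not the paper’s actual schema or format:

      ```python
      import json

      # One table for most object types: everything not needed for lookup lives
      # in a single serialized blob column.
      OBJECTS_DDL = """
          CREATE TABLE objects (
              id    BIGINT UNSIGNED NOT NULL PRIMARY KEY,
              otype INT NOT NULL,
              data  MEDIUMBLOB       -- serialized key/value pairs
          )
      """

      def store_object(cursor, obj_id, otype, kv):
          blob = json.dumps(kv).encode()   # stand-in for whatever serialization TAO uses
          cursor.execute(
              "REPLACE INTO objects (id, otype, data) VALUES (%s, %s, %s)",
              (obj_id, otype, blob),
          )

      def load_object(cursor, obj_id):
          cursor.execute("SELECT otype, data FROM objects WHERE id = %s", (obj_id,))
          row = cursor.fetchone()
          if row is None:
              return None
          otype, blob = row
          return otype, json.loads(blob)
      ```
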
  • Consistency
    • Read-after-write within a tier
    • If a read is marked critical, it is sent to the master region
  • Failure Detection
    • “Aggressive network timeouts”
      • Per-destination timeouts; hosts that are down are remembered so the server avoids repeatedly trying to connect to them
      • Periodically check to see if the node is back up
    • When a slave store is down, a binlog tailer is added that is used to refill the data once the slave comes back up
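    • A sketch of the per-destination timeout bookkeeping described above (the constants are made up):

      ```python
      import time

      TIMEOUT_S   = 0.05    # assumption: aggressive per-request network timeout
      RETRY_AFTER = 10.0    # assumption: how long to treat a host as down before re-probing

      marked_down_at = {}   # host -> time of the last observed failure

      def should_skip(host):
          """Skip hosts recently marked down instead of attempting to connect,
          but let a request through periodically to check if they recovered."""
          t = marked_down_at.get(host)
          return t is not None and (time.monotonic() - t) < RETRY_AFTER

      def record_failure(host):
          marked_down_at[host] = time.monotonic()

      def record_success(host):
          marked_down_at.pop(host, None)
      ```
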
  • In reference to Dynamo:

    “TAO accepts lower write availability than Dynamo in exchange for avoiding the programming complexities that arise from multi-master conflict resolution.”