Bigtable: A Distributed Storage System for Structured Data - Google, 2006

Since I actually read a lot about Cassandra and other database systems before looking into Bigtable, I had to force myself to realize that some of those systems drew inspiration from Bigtable. Not the other way around.

I used Bigtable when I was an intern at Google (and was lucky enough to meet some of the current team). However, I hadn’t read the paper at that time. I wish I had.

As an aside, I think this paper has a great topological sort of information.

  • Bigtable treats data as uninterpreted strings
    • Though there is some support for Protocol Buffers in debug output
  • Simplified the problem:
    1. Do not have multiple copies of the same data
      • GFS does replication
      • At the time of writing there were no views or secondary indices
    2. The user dictates whether data stays on disk or in memory
    3. No complex queries (they don’t need to optimize them)
  • Data Model and Organization
    • (row:string, column:string, time:int64) -> string (a toy model of this map is sketched after this section)
    • Columns can be filtered with a regular expression
    • Column keys are grouped into sets of Column Families
      • Column Families are stored in their own SSTable
      • They are compressed together, which yields a higher compression ratio
      • Has similar benefits to traditional column stores
        • Does not need to load data for unrequested column families
        • I believe Cassandra at one point required that the entire row fit in main memory.
      • Access control is at the Column Family level
        • I found this surprising but neat
    • Reads/Writes under the same row key are atomic
    • Rows are sorted in lexicographical order
      • A client could choose to hash its row keys if it wanted random placement
      • Sorted rows allow for cheap scans
    • A range of rows is stored in a tablet
      • At most one tablet server is the owner of a tablet at a time
    • GC for cell versions (timestamps) can happen automatically
      • Either number of versions or age
    • Cells can be used as atomic integer counters
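
To keep the data model straight in my head, here’s a toy sketch of the map described above: rows kept in lexicographic order, column keys written as family:qualifier, and multiple timestamped versions per cell with GC by version count. The class and method names are mine, not Bigtable’s API; the row and column values come from the paper’s Webtable example.

```python
import time
from bisect import insort
from collections import defaultdict

class TinyTable:
    """Toy model of Bigtable's map:
    (row:string, column:string, time:int64) -> string.
    Rows are kept in lexicographic order; each cell keeps multiple
    timestamped versions, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = {}             # row key -> {column key -> [(ts, value), ...]}
        self.sorted_row_keys = []  # row keys in lexicographic order

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else int(time.time() * 1e6)
        if row not in self.rows:
            self.rows[row] = defaultdict(list)
            insort(self.sorted_row_keys, row)
        versions = self.rows[row][column]
        versions.append((ts, value))
        versions.sort(reverse=True)        # newest version first
        del versions[self.max_versions:]   # GC by number of versions

    def get(self, row, column):
        versions = self.rows.get(row, {}).get(column, [])
        return versions[0][1] if versions else None

    def scan(self, start_row, end_row):
        """Range scans are cheap because rows are stored in sorted order."""
        for key in self.sorted_row_keys:
            if start_row <= key < end_row:
                yield key, {col: vs[0][1] for col, vs in self.rows[key].items()}

t = TinyTable()
t.put("com.cnn.www", "anchor:cnnsi.com", "CNN")
t.put("com.cnn.www", "contents:", "<html>...")
print(t.get("com.cnn.www", "anchor:cnnsi.com"))   # -> CNN
```
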
  • User supplied scripts can be run on the tablet servers
    • Written in Sawzall
  • MapReduce adapters for Bigtable
  • Tablets
    • Tablets and logs stored on GFS
    • Tablet splits are initiated by the tablet server
      • Commits the split by adding the new tablet to the METADATA table
      • Then it notifies the master
      • Reads and writes can continue while tablets are split and merged
        • (Presumably with the log / memtable)
    • Based on a Log Structured Merge Tree (the write/read path is sketched after this section)
      • Memtable gets all the writes
        • Writes are placed in the commit log first, of course
        • Periodically transformed into an SSTable and written to disk
        • They use copy-on-write semantics to allow parallel reads and writes
      • Tombstone records are used for deleted data
        • These are GCed at a later time (called a Major Compaction)
      • Bloom filters are used to limit the number of seeks in service of a request for a nonexistent key
        • Otherwise would need to hit every SSTable for the tablet
      • Every so often there is a merging compaction that combines the memtable and some of the SSTables into a single SSTable
    • The METADATA table lists all of the SSTables for a tablet as well as pointers into the commit logs that could contain data not yet in the SSTables (called redo points)
      • The logs could contain data that was written to a memtable but not persisted to an SSTable before the server died
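
Here’s a rough sketch of that write/read path, under my own naming: every write goes to the commit log and then the memtable, the memtable is periodically frozen into an immutable SSTable (a minor compaction), deletes are tombstones, reads check the memtable and then the SSTables newest-first (with a Bloom filter skipping SSTables that cannot contain the key), and a major compaction merges everything and drops tombstones. Real SSTables are sorted, compressed files on GFS, not Python dicts.

```python
class TinyTabletStore:
    """Toy LSM-style tablet store: commit log -> memtable -> immutable SSTables."""

    TOMBSTONE = object()   # marker for deleted cells, removed by a major compaction

    def __init__(self, flush_threshold=4):
        self.commit_log = []        # stand-in for the commit log on GFS
        self.memtable = {}          # in-memory buffer that takes all writes
        self.sstables = []          # list of (bloom, data): immutable "files", oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))       # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._minor_compaction()

    def delete(self, key):
        self.write(key, self.TOMBSTONE)            # deletes are just tombstone writes

    def _minor_compaction(self):
        """Freeze the memtable into a new immutable SSTable."""
        data = dict(sorted(self.memtable.items()))
        self.sstables.append((set(data), data))    # the set stands in for a Bloom filter
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is self.TOMBSTONE else value
        for bloom, data in reversed(self.sstables):    # newest SSTable first
            if key not in bloom:                       # Bloom filter: skip the disk seek
                continue
            if key in data:
                value = data[key]
                return None if value is self.TOMBSTONE else value
        return None

    def major_compaction(self):
        """Merge memtable + all SSTables into one SSTable, dropping tombstones."""
        merged = {}
        for _, data in self.sstables:                  # oldest to newest
            merged.update(data)
        merged.update(self.memtable)
        merged = {k: v for k, v in sorted(merged.items()) if v is not self.TOMBSTONE}
        self.sstables = [(set(merged), merged)]
        self.memtable = {}

store = TinyTabletStore()
store.write("row1:contents:", "<html>...")
store.delete("row1:contents:")
print(store.read("row1:contents:"))   # -> None (the tombstone is only removed by a major compaction)
```
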
  • SSTable Format (concept described in the blog post)
    • Typical block size is 64KB, configurable
    • Compression is by block
      • SSTables are by column family so similar data will likely be in the same SSTable
        • They use the example of a collection of HTML pages from the same site with the same boilerplate HTML
      • Typically they use Bentley McIlroy’s scheme to compress “long common strings across a large window” and then Zippy (open sourced as Snappy)
    • SSTable index is stored at the end of the file
    • Index is loaded into memory when opened
      • Means one disk seek to read a value (a toy lookup is sketched after this section)
    • SSTable can be mapped into memory
    • Since SSTables are immutable:
      • No synchronization required to read SSTables
      • Removing deleted data is just a question of GC’ing old SSTables
      • Tablets can be split easily
        • Child tables can share the SSTables of the parent
        • Dealing with compactions in this situation is not discussed. My guess is that the shared SSTables will be filtered by the row range of each tablet, and that during a major compaction there is a reference count in the METADATA table so a shared SSTable will not be deleted while other tablets still reference it.
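
For the block index, my mental model is something like the sketch below: the index of block start keys lives in memory once the SSTable is opened, a binary search picks the one block that could hold the key, and reading that single block is the one disk seek. This is a guess at the shape, not the actual file format.

```python
from bisect import bisect_right

class TinySSTableReader:
    """Toy SSTable: sorted key/value pairs packed into fixed-size blocks,
    with an index of each block's first key kept in memory."""

    def __init__(self, sorted_items, block_size=2):
        # Pack the sorted (key, value) pairs into "blocks" of block_size entries.
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # The block index is loaded into memory when the SSTable is opened.
        self.index = [block[0][0] for block in self.blocks]

    def get(self, key):
        if not self.blocks:
            return None
        # Binary search the in-memory index for the one block that could hold the key.
        i = bisect_right(self.index, key) - 1
        if i < 0:
            return None
        # In a real SSTable this is the single disk seek: read (and decompress) one block.
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None

table = TinySSTableReader(sorted({"a": 1, "f": 2, "m": 3, "q": 4, "z": 5}.items()))
print(table.get("m"), table.get("nope"))   # -> 3 None
```
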
  • Chubby provides coordination required
    • One active Bigtable master
    • Directory of tablet servers
    • Stores schema and ACL
      • The ACL is checked for every read / write
      • This is almost always in the client’s Chubby cache
    • The location of the root tablet (the first METADATA tablet) is stored in Chubby (the lookup chain is sketched below)
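
The location scheme in the paper is a three-level hierarchy: a Chubby file holds the location of the root METADATA tablet, the root tablet holds the locations of the other METADATA tablets, and those hold the locations of user tablets. Below is a toy version of that client-side lookup; the data layout and names are mine (real METADATA row keys encode the table identifier and end row, and clients cache locations so most reads skip these hops).

```python
# Toy three-level lookup: Chubby file -> root METADATA tablet ->
# second-level METADATA tablet -> location of the user tablet.

def find_entry(entries, key):
    """entries is a list of (end_row, location) sorted by end_row;
    pick the first tablet whose end row covers the key."""
    for end_row, location in entries:
        if key <= end_row:
            return location
    raise KeyError(key)

def locate_tablet(chubby_file, metadata_tablets, metadata_row_key):
    root = chubby_file["root_tablet_location"]                     # level 1: Chubby
    second = find_entry(metadata_tablets[root], metadata_row_key)  # level 2: root tablet
    return find_entry(metadata_tablets[second], metadata_row_key)  # level 3: user tablet

# METADATA row keys encode (table id, end row); plain strings stand in here.
chubby_file = {"root_tablet_location": "root"}
metadata_tablets = {
    "root":  [("users:m", "meta1"), ("\xff", "meta2")],            # root METADATA tablet
    "meta1": [("users:g", "ts4/tablet-a"), ("users:m", "ts5/tablet-b")],
    "meta2": [("webtable:\xff", "ts6/tablet-c"), ("\xff", "ts7/tablet-d")],
}
print(locate_tablet(chubby_file, metadata_tablets, "users:alice"))       # -> ts4/tablet-a
print(locate_tablet(chubby_file, metadata_tablets, "webtable:com.cnn"))  # -> ts6/tablet-c
```
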
  • Master
    • Clients do not need to talk to the master; they can go straight to Chubby for tablet locations
    • If the master detects a tablet is unassigned, the master instructs a tablet server to load the tablet
    • Master kills itself if its Chubby session expires
  • Caches
    1. Scan Cache: high level cache for key-value pairs (Temporal Locality)
    2. Block Cache: for the SSTable blocks read from GFS (Spatial Locality); a sketch of both levels follows
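
Roughly how I picture the two levels composing on a read (my own structure, not the actual read path): the Scan Cache short-circuits repeated reads of the same cell, and the Block Cache means a read of a nearby key in the same SSTable block avoids another GFS read.

```python
class TinyCachedReader:
    """Toy two-level read cache on a tablet server: Scan Cache over Block Cache."""

    def __init__(self, locate_block, read_block_from_gfs):
        self.locate_block = locate_block                # key -> block id (via the SSTable index)
        self.read_block_from_gfs = read_block_from_gfs  # block id -> {key: value}
        self.scan_cache = {}    # key -> value          (temporal locality)
        self.block_cache = {}   # block id -> block     (spatial locality)

    def read(self, key):
        if key in self.scan_cache:                # Scan Cache: repeated reads of the same cell
            return self.scan_cache[key]
        block_id = self.locate_block(key)
        block = self.block_cache.get(block_id)
        if block is None:                         # Block Cache: nearby keys share a block
            block = self.read_block_from_gfs(block_id)
            self.block_cache[block_id] = block
        value = block.get(key)
        if value is not None:
            self.scan_cache[key] = value
        return value

blocks = {0: {"a": "1", "b": "2"}, 1: {"c": "3"}}
reader = TinyCachedReader(lambda k: 0 if k < "c" else 1, blocks.__getitem__)
print(reader.read("a"), reader.read("b"))   # the second read hits the block cache
```
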
  • Commit Log
    • They don’t want a commit log for each tablet
      • because there would be too many files being written to concurrently
      • I was surprised this was an issue
    • One commit log per tablet server
      • Mutations for different tablets will appear in the same log
      • If a tablet server fails and the other servers split its tablets among themselves, each of them would end up reading the entire log
      • Instead, the master gets a bunch of servers to sort the commit log by tablet
        • The log is split into 64MB segments (the GFS block size), each sorted in parallel
      • Now that the commit log is sorted, each server only needs to read the portions of the log that pertain to the tablets it is loading (sketched below)
    • Each tablet server has two log-writing threads, each writing to its own log file; only one is active at a time
      • If writes to the active log file are slow, the server can switch to the other thread
      • Each mutation gets a sequence number so recovery can skip entries duplicated by the log switching
      • I thought this was a cool idea!
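
A sketch of that recovery trick with made-up entry shapes: the single per-server log interleaves mutations from many tablets, so it is sorted by (tablet, sequence number) before recovery; each new owner then replays only its tablet’s now-contiguous run, and the sequence numbers let it skip entries duplicated by the dual-log switching.

```python
from operator import itemgetter

# A commit-log entry here is (tablet, sequence_number, mutation); the single
# per-server log interleaves entries from every tablet the dead server held.
raw_log = [
    ("tablet-b", 1, "put x=1"),
    ("tablet-a", 1, "put k=hi"),
    ("tablet-a", 2, "delete k"),
    ("tablet-b", 2, "put y=2"),
    ("tablet-a", 2, "delete k"),   # duplicate caused by switching between the two log files
]

# Recovery step 1: sort the log by (tablet, sequence number). Bigtable parallelizes
# this by handing out 64MB log segments to different tablet servers.
sorted_log = sorted(raw_log, key=itemgetter(0, 1))

# Recovery step 2: each new owner replays only its tablet's (now contiguous) entries,
# and the sequence numbers let it skip duplicates.
def replay(tablet, log):
    last_seq = None
    for _, seq, mutation in (entry for entry in log if entry[0] == tablet):
        if seq == last_seq:
            continue
        last_seq = seq
        yield mutation

print(list(replay("tablet-a", sorted_log)))   # -> ['put k=hi', 'delete k']
```
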