Since I had read a lot about Cassandra and other database systems before
looking into Bigtable, I had to keep reminding myself that those systems drew
inspiration from Bigtable. Not the other way around.
I used Bigtable when I was an intern at Google (and was lucky enough to meet
some of the current team). However, I hadn’t read the paper at that time. I
wish I had.
As an aside, I think this paper has a great topological sort of information.
- Bigtable treats data as uninterpreted strings
- Though there is some support for Protocol Buffers in debug output
- Simplified the problem:
- Do not have multiple copies of the same data
- GFS does replication
- At the time of writing there were no views or secondary indices
- The user dictates whether data stays on disk or in memory
- No complex queries (they don’t need to optimize them)
- Data Model and Organization (a toy sketch of the model follows this section)
(row:string, column:string, time:int64) -> string
- Columns can be filtered with a regular expression
- Column keys are grouped into sets of Column Families
- Column Families are stored in their own SSTable
- They are compressed together, higher compression ratio
- Has similar benefits to traditional column stores
- Does not need to load data for unrequested column families
- I believe Cassandra at one point required that the entire row fit
in main memory.
- Access control is at the Column Family level
- I found this surprising but neat
- Reads/Writes under the same row key are atomic
- Rows are sorted in lexicographical order
- A client could hash their row keys if they wanted random placement
- Sorted rows allows for cheap scans
- A range of rows is stored in a tablet
- At most one tablet server is the owner of a tablet at a time
- GC for cell versions (timestamps) can happen automatically
- Either by number of versions or by age
- Cells can be used as atomic integer counters
- User supplied scripts can be run on the tablet servers
- MapReduce adapters for Bigtable
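
To make the (row:string, column:string, time:int64) -> string model concrete, here is a toy in-memory sketch. The class and parameter names (BigtableSketch, max_versions) are mine, and it ignores tablets, SSTables, and durability entirely; it only shows versioned-cell lookup and the GC-by-number-of-versions policy mentioned above.

```python
import time
from collections import defaultdict

class BigtableSketch:
    """Toy (row, column, timestamp) -> value map; not the real implementation."""

    def __init__(self, max_versions=3):
        # row -> "family:qualifier" -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))
        self.max_versions = max_versions          # GC policy: keep N versions

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else int(time.time() * 1e6)
        cells = self.rows[row][column]
        cells.append((ts, value))
        cells.sort(reverse=True)                  # newest version first
        del cells[self.max_versions:]             # automatic GC by version count

    def get(self, row, column, timestamp=None):
        # Newest version at or before `timestamp` (newest overall if None)
        for ts, value in self.rows[row][column]:
            if timestamp is None or ts <= timestamp:
                return ts, value
        return None

t = BigtableSketch(max_versions=2)
t.put("com.cnn.www", "contents:", "<html>v1</html>", timestamp=1)
t.put("com.cnn.www", "contents:", "<html>v2</html>", timestamp=2)
print(t.get("com.cnn.www", "contents:"))          # -> (2, '<html>v2</html>')
```
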
- Tablets
- Tablets and logs stored on GFS
- Tablet splits are initiated by the tablet server
- Commits the split by adding the new tablet in the METADATA table
- Then it notifies the master
- Reads and writes can continue while tablets are split and merged
- (Presumably with the log / memtable)
- Based on a Log-Structured Merge-Tree (a rough sketch of this write path follows this section)
- Memtable gets all the writes
- Writes are placed in the commit log first, of course
- Periodically transformed into an SSTable and written to disk
- They use copy-on-write semantics to allow parallel reads and
writes
- Tombstone records are used for deleted data
- These are GCed at a later time (called a Major Compaction)
- Bloom filters are used to limit the number of seeks in service of a
request for a nonexistent key
- Otherwise would need to hit every SSTable for the tablet
- Every so often there is a merging compaction that combines the
memtable and some of the SSTables into a single SSTable
- The metadata tablet lists all of the SSTables for a tablet as well as
pointers to commit logs that could contain data not in the SSTables yet
(called redo points)
- The logs could contain data that was written to a memtable but not yet
flushed to an SSTable before the server died
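
Here is a minimal sketch of that write/read path, treating a tablet as just a memtable plus a stack of immutable SSTables. The names (Tablet, flush, major_compaction, TOMBSTONE) are my own shorthand; the real system writes the log and SSTables to GFS and consults Bloom filters before seeking into an SSTable.

```python
TOMBSTONE = object()   # deletion marker; only dropped at a major compaction

class Tablet:
    """Toy memtable + immutable SSTables; GFS and Bloom filters are elided."""

    def __init__(self):
        self.commit_log = []    # in reality a shared, per-server log on GFS
        self.memtable = {}      # in-memory buffer that takes all writes
        self.sstables = []      # immutable sorted maps, newest first

    def write(self, key, value):
        self.commit_log.append((key, value))    # commit log first, of course
        self.memtable[key] = value              # then the memtable

    def delete(self, key):
        self.write(key, TOMBSTONE)              # deletes are just writes

    def flush(self):
        # Minor compaction: freeze the memtable into an immutable SSTable
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Merged view: memtable shadows SSTables, newer SSTables shadow older;
        # the real system checks a Bloom filter before touching an SSTable
        for source in [self.memtable] + self.sstables:
            if key in source:
                value = source[key]
                return None if value is TOMBSTONE else value
        return None

    def major_compaction(self):
        # Rewrite everything into one SSTable and drop tombstones for good
        merged = {}
        for source in list(reversed(self.sstables)) + [self.memtable]:
            merged.update(source)               # oldest first, newest wins
        merged = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
        self.memtable, self.sstables = {}, [merged]

t = Tablet()
t.write("a", "1"); t.flush(); t.delete("a"); t.major_compaction()
print(t.read("a"))   # -> None, and the tombstone itself is gone too
```
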
- SSTable Format (concept described in the blog post; a rough file-layout sketch follows this section)
- Typical block size is 64KB, configurable
- Compression is by block
- SSTables are by column family so similar data will likely be in the
same SSTable
- They use the example of a collection of HTML pages from the same
site with the same boilerplate HTML
- Typically they use
Bentley McIlroy’s scheme
to compress “long common strings across a large window” and then
Zippy (open sourced as Snappy)
- SSTable index is stored at the end of the file
- Index is loaded into memory when opened
- Means 1 seek to read a value
- SSTable can be mapped into memory
- Since SSTables are immutable:
- No synchronization required to read SSTables
- Removing deleted data is just a question of GC’ing old SSTables
- Tables can be split easily
- Child tables can share the SSTables of the parent
- Dealing with compactions in this situation is not discussed. My
guess is that they will be filtered by the row range of the
tablet. During a major compaction I assume there is a reference
count in the metadata tablet and it will not be deleted if there
are other references to it.
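
A rough sketch of that layout: data blocks, then a key-to-offset index, then a small footer recording where the index starts, so a reader loads the index into memory once and does a single seek per lookup. The JSON-encoded blocks, the footer format, and the function names are my own stand-ins, and the real format also compresses each block (Bentley-McIlroy pass, then Snappy).

```python
import json, struct

BLOCK_SIZE = 64 * 1024        # "typical block size is 64KB, configurable"

def write_sstable(path, sorted_items):
    """sorted_items: (key, value) string pairs already sorted by key."""
    index = []                # (first key in block, byte offset of block)
    with open(path, "wb") as f:
        block, block_len = [], 0

        def flush_block():
            nonlocal block, block_len
            if block:
                index.append((block[0][0], f.tell()))
                f.write(json.dumps(block).encode())
                block, block_len = [], 0

        for key, value in sorted_items:
            block.append([key, value])
            block_len += len(key) + len(value)     # rough size estimate
            if block_len >= BLOCK_SIZE:            # real blocks also get compressed here
                flush_block()
        flush_block()

        index_offset = f.tell()
        f.write(json.dumps(index).encode())
        f.write(struct.pack("<Q", index_offset))   # fixed-size footer

def read_value(path, key):
    with open(path, "rb") as f:
        f.seek(-8, 2)                              # footer: where the index lives
        (index_offset,) = struct.unpack("<Q", f.read(8))
        f.seek(index_offset)
        index = json.loads(f.read()[:-8])          # load the index into memory
        if not index:
            return None
        # Last block whose first key <= key, then exactly one seek for the data
        i = max((j for j, (first, _) in enumerate(index) if first <= key), default=0)
        start = index[i][1]
        end = index[i + 1][1] if i + 1 < len(index) else index_offset
        f.seek(start)
        return dict(json.loads(f.read(end - start))).get(key)

write_sstable("/tmp/example.sst", [("apple", "1"), ("banana", "2"), ("cherry", "3")])
print(read_value("/tmp/example.sst", "banana"))    # -> 2
```
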
- Chubby provides coordination required
- One active Bigtable master
- Directory of tablet servers
- Stores schema and ACL
- The ACL is checked for every read / write
- This is almost always in the client’s Chubby cache
- Root tablet (metadata tablet) is stored in Chubby
- Master
- Clients do not need to talk to the master; they can go straight to Chubby
- If the master detects a tablet is unassigned, the master instructs a
tablet server to load the tablet
- Master kills itself if its Chubby session expires
- Caches (a small read-path sketch follows this section)
- Scan Cache: high level cache for key-value pairs (Temporal Locality)
- Block Cache: for the SSTable blocks read from GFS (Spatial Locality)
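
A tiny sketch of how those two caches might layer on a read, purely to illustrate the temporal-versus-spatial split. The class, the locate_block/parse_block/gfs_read_block callbacks, and the unbounded dicts are all hypothetical; the real caches are bounded and live inside the tablet server.

```python
class TabletServerCaches:
    """Hypothetical two-level read cache: scan cache over key-value pairs,
    block cache over raw SSTable blocks fetched from GFS."""

    def __init__(self, gfs_read_block, locate_block, parse_block):
        self.scan_cache = {}       # key -> value            (temporal locality)
        self.block_cache = {}      # block id -> block bytes (spatial locality)
        self.gfs_read_block = gfs_read_block   # block id -> bytes, via GFS
        self.locate_block = locate_block       # key -> block id, via SSTable index
        self.parse_block = parse_block         # bytes -> {key: value}

    def read(self, key):
        # 1. Scan cache hit: this exact key-value pair was returned recently
        if key in self.scan_cache:
            return self.scan_cache[key]
        # 2. Block cache hit: a nearby key already pulled in this block
        block_id = self.locate_block(key)
        if block_id not in self.block_cache:
            self.block_cache[block_id] = self.gfs_read_block(block_id)
        value = self.parse_block(self.block_cache[block_id]).get(key)
        self.scan_cache[key] = value
        return value
```
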
- Commit Log
- They don’t want a commit log for each tablet
- because there would be too many files being written to concurrently
- I was surprised this was an issue
- One commit log per tablet server
- Mutations for different tablets will appear in the same log
- If a tablet server fails and the other nodes split its tablets among
themselves, each of them would end up reading the entire log to pick out
the mutations for its new tablets
- Instead, the master gets a group of servers to sort the commit log by
tablet (see the recovery sketch at the end of these notes)
- The log is partitioned into 64MB segments (the GFS block size) that are sorted in parallel
- Now that the commit log is sorted, each server only needs to read the
portions of the log that pertain to the tablet they are loading
- Each tablet server has two log-writing threads, each writing to its own log file
- If writes to the active log file are slow, the server switches to the other thread
- Log entries carry sequence numbers so the recovery process can skip duplicated entries
- I thought this was a cool idea!
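
A sketch of that recovery path, with my own field names: one log holds interleaved mutations for many tablets, so it gets sorted by (tablet, sequence number) first, each new owner replays only its tablets' slice, and the sequence numbers let it elide entries duplicated by the two log-writing threads.

```python
from itertools import groupby
from operator import itemgetter

def recover(log_entries, tablets_i_own):
    """log_entries: (tablet, seq_no, mutation) tuples from a dead server's
    commit log, possibly with duplicates from the two log files."""
    # The master has helper servers sort the log by (tablet, seq_no), in 64MB
    # segments, so this pass is a cheap sequential read per tablet.
    sorted_log = sorted(log_entries, key=itemgetter(0, 1))
    recovered = {}
    for tablet, entries in groupby(sorted_log, key=itemgetter(0)):
        if tablet not in tablets_i_own:
            continue                       # another server replays this slice
        last_seq, mutations = None, []
        for _, seq_no, mutation in entries:
            if seq_no == last_seq:
                continue                   # duplicated entry, elide it
            last_seq = seq_no
            mutations.append(mutation)
        recovered[tablet] = mutations
    return recovered

print(recover(
    [("t1", 1, "put a"), ("t1", 1, "put a"), ("t1", 2, "del b"), ("t2", 1, "put c")],
    tablets_i_own={"t1"},
))  # -> {'t1': ['put a', 'del b']}
```
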