Note: Krishna was my manager at Twitter and he set me up with some time with Michael to talk about Earlybird and search in general at Twitter. I wish I had kept my notes from that discussion.

• Earlybird is the system that holds the near real-time index of tweets
• Goals:
• “Low-latency, high throughput query evaluation”
• “High ingestion rate and immediate data availability”
• Needs to deal with spikes
• Tweets need to be searchable “on the order of a few seconds”
• Historically search indices have been built in a batch operation
• Some systems use an atomic swap between new and old versions of the index
• I note that some search systems actually bring up a whole new replica set with the new data
• One of the most interesting ideas in this paper is the lock-free algorithm for single-writer, multiple-reader inverted indexes
• “Dominance of temporal signal”
• In existing search systems, age of the document isn’t very important
• For Tweets, the authors claim that time is likely the most important signal
• This guides the design to be efficient for searching in reverse chronological order
• Overview
• Earlybird is built on top of Lucene
• Earlybird is not the only source of tweets; Blender is the system that mixes and re-ranks results
• It’s a pretty neat system that allows developers to compose existing search services into workflows without starting from scratch each time
• Tweets are hash partitioned across Earlybird shards (see the toy sketch after this list)
• “End-to-end, we typically observe a 10 second indexing latency”
• 50 ms query latency
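• A toy sketch of what hash partitioning tweets across shards could look like; the actual partitioning function isn't given in the paper (or my notes), so the id-mixing and modulo scheme below are purely illustrative:

```java
// Toy sketch of hash partitioning tweets across Earlybird shards.
// The mixing constant and modulo scheme are assumptions for illustration;
// the paper does not specify the actual partitioning function.
public final class TweetPartitioner {
    private final int numShards;

    public TweetPartitioner(int numShards) {
        this.numShards = numShards;
    }

    /** Returns the shard that indexes (and serves) this tweet. */
    public int shardFor(long tweetId) {
        long mixed = tweetId * 0x9E3779B97F4A7C15L; // spread sequential ids (assumption)
        return (int) Long.remainderUnsigned(mixed, numShards);
    }
}
```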
• Index Organization
• New tweets go into an in-memory structure optimized for single-writer, multiple-reader
• Dictionary
• The term dictionary uses open addressing on primitive Java arrays (a minimal sketch follows this list)
• This avoids allocating per-term objects and alleviates GC pressure
• Each term receives a monotonically increasing term ID
• Term data is held in parallel arrays
• # of postings for the term
• pointer to the tail of the postings list
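• A minimal sketch of a dictionary along these lines: open addressing over primitive arrays, monotonically increasing term ids, and parallel arrays for per-term data. The hashing, probing, and all names here are my assumptions, not Earlybird's actual code:

```java
import java.util.Arrays;

/**
 * Minimal sketch of a term dictionary in the spirit described above:
 * open addressing over primitive arrays (few objects, little GC pressure),
 * monotonically increasing term ids, and parallel arrays for per-term data.
 * Hashing, linear probing, and all names here are illustrative assumptions.
 */
public final class TermDictionarySketch {
    private final int[] slotToTermId;   // open-addressed table: slot -> term id (-1 = empty)
    private final String[] slotToTerm;  // slot -> term text (simplification; not a primitive array)
    private final int[] numPostings;    // parallel array: term id -> number of postings
    private final int[] postingsTail;   // parallel array: term id -> pointer to postings-list tail
    private int nextTermId = 0;

    public TermDictionarySketch(int capacity) {
        slotToTermId = new int[capacity];
        Arrays.fill(slotToTermId, -1);
        slotToTerm = new String[capacity];
        numPostings = new int[capacity];
        postingsTail = new int[capacity];
        Arrays.fill(postingsTail, -1);
    }

    /** Returns the term's id, assigning the next monotonically increasing id if unseen. */
    public int lookupOrAdd(String term) {
        int slot = (term.hashCode() & 0x7FFFFFFF) % slotToTermId.length;
        while (slotToTermId[slot] != -1) {              // linear probing; no resizing in this sketch
            if (slotToTerm[slot].equals(term)) {
                return slotToTermId[slot];
            }
            slot = (slot + 1) % slotToTermId.length;
        }
        slotToTerm[slot] = term;
        slotToTermId[slot] = nextTermId++;
        return slotToTermId[slot];
    }

    /** Called by the single writer after appending a posting for this term. */
    public void recordPosting(int termId, int newTailPointer) {
        numPostings[termId]++;
        postingsTail[termId] = newTailPointer;
    }

    public int postingsCount(int termId) { return numPostings[termId]; }
    public int tailPointer(int termId)   { return postingsTail[termId]; }
}
```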
• Active Index
• Each posting is a 32-bit integer
• 24 bits for document-id
• 8 bits for the term position
• Tweets are at most 140 characters, so positions fit in 8 bits (see the posting-packing sketch after this list)
• Stored in an array, cache friendly
• Postings lists vary in size
• Zipfian distribution
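• Since each posting is a single 32-bit int, encoding and decoding is just a couple of bit operations. A minimal sketch; only the 24/8 split comes from the paper, and which field sits in the high bits is my assumption:

```java
/**
 * Minimal sketch of packing a posting into one 32-bit int: 24 bits of
 * document id plus 8 bits of term position (tweets are at most 140 chars,
 * so positions fit in a byte). Putting the doc id in the high bits is an
 * assumption; the paper only gives the 24/8 split.
 */
final class PostingCodec {
    static int encode(int docId, int position) {
        return (docId << 8) | (position & 0xFF); // docId assumed to fit in 24 bits
    }

    static int docId(int posting)    { return posting >>> 8; }
    static int position(int posting) { return posting & 0xFF; }
}
```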
• Memory Setup
• Pools are used to hold postings
• Each pool is 2^12 elements
• In each pool, they allocate a slice at a time
• Slice sizes: 2^1, 2^4, 2^7, 2^11 elements (each element a 32-bit int)
• The first time a term is encountered, a postings list is created in the first pool (i.e. holds 2 elements)
• Once the slice runs out of space, a slice is allocated in the next level pool
• Each slice (except in the 2^1 pool) reserves its first 32 bits for a pointer back to the previous slice
• The pointer packs three fields into 32 bits (see the pointer-packing sketch after this list):
• 2 bits to address the pool
• 19-29 bits for the slice index
• 1-11 bits for the offset within the slice, depending on the pool
• In the 2^11 pool, postings lists can be allocated multiple slices
• No array copies needed
• No GC
• Dictionary has a pointer to the current tail of the postings list for a term
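• A minimal sketch of how such a 32-bit pool/slice/offset pointer could be packed and unpacked, using the slice sizes above. The bit widths come from the notes; the field ordering inside the int is my assumption:

```java
/**
 * Minimal sketch of the 32-bit addressing scheme described above: 2 bits
 * select one of four pools (slice sizes 2^1, 2^4, 2^7, 2^11 ints), the low
 * bits give the offset within a slice, and the remaining bits the slice
 * index within the pool. Field ordering inside the int is an assumption.
 */
final class SlicePointer {
    // Offset bits needed per pool = log2 of that pool's slice size.
    private static final int[] OFFSET_BITS = {1, 4, 7, 11};

    static int encode(int pool, int sliceIndex, int offset) {
        return (pool << 30) | (sliceIndex << OFFSET_BITS[pool]) | offset;
    }

    static int pool(int ptr)       { return ptr >>> 30; }
    static int offset(int ptr)     { return ptr & ((1 << OFFSET_BITS[pool(ptr)]) - 1); }
    static int sliceIndex(int ptr) { return (ptr & 0x3FFFFFFF) >>> OFFSET_BITS[pool(ptr)]; }
}
```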
• Optimized Index
• When the active index segment stops accepting writes, they create a read-optimized index
• They know the exact size required for each postings list
• Postings lists can then be laid out end-to-end as large array of ints
• Each dictionary entry is then just a pointer into that array (see the layout sketch after this list)
• They define two types of postings lists
• Short (fewer than 1,000 postings)
• Stored as before but reverse-chronological
• Long
• Block-based compression akin to PForDelta
• Better than variable-length encoding techniques, which suffer from many branch mispredictions
• Just need to apply a template of bit ops
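• A minimal sketch of the read-optimized layout for short postings lists: all lists concatenated into one large int array, with the dictionary reduced to an offset and a length per term. The block compression for long lists is omitted, and the names and traversal API are my assumptions:

```java
/**
 * Minimal sketch of the read-optimized layout for short postings lists:
 * every list laid out end-to-end in one big int[], with the dictionary
 * keeping just an offset and a length per term id. Long, block-compressed
 * lists are omitted; names and the traversal API are assumptions.
 */
final class OptimizedShortPostings {
    private final int[] postings;    // all short postings lists, concatenated
    private final int[] listOffset;  // term id -> start offset into postings[]
    private final int[] listLength;  // term id -> number of postings

    OptimizedShortPostings(int[] postings, int[] listOffset, int[] listLength) {
        this.postings = postings;
        this.listOffset = listOffset;
        this.listLength = listLength;
    }

    interface PostingVisitor { void accept(int docId, int position); }

    /** Short lists are stored newest-first, so a plain forward scan already
     *  yields reverse chronological order. */
    void visitNewestFirst(int termId, PostingVisitor visitor) {
        int start = listOffset[termId];
        int end = start + listLength[termId];
        for (int i = start; i < end; i++) {
            int p = postings[i];
            visitor.accept(p >>> 8, p & 0xFF); // same 24/8 packing as the active index
        }
    }
}
```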
• Concurrency
• Limit to a single writer per index segment for simplicity
• Protocol:
1. Write the tweet's postings to the postings lists for its terms
2. Update tail pointers
3. Increment maxDoc (a volatile field; in the JVM a volatile write implies a memory barrier)
• When traversing a postings list, readers ignore any documents with ids greater than maxDoc (sketched below)
• The authors note there are several subtle corner cases beyond the scope of the paper.
• I’m dying to know what they are!
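• A minimal sketch of that publication protocol, with a flat array standing in for the slice pools and my own names throughout. The volatile write/read pair on maxDoc is what makes steps 1-2 visible to readers for published documents:

```java
/**
 * Minimal sketch of the single-writer / multi-reader publication protocol
 * described above: the writer appends postings with plain writes and only
 * then bumps the volatile maxDoc; readers snapshot maxDoc once and skip any
 * document id beyond it. Names are mine, the flat array stands in for the
 * slice pools, and (as the authors note) the real system handles subtle
 * corner cases, e.g. racy tail-pointer reads, that this sketch ignores.
 */
final class ActiveSegmentSketch {
    private final int[] postings = new int[1 << 20]; // simplified flat posting storage
    private int tail = 0;                            // writer-only cursor (stand-in for tail pointers)
    private volatile int maxDoc = -1;                // publication point; volatile write = barrier

    /** Writer thread only. */
    void indexTweet(int docId, int[] termPositions) {
        for (int position : termPositions) {
            postings[tail++] = (docId << 8) | (position & 0xFF); // 1. write postings
        }
        // 2. (real index: update per-term tail pointers in the dictionary here)
        maxDoc = docId;                                          // 3. publish the document
    }

    /**
     * Reader threads: traverse a postings range, ignoring unpublished docs.
     * The range would normally come from the dictionary's tail pointer, a
     * read that itself races with the writer (one of the paper's corner cases).
     */
    int countVisible(int from, int to) {
        final int visibleUpTo = maxDoc;  // one volatile read per traversal
        int count = 0;
        for (int i = from; i < to; i++) {
            int docId = postings[i] >>> 8;
            if (docId <= visibleUpTo) {  // skip anything newer than maxDoc
                count++;
            }
        }
        return count;
    }
}
```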
• In Production
• Each machine has two quad-core processors and 76 GB of RAM
• 64 GB heap space
• An index segment with 16 million documents is about 6.7 GB