Note: Krishna was my manager at Twitter and he set me up with some time with Michael to talk about Earlybird and search in general at Twitter. I wish I had kept my notes from that discussion.

• Earlybird is the system that holds the near real-time index of tweets
• Goals:
• “Low-latency, high throughput query evaluation”
• “High ingestion rate and immediate data availability”
• Needs to deal with spikes
• Tweets need to be searchable “on the order of a few seconds”
• Historically search indices have been built in a batch operation
• Some systems use an atomic swap between new and old versions of the index
• I note that some search systems actually bring up a whole new replica set with the new data
• One of the most interesting ideas in this paper is the lock-free algorithm for single-writer, multiple-reader inverted indexes
• “Dominance of temporal signal”
• In existing search systems, age of the document isn’t very important
• For Tweets, the authors claim that time is likely the most important signal
• This guides the design to be efficient for searching in reverse chronological order
• Overview
• Earlybird is built on top of Lucene
• Earlybird is not the only source of tweets; Blender is the system that mixes and re-ranks results
• It’s a pretty neat system that allows developers to compose existing search services into workflows without starting from scratch each time
• Tweets are hash partitioned across Earlybird shards (see the toy sketch after this list)
• “End-to-end, we typically observe a 10 second indexing latency”
• 50 ms query latency
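• A toy sketch of what hash partitioning tweets across shards could look like; the actual partitioning function isn't given in the paper (or my notes), so the id-mixing and modulo scheme below are purely illustrative:

```java
// Toy sketch of hash partitioning tweets across Earlybird shards.
// The mixing constant and modulo scheme are assumptions for illustration;
// the paper does not specify the actual partitioning function.
public final class TweetPartitioner {
    private final int numShards;

    public TweetPartitioner(int numShards) {
        this.numShards = numShards;
    }

    /** Returns the shard that indexes (and serves) this tweet. */
    public int shardFor(long tweetId) {
        long mixed = tweetId * 0x9E3779B97F4A7C15L; // spread sequential ids (assumption)
        return (int) Long.remainderUnsigned(mixed, numShards);
    }
}
```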
• Index Organization
• New tweets go into an in-memory structure optimized for single-writer, multiple-reader
• Dictionary
• The term dictionary uses open addressing on primitive Java arrays (a minimal sketch follows this list)
• This avoids allocating per-term objects and alleviates GC pressure
• Each term receives a monotonically increasing term ID
• Term data is held in parallel arrays
• # of postings for the term
• pointer to the tail of the postings list
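• A minimal sketch of a dictionary along these lines: open addressing over primitive arrays, monotonically increasing term ids, and parallel arrays for per-term data. The hashing, probing, and all names here are my assumptions, not Earlybird's actual code:

```java
import java.util.Arrays;

/**
 * Minimal sketch of a term dictionary in the spirit described above:
 * open addressing over primitive arrays (few objects, little GC pressure),
 * monotonically increasing term ids, and parallel arrays for per-term data.
 * Hashing, linear probing, and all names here are illustrative assumptions.
 */
public final class TermDictionarySketch {
    private final int[] slotToTermId;   // open-addressed table: slot -> term id (-1 = empty)
    private final String[] slotToTerm;  // slot -> term text (simplification; not a primitive array)
    private final int[] numPostings;    // parallel array: term id -> number of postings
    private final int[] postingsTail;   // parallel array: term id -> pointer to postings-list tail
    private int nextTermId = 0;

    public TermDictionarySketch(int capacity) {
        slotToTermId = new int[capacity];
        Arrays.fill(slotToTermId, -1);
        slotToTerm = new String[capacity];
        numPostings = new int[capacity];
        postingsTail = new int[capacity];
        Arrays.fill(postingsTail, -1);
    }

    /** Returns the term's id, assigning the next monotonically increasing id if unseen. */
    public int lookupOrAdd(String term) {
        int slot = (term.hashCode() & 0x7FFFFFFF) % slotToTermId.length;
        while (slotToTermId[slot] != -1) {              // linear probing; no resizing in this sketch
            if (slotToTerm[slot].equals(term)) {
                return slotToTermId[slot];
            }
            slot = (slot + 1) % slotToTermId.length;
        }
        slotToTerm[slot] = term;
        slotToTermId[slot] = nextTermId++;
        return slotToTermId[slot];
    }

    /** Called by the single writer after appending a posting for this term. */
    public void recordPosting(int termId, int newTailPointer) {
        numPostings[termId]++;
        postingsTail[termId] = newTailPointer;
    }

    public int postingsCount(int termId) { return numPostings[termId]; }
    public int tailPointer(int termId)   { return postingsTail[termId]; }
}
```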
• Active Index
• Each posting is a 32-bit integer
• 24 bits for document-id
• 8 bits for the term position
• Tweets are at most 140 characters, so positions fit in 8 bits (see the posting-packing sketch after this list)
• Stored in an array, cache friendly
• Postings lists vary in size
• Zipfian distribution
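• Since each posting is a single 32-bit int, encoding and decoding is just a couple of bit operations. A minimal sketch; only the 24/8 split comes from the paper, and which field sits in the high bits is my assumption:

```java
/**
 * Minimal sketch of packing a posting into one 32-bit int: 24 bits of
 * document id plus 8 bits of term position (tweets are at most 140 chars,
 * so positions fit in a byte). Putting the doc id in the high bits is an
 * assumption; the paper only gives the 24/8 split.
 */
final class PostingCodec {
    static int encode(int docId, int position) {
        return (docId << 8) | (position & 0xFF); // docId assumed to fit in 24 bits
    }

    static int docId(int posting)    { return posting >>> 8; }
    static int position(int posting) { return posting & 0xFF; }
}
```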
• Memory Setup
• Pools are used to hold postings
• Each pool is 2^12 elements
• In each pool, they allocate a slice at a time
• Slice sizes: 2^1, 2^4, 2^7, 2^11 elements (each element a 32-bit int)
• The first time a term is encountered, a postings list is created in the first pool (i.e. holds 2 elements)
• Once the slice runs out of space, a slice is allocated in the next level pool
• Each slice (except in the 2^1 pool) reserves its first 32 bits for a pointer back to the previous slice
• The pointer packs three fields into 32 bits (see the pointer-packing sketch after this list):
• 2 bits to address the pool
• 19-29 bits for the slice index
• 1-11 bits for the offset within the slice, depending on the pool
• In the 2^11 pool, postings lists can be allocated multiple slices
• No array copies needed
• No GC
• Dictionary has a pointer to the current tail of the postings list for a term
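• A minimal sketch of how such a 32-bit pool/slice/offset pointer could be packed and unpacked, using the slice sizes above. The bit widths come from the notes; the field ordering inside the int is my assumption:

```java
/**
 * Minimal sketch of the 32-bit addressing scheme described above: 2 bits
 * select one of four pools (slice sizes 2^1, 2^4, 2^7, 2^11 ints), the low
 * bits give the offset within a slice, and the remaining bits the slice
 * index within the pool. Field ordering inside the int is an assumption.
 */
final class SlicePointer {
    // Offset bits needed per pool = log2 of that pool's slice size.
    private static final int[] OFFSET_BITS = {1, 4, 7, 11};

    static int encode(int pool, int sliceIndex, int offset) {
        return (pool << 30) | (sliceIndex << OFFSET_BITS[pool]) | offset;
    }

    static int pool(int ptr)       { return ptr >>> 30; }
    static int offset(int ptr)     { return ptr & ((1 << OFFSET_BITS[pool(ptr)]) - 1); }
    static int sliceIndex(int ptr) { return (ptr & 0x3FFFFFFF) >>> OFFSET_BITS[pool(ptr)]; }
}
```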
• Optimized Index
• When the active index segment stops accepting writes, they create a read-optimized index
• They know the exact size required for each postings list
• Postings lists can then be laid out end-to-end as large array of ints
• Each dictionary entry is then just a pointer into that array (see the layout sketch after this list)
• They define two types of postings lists
• Short (fewer than 1,000 postings)
• Stored as before but reverse-chronological
• Long
• Block-based compression akin to PForDelta
• Better than variable-length encoding techniques, which suffer from many branch mispredictions
• Just need to apply a template of bit ops
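• A minimal sketch of the read-optimized layout for short postings lists: all lists concatenated into one large int array, with the dictionary reduced to an offset and a length per term. The block compression for long lists is omitted, and the names and traversal API are my assumptions:

```java
/**
 * Minimal sketch of the read-optimized layout for short postings lists:
 * every list laid out end-to-end in one big int[], with the dictionary
 * keeping just an offset and a length per term id. Long, block-compressed
 * lists are omitted; names and the traversal API are assumptions.
 */
final class OptimizedShortPostings {
    private final int[] postings;    // all short postings lists, concatenated
    private final int[] listOffset;  // term id -> start offset into postings[]
    private final int[] listLength;  // term id -> number of postings

    OptimizedShortPostings(int[] postings, int[] listOffset, int[] listLength) {
        this.postings = postings;
        this.listOffset = listOffset;
        this.listLength = listLength;
    }

    interface PostingVisitor { void accept(int docId, int position); }

    /** Short lists are stored newest-first, so a plain forward scan already
     *  yields reverse chronological order. */
    void visitNewestFirst(int termId, PostingVisitor visitor) {
        int start = listOffset[termId];
        int end = start + listLength[termId];
        for (int i = start; i < end; i++) {
            int p = postings[i];
            visitor.accept(p >>> 8, p & 0xFF); // same 24/8 packing as the active index
        }
    }
}
```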
• Concurrency
• Limit to a single writer per index segment for simplicity
• Protocol:
1. Write the tweet's postings to the postings lists for its terms
2. Update tail pointers
3. Increment maxDoc (a volatile field; in the JVM a volatile write implies a memory barrier)
• When traversing a postings list, readers ignore any documents with ids greater than maxDoc (sketched below)
• The authors note there are several subtle corner cases beyond the scope of the paper.
• I’m dying to know what they are!
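• A minimal sketch of that publication protocol, with a flat array standing in for the slice pools and my own names throughout. The volatile write/read pair on maxDoc is what makes steps 1-2 visible to readers for published documents:

```java
/**
 * Minimal sketch of the single-writer / multi-reader publication protocol
 * described above: the writer appends postings with plain writes and only
 * then bumps the volatile maxDoc; readers snapshot maxDoc once and skip any
 * document id beyond it. Names are mine, the flat array stands in for the
 * slice pools, and (as the authors note) the real system handles subtle
 * corner cases, e.g. racy tail-pointer reads, that this sketch ignores.
 */
final class ActiveSegmentSketch {
    private final int[] postings = new int[1 << 20]; // simplified flat posting storage
    private int tail = 0;                            // writer-only cursor (stand-in for tail pointers)
    private volatile int maxDoc = -1;                // publication point; volatile write = barrier

    /** Writer thread only. */
    void indexTweet(int docId, int[] termPositions) {
        for (int position : termPositions) {
            postings[tail++] = (docId << 8) | (position & 0xFF); // 1. write postings
        }
        // 2. (real index: update per-term tail pointers in the dictionary here)
        maxDoc = docId;                                          // 3. publish the document
    }

    /**
     * Reader threads: traverse a postings range, ignoring unpublished docs.
     * The range would normally come from the dictionary's tail pointer, a
     * read that itself races with the writer (one of the paper's corner cases).
     */
    int countVisible(int from, int to) {
        final int visibleUpTo = maxDoc;  // one volatile read per traversal
        int count = 0;
        for (int i = from; i < to; i++) {
            int docId = postings[i] >>> 8;
            if (docId <= visibleUpTo) {  // skip anything newer than maxDoc
                count++;
            }
        }
        return count;
    }
}
```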
• In Production
• Each machine has two quad-core processors and 76 GB of RAM
• 64 GB heap space
• An index segment with 16 million documents is about 6.7 GB