Note: Krishna was my manager at Twitter and he set me up with some time with Michael to talk about Earlybird and search in general at Twitter. I wish I had kept my notes from that discussion.

• Earlybird is the system that holds the near real-time index of tweets
• Goals:
• “Low-latency, high throughput query evaluation”
• “High ingestion rate and immediate data availability”
• Needs to deal with spikes
• Tweets need to be searchable “on the order of a few seconds”
• Historically search indices have been built in a batch operation
• Some systems use an atomic swap between new and old versions of the index
• I note that some search systems actually bring up a whole new replica set with the new data
• One of the most interesting ideas in this paper is the lock-free algorithm for single-writer, multiple-reader inverted indexes
• “Dominance of temporal signal”
• In existing search systems, age of the document isn’t very important
• For Tweets, the authors claim that time is likely the most important signal
• This guides the design to be efficient for searching in reverse chronological order
• Overview
• Earlybird is built on top of Lucene
• Earlybird is not the only source of tweets; Blender is the system that mixes and re-ranks results
• It’s a pretty neat system that allows developers to compose existing search services into workflows without starting from scratch each time
• Tweets are hash partitioned across Earlybird shards (see the toy sketch after this list)
• “End-to-end, we typically observe a 10 second indexing latency”
• 50 ms query latency
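• A toy sketch of what hash partitioning tweets across shards could look like; the actual partitioning function isn't given in the paper (or my notes), so the id-mixing and modulo scheme below are purely illustrative:

```java
// Toy sketch of hash partitioning tweets across Earlybird shards.
// The mixing constant and modulo scheme are assumptions for illustration;
// the paper does not specify the actual partitioning function.
public final class TweetPartitioner {
    private final int numShards;

    public TweetPartitioner(int numShards) {
        this.numShards = numShards;
    }

    /** Returns the shard that indexes (and serves) this tweet. */
    public int shardFor(long tweetId) {
        long mixed = tweetId * 0x9E3779B97F4A7C15L; // spread sequential ids (assumption)
        return (int) Long.remainderUnsigned(mixed, numShards);
    }
}
```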
• Index Organization
• New tweets go into an in-memory structure optimized for single-writer, multiple-reader
• Dictionary
• The term dictionary uses open addressing on primitive Java arrays (a minimal sketch follows this list)
• This avoids allocating per-term objects and alleviates GC pressure
• Each term receives a monotonically increasing term ID
• Term data is held in parallel arrays
• # of postings for the term
• pointer to the tail of the postings list
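• A minimal sketch of a dictionary along these lines: open addressing over primitive arrays, monotonically increasing term ids, and parallel arrays for per-term data. The hashing, probing, and all names here are my assumptions, not Earlybird's actual code:

```java
import java.util.Arrays;

/**
 * Minimal sketch of a term dictionary in the spirit described above:
 * open addressing over primitive arrays (few objects, little GC pressure),
 * monotonically increasing term ids, and parallel arrays for per-term data.
 * Hashing, linear probing, and all names here are illustrative assumptions.
 */
public final class TermDictionarySketch {
    private final int[] slotToTermId;   // open-addressed table: slot -> term id (-1 = empty)
    private final String[] slotToTerm;  // slot -> term text (simplification; not a primitive array)
    private final int[] numPostings;    // parallel array: term id -> number of postings
    private final int[] postingsTail;   // parallel array: term id -> pointer to postings-list tail
    private int nextTermId = 0;

    public TermDictionarySketch(int capacity) {
        slotToTermId = new int[capacity];
        Arrays.fill(slotToTermId, -1);
        slotToTerm = new String[capacity];
        numPostings = new int[capacity];
        postingsTail = new int[capacity];
        Arrays.fill(postingsTail, -1);
    }

    /** Returns the term's id, assigning the next monotonically increasing id if unseen. */
    public int lookupOrAdd(String term) {
        int slot = (term.hashCode() & 0x7FFFFFFF) % slotToTermId.length;
        while (slotToTermId[slot] != -1) {              // linear probing; no resizing in this sketch
            if (slotToTerm[slot].equals(term)) {
                return slotToTermId[slot];
            }
            slot = (slot + 1) % slotToTermId.length;
        }
        slotToTerm[slot] = term;
        slotToTermId[slot] = nextTermId++;
        return slotToTermId[slot];
    }

    /** Called by the single writer after appending a posting for this term. */
    public void recordPosting(int termId, int newTailPointer) {
        numPostings[termId]++;
        postingsTail[termId] = newTailPointer;
    }

    public int postingsCount(int termId) { return numPostings[termId]; }
    public int tailPointer(int termId)   { return postingsTail[termId]; }
}
```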
• Active Index
• Each posting is a 32-bit integer
• 24 bits for document-id
• 8 bits for the term position
• Tweets are at most 140 characters, so positions fit in 8 bits (see the posting-packing sketch after this list)
• Stored in an array, cache friendly
• Postings lists vary in size
• Zipfian distribution
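• Since each posting is a single 32-bit int, encoding and decoding is just a couple of bit operations. A minimal sketch; only the 24/8 split comes from the paper, and which field sits in the high bits is my assumption:

```java
/**
 * Minimal sketch of packing a posting into one 32-bit int: 24 bits of
 * document id plus 8 bits of term position (tweets are at most 140 chars,
 * so positions fit in a byte). Putting the doc id in the high bits is an
 * assumption; the paper only gives the 24/8 split.
 */
final class PostingCodec {
    static int encode(int docId, int position) {
        return (docId << 8) | (position & 0xFF); // docId assumed to fit in 24 bits
    }

    static int docId(int posting)    { return posting >>> 8; }
    static int position(int posting) { return posting & 0xFF; }
}
```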
• Memory Setup
• Pools are used to hold postings
• Each pool is 2^12 elements
• In each pool, they allocate a slice at a time
• Slice sizes: 2^1, 2^4, 2^7, 2^11 elements (each element a 32-bit int)
• The first time a term is encountered, a postings list is created in the first pool (i.e. holds 2 elements)
• Once the slice runs out of space, a slice is allocated in the next level pool
• Each slice (except in the 2^1 pool) reserves its first 32 bits for a pointer back to the previous slice
• The pointer packs three fields into 32 bits (see the pointer-packing sketch after this list):
• 2 bits to address the pool
• 19-29 bits for the slice index
• 1-11 bits for the offset within the slice, depending on the pool
• In the 2^11 pool, postings lists can be allocated multiple slices
• No array copies needed
• No GC
• Dictionary has a pointer to the current tail of the postings list for a term
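• A minimal sketch of how such a 32-bit pool/slice/offset pointer could be packed and unpacked, using the slice sizes above. The bit widths come from the notes; the field ordering inside the int is my assumption:

```java
/**
 * Minimal sketch of the 32-bit addressing scheme described above: 2 bits
 * select one of four pools (slice sizes 2^1, 2^4, 2^7, 2^11 ints), the low
 * bits give the offset within a slice, and the remaining bits the slice
 * index within the pool. Field ordering inside the int is an assumption.
 */
final class SlicePointer {
    // Offset bits needed per pool = log2 of that pool's slice size.
    private static final int[] OFFSET_BITS = {1, 4, 7, 11};

    static int encode(int pool, int sliceIndex, int offset) {
        return (pool << 30) | (sliceIndex << OFFSET_BITS[pool]) | offset;
    }

    static int pool(int ptr)       { return ptr >>> 30; }
    static int offset(int ptr)     { return ptr & ((1 << OFFSET_BITS[pool(ptr)]) - 1); }
    static int sliceIndex(int ptr) { return (ptr & 0x3FFFFFFF) >>> OFFSET_BITS[pool(ptr)]; }
}
```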
• Optimized Index
• When the active index segment stops accepting writes, they create a read-optimized index
• They know the exact size required for each postings list
• Postings lists can then be laid out end-to-end as large array of ints
• Each dictionary entry is then just a pointer into that array (see the layout sketch after this list)
• They define two types of postings lists
• Short (fewer than 1,000 postings)
• Stored as before but reverse-chronological
• Long
• Block-based compression akin to PForDelta
• Better than variable-length encoding techniques, which suffer from many branch mispredictions
• Just need to apply a template of bit ops
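• A minimal sketch of the read-optimized layout for short postings lists: all lists concatenated into one large int array, with the dictionary reduced to an offset and a length per term. The block compression for long lists is omitted, and the names and traversal API are my assumptions:

```java
/**
 * Minimal sketch of the read-optimized layout for short postings lists:
 * every list laid out end-to-end in one big int[], with the dictionary
 * keeping just an offset and a length per term id. Long, block-compressed
 * lists are omitted; names and the traversal API are assumptions.
 */
final class OptimizedShortPostings {
    private final int[] postings;    // all short postings lists, concatenated
    private final int[] listOffset;  // term id -> start offset into postings[]
    private final int[] listLength;  // term id -> number of postings

    OptimizedShortPostings(int[] postings, int[] listOffset, int[] listLength) {
        this.postings = postings;
        this.listOffset = listOffset;
        this.listLength = listLength;
    }

    interface PostingVisitor { void accept(int docId, int position); }

    /** Short lists are stored newest-first, so a plain forward scan already
     *  yields reverse chronological order. */
    void visitNewestFirst(int termId, PostingVisitor visitor) {
        int start = listOffset[termId];
        int end = start + listLength[termId];
        for (int i = start; i < end; i++) {
            int p = postings[i];
            visitor.accept(p >>> 8, p & 0xFF); // same 24/8 packing as the active index
        }
    }
}
```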
• Concurrency
• Limit to a single writer per index segment for simplicity
• Protocol:
1. Write the tweet's postings to the postings lists for its terms
2. Update tail pointers
3. Increment maxDoc (a volatile field; in the JVM a volatile write implies a memory barrier)
• When traversing a postings list, readers ignore any documents with ids greater than maxDoc (sketched below)
• The authors note there are several subtle corner cases beyond the scope of the paper.
• I’m dying to know what they are!
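• A minimal sketch of that publication protocol, with a flat array standing in for the slice pools and my own names throughout. The volatile write/read pair on maxDoc is what makes steps 1-2 visible to readers for published documents:

```java
/**
 * Minimal sketch of the single-writer / multi-reader publication protocol
 * described above: the writer appends postings with plain writes and only
 * then bumps the volatile maxDoc; readers snapshot maxDoc once and skip any
 * document id beyond it. Names are mine, the flat array stands in for the
 * slice pools, and (as the authors note) the real system handles subtle
 * corner cases, e.g. racy tail-pointer reads, that this sketch ignores.
 */
final class ActiveSegmentSketch {
    private final int[] postings = new int[1 << 20]; // simplified flat posting storage
    private int tail = 0;                            // writer-only cursor (stand-in for tail pointers)
    private volatile int maxDoc = -1;                // publication point; volatile write = barrier

    /** Writer thread only. */
    void indexTweet(int docId, int[] termPositions) {
        for (int position : termPositions) {
            postings[tail++] = (docId << 8) | (position & 0xFF); // 1. write postings
        }
        // 2. (real index: update per-term tail pointers in the dictionary here)
        maxDoc = docId;                                          // 3. publish the document
    }

    /**
     * Reader threads: traverse a postings range, ignoring unpublished docs.
     * The range would normally come from the dictionary's tail pointer, a
     * read that itself races with the writer (one of the paper's corner cases).
     */
    int countVisible(int from, int to) {
        final int visibleUpTo = maxDoc;  // one volatile read per traversal
        int count = 0;
        for (int i = from; i < to; i++) {
            int docId = postings[i] >>> 8;
            if (docId <= visibleUpTo) {  // skip anything newer than maxDoc
                count++;
            }
        }
        return count;
    }
}
```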
• In Production
• Each machine has two quad-core processors and 76 GB of RAM
• 64 GB heap space
• An index segment with 16 million documents is about 6.7 GB