Dremel: Interactive Analysis of Web-Scale Datasets - Google, 2010

  • [Paper] [Mirror]

  • VLDB’10 [Slides]
  • Goal: Support fast ad-hoc queries for analysis
  • Noticed: A cluster with thousands of disks can deliver high aggregate throughput with acceptable latency
  • Major Points:
    1. Column Oriented Storage
      • They propose a columnar storage format for nested data that compactly stores records with diverse schemas defined in Protocol Buffers.
      • The SQL-like query language supports this nesting
      • Columnar storage lets a query read only the columns it references
    2. Serving Tree for distributed query execution
      • Like a distributed search engine
      • The query starts at the root and is transformed into smaller queries to be run on children
      • Each child further transforms the query for execution
      • The aggregate results bubble up
      • They use similar techniques to retry stragglers on new nodes and can return early with approximate results if configured
    3. SQL-like language
  • They can operate on data in place, with no separate loading step
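
The two main ideas in these notes (reading only the referenced columns, and pushing a query down a tree whose partial aggregates bubble back up) can be illustrated with a small sketch. This is not Dremel's implementation: the names `stripe`, `Leaf`, `Mixer`, and `execute` are hypothetical, the sample records are invented, and the striping is deliberately simplified. In particular it omits the repetition and definition levels the paper uses, so the original records cannot be reassembled from these columns.

```python
from dataclasses import dataclass


def stripe(records):
    """Shred nested dict records into one value list per field path.

    Simplified assumption: structure information is discarded, so this
    only supports column scans, not record reassembly.
    """
    columns = {}

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            for item in value:
                walk(item, path)
        else:
            columns.setdefault(path, []).append(value)

    for record in records:
        walk(record, "")
    return columns


@dataclass
class Leaf:
    """Hypothetical leaf server holding one column-striped tablet."""
    tablet: dict

    def execute(self, column):
        # Only the requested column is touched; other columns are never read.
        values = self.tablet.get(column, [])
        return sum(values), len(values)


@dataclass
class Mixer:
    """Hypothetical intermediate or root server that fans a query out."""
    children: list

    def execute(self, column):
        # Forward the query to each child and merge partial (sum, count) pairs.
        total = count = 0
        for child in self.children:
            child_sum, child_count = child.execute(column)
            total += child_sum
            count += child_count
        return total, count


# Invented sample data, split across two tablets.
docs_a = [
    {"doc_id": 1, "links": {"forward": [20, 40]}},
    {"doc_id": 2, "links": {"forward": [80]}},
]
docs_b = [{"doc_id": 3, "links": {"forward": [10]}}]

# A two-level tree: the root has one intermediate mixer and one direct leaf.
root = Mixer([Mixer([Leaf(stripe(docs_a))]), Leaf(stripe(docs_b))])
total, count = root.execute("links.forward")
average = total / count
```

Each level returns a partial `(sum, count)` pair rather than raw values, so the root can compute the average without ever seeing the underlying rows. Straggler retries and early approximate results are left out of this sketch.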