Dremel: Interactive Analysis of Web-Scale Datasets - Google, 2010

[Paper] [Mirror]
VLDB’10 [Slides]
Goal: Support fast ad-hoc queries for analysis
Noticed: A cluster with thousands of discs can have high throughput and OK latency
Major Points:
1. Column Oriented Storage
  - They propose a nested columnar storage which can compactly store diverse schemas in Protocol Buffers.
  - The SQL-like query language has support for this nesting
  - Columnar storage allows them to only access the columns relevant to the query
2. Serving Tree for distributed query execution
  - Like a distribute search engine
  - The query starts at the root and is transformed into smaller queries to be run on children
  - Each child further transforms the query for execution
  - The aggregate results bubble up
  - They use similar techniques to retry stragglers on new nodes and can return early with approximate results if configured
3. SQL-like language
They can operate on data in place