Stephen Holiday
Dremel: Interactive Analysis of Web-Scale Datasets
- Google, 2010
[Paper] [Mirror] VLDB’10 [Slides]
Goal: Support fast ad-hoc queries for analysis
Noticed: A cluster with thousands of disks can have high throughput and OK latency
Major Points:
Column Oriented Storage
They propose a nested columnar storage which can compactly store diverse schemas in Protocol Buffers.
The SQL-like query language has support for this nesting
Columnar storage allows them to only access the columns relevant to the query
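To make the column-access point concrete, here is a minimal sketch in Python. It assumes flat records rather than nested Protocol Buffers, so it is not the paper's nested encoding; the function names (to_columns, sum_column) and the sample fields are hypothetical. It only illustrates that an aggregate over one field needs to scan one column.

```python
from collections import defaultdict

def to_columns(records):
    """Pivot a list of flat dict records into one list per field."""
    columns = defaultdict(list)
    # Collect every field name seen in any record.
    fields = sorted({name for rec in records for name in rec})
    for rec in records:
        for name in fields:
            # None marks a field that is missing from this record.
            columns[name].append(rec.get(name))
    return dict(columns)

def sum_column(columns, name):
    """Aggregate one field by scanning only that column."""
    return sum(v for v in columns[name] if v is not None)

# Hypothetical sample data; the third record has no "url" field.
records = [
    {"doc_id": 10, "url": "http://A", "bytes": 120},
    {"doc_id": 20, "url": "http://B", "bytes": 300},
    {"doc_id": 30, "bytes": 80},
]
columns = to_columns(records)

# Only the "bytes" column is read; "doc_id" and "url" are never touched.
total_bytes = sum_column(columns, "bytes")
```

In a row-oriented layout the same sum would have to read every field of every record, which is the cost the columnar layout avoids.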
Serving Tree for distributed query execution
Like a distributed search engine
The query starts at the root and is transformed into smaller queries to be run on children
Each child further transforms the query for execution
The aggregate results bubble up
They use similar techniques to retry stragglers on new nodes and can return early with approximate results if configured
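The flow described above (query pushed down, partial aggregates bubbling up) can be sketched roughly as follows. This is an illustration under simplifying assumptions, not Dremel's implementation: the tree is an in-memory structure, the "query" is a single filter plus COUNT and SUM, and the names leaf_scan, merge, and execute are hypothetical. Straggler retries and early approximate results are left out.

```python
def leaf_scan(rows, predicate):
    """Leaf server: scan a local partition and return a partial (count, sum)."""
    matched = [r for r in rows if predicate(r)]
    return (len(matched), sum(matched))

def merge(partials):
    """Intermediate or root server: combine partial aggregates from children."""
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return (count, total)

def execute(node, predicate):
    """Push the query down the tree and bubble partial results back up.

    A node is either a list of rows (a leaf partition) or a dict of the
    form {"children": [...]} (an intermediate or root server).
    """
    if isinstance(node, dict):
        return merge([execute(child, predicate) for child in node["children"]])
    return leaf_scan(node, predicate)

# Hypothetical two-level tree with four leaf partitions.
tree = {"children": [
    {"children": [[1, 5, 9], [12, 3]]},
    {"children": [[7, 20], [4]]},
]}

# Roughly "SELECT COUNT(*), SUM(v) WHERE v > 4" over all partitions.
count, total = execute(tree, lambda v: v > 4)
```

COUNT and SUM work here because partial results can be combined by simple addition at every level; an aggregate without that property would need more than a tuple passed up the tree.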
SQL-like language
They can operate on data in place
Other notes about distributed computation
FlumeJava: Easy, Efficient Data-Parallel Pipelines [Google, 2010]
Hive: A Warehousing Solution Over a Map-Reduce Framework [Facebook, 2009]
MapReduce: Simplified Data Processing on Large Clusters [Google, 2004]
Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications [Google, 2010]
Tenzing: A SQL Implementation On The MapReduce Framework [Google, 2011]