RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems - Facebook, 2011

  • [Paper] [Mirror]

  • Conventional Data Placement (see the layout sketch after this list)
    • Row Store
      • Has to read every column of a record; unneeded columns cannot be skipped
      • Compression ratio is lower because values from different columns (different data domains) are stored together
    • Column Store
      • Expensive to reconstruct records
        • The columns could be on different machines
      • Can mitigate this by creating materialized views for column groups that are frequently accessed together
    • Hybrid PAX (Column store in each disk page)
      • Designed to improve CPU cache performance, not to reduce disk I/O
      • Still need to read the whole page from disk
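A minimal sketch (not from the paper) contrasting the three layouts on the same records; the records, types, and page size are made up for illustration.

```python
# Three hypothetical records with columns (id, name, price).
records = [(1, "alice", 10.0), (2, "bob", 20.0), (3, "carol", 30.0)]

# Row store: all column values of a record are stored together.
row_store = [value for record in records for value in record]

# Column store: each column is stored contiguously (possibly on different machines),
# so reconstructing a full record may require a join across locations.
column_store = [[record[i] for record in records] for i in range(3)]

# PAX-style hybrid: records are assigned to pages row-wise, but *within* a page
# the values are regrouped column by column.
ROWS_PER_PAGE = 2  # made-up page capacity
pax_pages = [
    [[record[i] for record in records[start:start + ROWS_PER_PAGE]] for i in range(3)]
    for start in range(0, len(records), ROWS_PER_PAGE)
]

print(row_store)     # [1, 'alice', 10.0, 2, 'bob', 20.0, 3, 'carol', 30.0]
print(column_store)  # [[1, 2, 3], ['alice', 'bob', 'carol'], [10.0, 20.0, 30.0]]
print(pax_pages)     # [[[1, 2], ['alice', 'bob'], [10.0, 20.0]], [[3], ['carol'], [30.0]]]
```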
  • RCFile takes after hybrid PAX, but applies the idea at a much larger granularity (multi-MB row groups instead of disk pages)
    • Also adds lazy decompression: columns that are not used are not decompressed until they are actually needed (sketch below)
      • Consider a scan with a WHERE clause: a column that is not part of the predicate only needs to be decompressed if the predicate matches at least one row in the row group
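A minimal sketch of the lazy-decompression idea, assuming a hypothetical in-memory row group where each column is an independently gzip-compressed blob; the column names, predicate, and helper functions are illustrative, not the paper's API.

```python
import gzip
import json

def compress_column(values):
    # Each column of a row group is compressed independently.
    return gzip.compress(json.dumps(values).encode())

def decompress_column(blob):
    return json.loads(gzip.decompress(blob).decode())

# Hypothetical row group: column name -> compressed column blob.
row_group = {
    "age":  compress_column([25, 31, 47, 19]),
    "name": compress_column(["ann", "bo", "cy", "di"]),
}

def scan(row_group, predicate_col, predicate, projected_col):
    # The predicate column must always be decompressed to evaluate the WHERE clause.
    pred_values = decompress_column(row_group[predicate_col])
    matching_rows = [i for i, v in enumerate(pred_values) if predicate(v)]
    if not matching_rows:
        # Lazy decompression pays off: nothing in this row group matches,
        # so the projected column is never decompressed at all.
        return []
    proj_values = decompress_column(row_group[projected_col])
    return [proj_values[i] for i in matching_rows]

# SELECT name FROM t WHERE age > 30, restricted to this one row group.
print(scan(row_group, "age", lambda v: v > 30, "name"))  # ['bo', 'cy']
```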
  • Each HDFS block contains a series of row groups
    • Each row group contains (see the serialization sketch after this list):
      • Sync marker
      • Metadata Header
        • Records the number of rows and the byte length of each column (and of each field), so readers can locate and skip columns
        • The header itself is compressed with run-length encoding
      • Compressed columns
        • Currently Gzip at a high compression level
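A simplified sketch of how one row group might be serialized, using made-up framing: the real RCFile header also records per-field lengths and is RLE-compressed, while this only shows the sync marker, header, compressed-columns ordering and how per-column byte lengths let a reader skip columns it does not need.

```python
import gzip
import json
import struct

SYNC_MARKER = b"\x00\x00\x00\x00RCSYNC!!"  # made-up 12-byte marker

def write_row_group(columns):
    """columns: one list of values per column, all the same length."""
    compressed = [gzip.compress(json.dumps(col).encode()) for col in columns]
    # Header: row count followed by the byte length of every compressed column.
    header = struct.pack(f">I{len(compressed)}I",
                         len(columns[0]), *[len(c) for c in compressed])
    return SYNC_MARKER + struct.pack(">I", len(header)) + header + b"".join(compressed)

def read_column(blob, col_index, num_cols):
    """Decompress a single column, seeking past the others via the header lengths."""
    offset = len(SYNC_MARKER)
    (header_len,) = struct.unpack_from(">I", blob, offset)
    offset += 4
    fields = struct.unpack_from(f">I{num_cols}I", blob, offset)
    col_lengths = fields[1:]
    offset += header_len
    offset += sum(col_lengths[:col_index])   # skip preceding columns untouched
    compressed = blob[offset:offset + col_lengths[col_index]]
    return json.loads(gzip.decompress(compressed).decode())

rg = write_row_group([[1, 2, 3], ["a", "b", "c"], [9.5, 8.5, 7.5]])
print(read_column(rg, 1, 3))  # ['a', 'b', 'c'] -- columns 0 and 2 stay compressed
```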
  • At some point, increasing the row group size provides diminishing compression returns
    • They use 4MB
    • A large row group size also makes lazy decompression less effective (see the back-of-the-envelope sketch below)
      • It becomes more likely that at least one row in the group satisfies the predicate, so the other columns have to be decompressed anyway
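A back-of-the-envelope illustration (not from the paper) of why bigger row groups weaken lazy decompression: if a predicate matches a fraction p of rows and rows are treated as independent (a simplifying assumption), the chance that a row group of n rows contains at least one match is 1 - (1 - p)^n, which grows quickly with n.

```python
# Probability that a row group contains at least one matching row, assuming each
# row independently satisfies the predicate with probability p (made-up numbers).
def prob_group_must_decompress(p, rows_per_group):
    return 1 - (1 - p) ** rows_per_group

p = 0.001  # hypothetical selectivity: 0.1% of rows match
for rows in (1_000, 10_000, 100_000):
    print(f"{rows:>7} rows/group -> {prob_group_must_decompress(p, rows):.6f}")
# Roughly 0.63 for 1,000 rows, 0.9999 for 10,000, and ~1.0 for 100,000:
# the larger the row group, the more often its non-predicate columns
# end up being decompressed anyway.
```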