Hive: A Warehousing Solution Over a Map-Reduce Framework - Facebook, 2009

I’m a big fan of Hive; I used it extensively in Project Rhino as part of the ETL process. It saved us many hours of writing and optimizing MapReduce jobs. No one else on the team had used MapReduce before, but they were all familiar with SQL, so everyone was able to get up to speed very quickly.

Our initial version of the ETL process was mostly contained within a PostgreSQL DB, so it was pretty easy to transform the original queries into HiveQL.

  • Each table has its own HDFS directory
  • Users associate a table with a serialization format
  • Uses lazy de-serialization
  • Table can have partitions
    • For example, click log data could be first partitioned by date and then by country
    • The data would be stored in directories like /warehouse/clicks/ds=20090101/ctry=US (see the partitioning sketch after these notes)
    • A query can run against all or just some partitions, so Hive only reads the matching directories instead of scanning the entire table
    • Partitions can have different schemas and serialization
      • Allows for schema evolution
  • Tables can be bucketed
    • This is based on the hash of one of the columns
    • Used to sample data without resorting to a full table scan (see the bucketing sketch after these notes)
  • HiveQL
    • Not complete SQL
    • Has a DDL and DML
      • DML can only load or insert, no updating or deleting
    • Multi-table Insert
      • Woah, I didn’t know about this. Super useful
      • You can run a single scan of a table and piggyback multiple queries on that same scan for efficiency (see the multi-table insert sketch after these notes)
    • Supports User Defined Functions (UDFs) and User Defined Aggregation Functions (UDAFs) written in Java
    • Can use streaming MapReduce scripts written in any language (see the TRANSFORM sketch after these notes)
      • We probably should have used this more instead of some of our custom Java MapReduce jobs
      • Then they could have been piggybacked on other table scans
  • Individual jobs communicate intermediate results through temporary files on HDFS; there is no streaming between jobs like in Google’s Tenzing
  • Architecture
    • Metastore - Stores the catalog of tables and formats
      • Uses a traditional RDBMS like MySQL or PostgreSQL
    • Interfaces - CLI, web UI, and JDBC / ODBC
    • Thrift Server - exposes client API for the drivers
    • Driver - controls the lifecycle of a Hive query and submits it to the execution engine
    • Compiler - compiles HiveQL into a plan (a DAG of MapReduce jobs)
    • Execution Engine - Hadoop
  • Future Work
    • The authors say Hive is just a first step in building out their warehouse
    • Currently they have a rule-based optimizer but plan to build a cost-based optimizer
    • They want to experiment with columnar storage (which they later did with RCFile)
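
A few HiveQL sketches of the features above. First, partitioning: a hypothetical click-log table partitioned by date and country, roughly matching the paper’s example. The table and column names are my own placeholders, not taken from the paper.

```sql
-- Hypothetical click-log table, partitioned by date and country.
CREATE TABLE clicks (
  user_id BIGINT,
  url     STRING,
  ts      BIGINT
)
PARTITIONED BY (ds STRING, ctry STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Each (ds, ctry) pair gets its own directory, e.g.
-- /warehouse/clicks/ds=20090101/ctry=US. A predicate on the partition
-- columns lets Hive read just those directories:
SELECT COUNT(1)
FROM clicks
WHERE ds = '20090101' AND ctry = 'US';
```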
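Bucketing sketch: the same kind of table hashed on user_id into a fixed number of buckets, which lets TABLESAMPLE read a single bucket instead of the full table. Again, the names and bucket count are placeholders.

```sql
-- Hypothetical bucketed variant: rows are hashed on user_id into 32 buckets.
CREATE TABLE clicks_bucketed (
  user_id BIGINT,
  url     STRING,
  ts      BIGINT
)
PARTITIONED BY (ds STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;

-- Sample roughly 1/32 of the data by reading only the first bucket.
SELECT COUNT(1)
FROM clicks_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id)
WHERE ds = '20090101';
```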
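Multi-table insert sketch: one pass over clicks feeds two aggregations, so both outputs piggyback on the same scan. The output tables are hypothetical and assumed to already exist.

```sql
FROM clicks
INSERT OVERWRITE TABLE clicks_by_country
  SELECT ctry, COUNT(1)
  GROUP BY ctry
INSERT OVERWRITE TABLE clicks_by_url
  SELECT url, COUNT(1)
  GROUP BY url;
```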
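Streaming sketch: TRANSFORM pipes rows through an external script over stdin/stdout, so the script can be written in any language. my_parser.py is a made-up script name; assume it reads tab-separated (user_id, url) rows and emits (user_id, domain) rows.

```sql
-- Ship the script to the cluster alongside the job.
ADD FILE my_parser.py;

SELECT TRANSFORM (user_id, url)
       USING 'python my_parser.py'
       AS (user_id BIGINT, domain STRING)
FROM clicks
WHERE ds = '20090101' AND ctry = 'US';
```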