GraphChi: Large-Scale Graph Computation on Just a PC - CMU, 2012

  • [Paper] [Mirror]
  • [OSDI’12] [Video]

  • Big Graph Problems are not the same as Big Data Problems
    • The actual data can fit on a hard drive.
  • Main Problem: Random access on a drive is slow
  • Parallel Sliding Windows: Load/Process/Store all of the data of the adjacent edges at the same time.
    • Graph is stored as in-edges.
    • Split the vertices across P shards (each a file on disk, each file has all the incoming edges to that vertex).
    • Sort the edges in each file by the source vertex.
    • Let’s try and load the sub-graph of shard 1:
      1. Load the first file, now we have all of the in-edges for the subgraph.
      2. Since the other files are sorted by source vertex, all of the out-edges will be at the start of the other shards! We load those.
  • In total, P^2 reads and writes for a full pass on the graph. But P is relatively small. Thus few random reads.
  • Main Takeway: Sometimes by using algorithms and datastructures more cleverly we can achieve similar performance on a single machine.