GraphChi: Large-Scale Graph Computation on Just a PC - CMU, 2012

[Paper] [Mirror]
[OSDI’12] [Video]
Big Graph Problems are not the same as Big Data Problems
- The actual data can fit on a hard drive.
Main Problem: Random access on a drive is slow
Parallel Sliding Windows: Load/Process/Store all of the data of the adjacent edges at the same time.
- Graph is stored as in-edges.
- Split the vertices across P shards (each a file on disk, each file has all the incoming edges to that vertex).
- Sort the edges in each file by the source vertex.
- Let’s try and load the sub-graph of shard 1:
  1. Load the first file, now we have all of the in-edges for the subgraph.
  2. Since the other files are sorted by source vertex, all of the out-edges will be at the start of the other shards! We load those.
In total, P^2 reads and writes for a full pass on the graph. But P is relatively small. Thus few random reads.
Main Takeway: Sometimes by using algorithms and datastructures more cleverly we can achieve similar performance on a single machine.