GraphChi: Large-Scale Graph Computation on Just a PC
- CMU, 2012
Big Graph Problems are not the same as Big Data Problems
The actual data can fit on a hard drive.
: Random access on a drive is slow
Parallel Sliding Windows: Load/Process/Store all of the data of the adjacent edges at the same time.
Graph is stored as in-edges.
Split the vertices across P shards (each a file on disk, each file has all the incoming edges to that vertex).
Sort the edges in each file by the source vertex.
Let’s try and load the sub-graph of shard 1:
Load the first file, now we have all of the in-edges for the subgraph.
Since the other files are sorted by source vertex, all of the out-edges will be at the start of the other shards! We load those.
reads and writes for a full pass on the graph. But P is relatively small. Thus few random reads.
: Sometimes by using algorithms and datastructures more cleverly we can achieve similar performance on a single machine.
Other notes about graph
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
Pregel: A System for Large-Scale Graph Processing