Graph processing in query languages – Part III using Hive and HBase

Approach

  • HBase is great at large number of (on-demand) columns, random writes and reads and maintaining history
  • Against each node(row key), treat every neighbor as a column with a fixed (can be extended for weighted graphs) value of 1. Include self loop.
  • Pick node and least neighbor and insert into another hbase table with just node and component-id(least neighbor)
  • Use joins (based on GIMV) to refine component-ids. Iterate till convergence is achieved. (In several ways join is also similar to a BSP approach in Pregel/Giraph)
  • Use hive as the language for above steps

Solution is here

Testing

  • Illustrative log on a small graph is here. Its summary is here
  • Benchmark has not been done

Features

  • Incremental processing of graphs is feasible
  • Should theoretically scale for web scale graphs – sparse or dense.
  • Though based on GIMV – Matrix Vector multiplication isnt needed. GIMV is a very useful base to model solutions like above for other graph problems that can be solved by it.

References

Illustration

Gimv hbase