Graph processing in query languages using UDAFs

Introduction

Processing graphs is a fundamental problem area that has applications in several domains ex: networks. One of the building-block problems in graph processing is “finding connected components (sets of fully connected nodes)”.

It is not straight forward to express graph problems as structured queries unless the database itself exposes primitives for graph processing ex: graph oriented databases like neo4j might do this .

In this article I will attempt to present that graph processing problems can be expressed as queries as long as the database (or the datawarehouse) allows extending its core functionality through UDAF(User Defined Aggregation Function).

I will use Hadoop/Hive as the data warehouse system and use “finding connected components” as the illustrative graph problem. I will try to build code, examine results, make observations for future work and list few optimizations.

Problem Definition

Given the following edges in the graph

N1,N3
N1,N5
N2,N1
N2,N6
N7,N8
N10,N11
N20,N4
N9,N4

Find connected components as below in SQL

[[“N1″,”N3″,”N5″,”N2″,”N6”],[“N7″,”N8”],[“N10″,”N11”],[“N20″,”N4″,”N9”]]

Related Topics

  1. Union Find and Percolation
  2. Transitive closures in SQL
  3. With Recursive in SQL

Solution Strategy

  1. Hadoop-Hive ecosystem allows for distributed processing of data using SQL-like language – HQL.
  2. Transitive closure of unlimited depth arent possible in SQL directly but
  3. UDAFs allow for processing edges row-by-row and accumulate partial results and merge them later to generate final results.
  4. QuickUnion, WeightedQuickUnion, WeightedQuickUnion With path compression are algorithms that can help identify connected components faster.

Solution and Result

UDAF : List<List<Text>> components(column1, column2)

Query : “Select components(node1, node2) from edges”

Code and log

  1. ConnectedComponents.java
  2. results.log

Code and log with optimization 1 (Use Maps instead of List to speedup “contains” checks and avoid nested loops)

  1. ConnectedComponents.java
  2. results.log

Observations for future work

Related problems that can be solved using above UDAF

Using UDAFs for graph processing as above allows for leveraging SQL for generic graph problem ex:

  1. finding if any two nodes, say N3 and N6 in above case, are connected,
  2. is graph fully connected,
  3. finding paths from one node to another, say N3 to N6
  4. finding set of largest size
  5. Incrementally solving any of the above problems i.e. as new edges get added incrementally solving above problems looking only at new edges, say edges table is partitioned by day and the above problems need to be solved on a daily basis.
  6. Build a library of graph processing UDAFs

Other possibilities

  1. Transitive closure problems in SQL can be resolved using this approach ex: finding top manager for a given employee, finding oldest ancestor.

Optimizations

  1. Use Maps instead of List to speedup “contains” checks and avoid nested loops. This would give a result like this –
    {“N9”:[“N20″,”N4″,”N9″],”N5”:[“N1″,”N3″,”N5″,”N2″,”N6″],”N6”:[“N1″,”N3″,”N5″,”N2″,”N6″],”N7”:[“N7″,”N8″],”N8”:[“N7″,”N8″],”N1”:[“N1″,”N3″,”N5″,”N2″,”N6″],”N2”:[“N1″,”N3″,”N5″,”N2″,”N6″],”N3”:[“N1″,”N3″,”N5″,”N2″,”N6″],”N20”:[“N20″,”N4″,”N9″],”N4”:[“N20″,”N4″,”N9″],”N10”:[“N10″,”N11″],”N11”:[“N10″,”N11”]}
    The final result can still be presented as a List<List<Text>>> as above, or in some other form, by modifying the final terminate() method.
  2. Use WeightedQuickUnion with path compression till its time to terminate partial processing and then covert them to Map<Text, List<Text>>.

Conclusion

UDAFs can be leveraged to build graph processing primitives in database or data warehouse that provide features to support UDAFs.

Some databases that support UDAFs are

  1. MS SQL Server 2014
  2. Oracle 11g
  3. PostgreSQL

References

  1. Union Find algorithms – http://algs4.cs.princeton.edu/15uf/
  2. Transitive Closure – http://en.wikipedia.org/wiki/Transitive_closure
  3. Aggregation functions in Oracle –http://docs.oracle.com/cd/B28359_01/appdev.111/b28425/aggr_functions.htm
  4. Aggregation functions in MSSQL – http://msdn.microsoft.com/en-us/library/ms131051.aspx
  5. Aggregation functions in PostgreSQL –http://www.postgresql.org/docs/9.1/static/xaggr.html
  6. Neo4J DFS – StackOverflow Question –http://stackoverflow.com/questions/18112240/searching-for-strongly-connected-components-using-neo4j