Graph processing in query languages using UDAFs – Part II

In the previous post I have attempted to make a case that it is feasible to perform some graph processing, (specifically the problem of finding components in a graph given edges), in query languages using user-defined-aggregation-functions.

The approach relied on going over edges one by one and continuously build up components. The components were built as a <Key, Value> pair where Key was the node and the Value was the array of nodes constituting a component.

For example for a dataset like this


Below would be the result


The above approach gives results in a map structure that is easier to join with other tables.

This approach as we noted in earlier post could need considerable compute power as the number of edges increase. This is due to the fact that as number of edges increase the component size inturn increases and this leads to O(n^3) complexity for the merge method which is considerable.

We also noted in the earlier post that a better approach would be to implement Weighted Quick Union with Path Compression(WQUPC) algorithm which has several optimizations ontop of the map approach. Some of them include

    1. Build trees to represent clusters.
    2. Reduce depth of individual trees as the algorithm progresses through edges.
    3. Use integer arrays vs heavier data structures
    4. No repetitions of clusters

NOTE: This approach would need atleast one array as big as the number of nodes in the graph.

There is one challenge in implementing the above approach in a horizontal-scale cluster setup ex: in Hadoop-Hive ecosystem.

Sets of edges need to be looked at in parallel and partial results need to be merged.
Below is an attempt at implementing WQUPC in Hive as an UDAF.

Below is the result in the form of <Node, Root of the cluster tree>

N9 N4
N8 N8
N7 N8
N6 N5
N5 N5
N4 N4
N3 N5
N20 N4
N2 N5
N11 N11
N10 N11
N1 N5

This approach is very close to O(m* log(n)) complexity where n is the number of edges and m is the number of connections.

It easily scales to 10s of millions of edges.

Gitlab – Bitnami – An Awesome combination

Gitlab community edition is a fantastic software for code hosting and bitnami has a very nice self-contained gitlab distribution.

Few observations on bitnami distribution of gitlab 7.3.2.

Few references for upgrade:

Fix for Poodle – SSL 3.0 Vulnerability

Another security vulnerability came to light today, on SSL this time.

What: POODLE attack (Padding Oracle On Downgraded Legacy  Encryption) will allow stealing “secure” HTTP cookies (or other bearer  tokens such as HTTP Authorization header contents).
Test: If the below command succeeds it means that this vulnerability exists.

$ curl -v3 -X HEAD “”

Details : For more details read the security advisory at

Fix: If SSLv3 is already disabled on the server side then nothing needs to be done else disabling the SSL 3.0 protocol in the client or in the server (or both) will completely avoid it. Since we cant control clients this means fixing it on the server side is the guaranteed fix. Also no modern browsers or mobile devices need SSLv3 – not even IE 8 on Windows XP.



On an Ubuntu system with openssl 1.0.1f to disable SSLv3 on server side modify the entry “SSLProtocol all” in ssl.conf (under apache’s mods-available directory) to “SSLProtocol +TLSv1 +TLSv1.1 +TLSv1.2”

$ openssl version
OpenSSL 1.0.1f 6 Jan 2014
$ grep -nri “SSLProtocol” /etc/apache2/
/etc/apache2/mods-available/ssl.conf:64:    SSLProtocol all
# modify contents in ssl.conf
$ sudo vi /etc/apache2/mods-available/ssl.conf
[sudo] password for sandeep:
$ grep -nri “SSLProtocol” /etc/apache2/
/etc/apache2/mods-available/ssl.conf:64:    SSLProtocol +TLSv1 +TLSv1.1 +TLSv1.2
$ sudo service apache2 restart

Graph processing in query languages using UDAFs


Processing graphs is a fundamental problem area that has applications in several domains ex: networks. One of the building-block problems in graph processing is “finding connected components (sets of fully connected nodes)”.

It is not straight forward to express graph problems as structured queries unless the database itself exposes primitives for graph processing ex: graph oriented databases like neo4j might do this .

In this article I will attempt to present that graph processing problems can be expressed as queries as long as the database (or the datawarehouse) allows extending its core functionality through UDAF(User Defined Aggregation Function).

I will use Hadoop/Hive as the data warehouse system and use “finding connected components” as the illustrative graph problem. I will try to build code, examine results, make observations for future work and list few optimizations.

Problem Definition

Given the following edges in the graph


Find connected components as below in SQL


Related Topics

  1. Union Find and Percolation
  2. Transitive closures in SQL
  3. With Recursive in SQL

Solution Strategy

  1. Hadoop-Hive ecosystem allows for distributed processing of data using SQL-like language – HQL.
  2. Transitive closure of unlimited depth arent possible in SQL directly but
  3. UDAFs allow for processing edges row-by-row and accumulate partial results and merge them later to generate final results.
  4. QuickUnion, WeightedQuickUnion, WeightedQuickUnion With path compression are algorithms that can help identify connected components faster.

Solution and Result

UDAF : List<List<Text>> components(column1, column2)

Query : “Select components(node1, node2) from edges”

Code and log

  2. results.log

Code and log with optimization 1 (Use Maps instead of List to speedup “contains” checks and avoid nested loops)

  2. results.log

Observations for future work

Related problems that can be solved using above UDAF

Using UDAFs for graph processing as above allows for leveraging SQL for generic graph problem ex:

  1. finding if any two nodes, say N3 and N6 in above case, are connected,
  2. is graph fully connected,
  3. finding paths from one node to another, say N3 to N6
  4. finding set of largest size
  5. Incrementally solving any of the above problems i.e. as new edges get added incrementally solving above problems looking only at new edges, say edges table is partitioned by day and the above problems need to be solved on a daily basis.
  6. Build a library of graph processing UDAFs

Other possibilities

  1. Transitive closure problems in SQL can be resolved using this approach ex: finding top manager for a given employee, finding oldest ancestor.


  1. Use Maps instead of List to speedup “contains” checks and avoid nested loops. This would give a result like this –
    The final result can still be presented as a List<List<Text>>> as above, or in some other form, by modifying the final terminate() method.
  2. Use WeightedQuickUnion with path compression till its time to terminate partial processing and then covert them to Map<Text, List<Text>>.


UDAFs can be leveraged to build graph processing primitives in database or data warehouse that provide features to support UDAFs.

Some databases that support UDAFs are

  1. MS SQL Server 2014
  2. Oracle 11g
  3. PostgreSQL


  1. Union Find algorithms –
  2. Transitive Closure –
  3. Aggregation functions in Oracle –
  4. Aggregation functions in MSSQL –
  5. Aggregation functions in PostgreSQL –
  6. Neo4J DFS – StackOverflow Question –

VoltDB – Cheat Sheet

  1. Sequences in VoltDB –
  2. Dual and Connection keep alive query – select 1 from dual –
  3. Variable length encoding schemes – define varchar using character length instead of byte size –
  4. Install VoltDB client jar into local maven repository –

NULL in SQL operations in Netezza

The equivalent of Oracle’s Dual table in Netezza is a view named _v_dual

--select 1/null from _v_dual;
--select 1*null from _v_dual;
--select 1 + null from _v_dual;
--select 1 - null from _v_dual;
--select DECODE(0, 0, 1, 2/DECODE(0,0, null, 0)) from _v_dual;
--select DECODE(3, 0, 1, 2/DECODE(0,0, null, 0)) from _v_dual;
--select DECODE(4, 0, 1, 2/DECODE(5,0, null, 0)) from _v_dual;
Error: ERROR: Divide by 0

SQLState: HY000
ErrorCode: 1100

--select DECODE(4, 0, 1, 2/DECODE(5,0, null, 1)) from _v_dual;

Building a Personal Website.

This site is built using

  • Ubuntu
  • AWS : EC2, Elastic IP
  • Apache : 3 Sites, Virtual Hosts, a2ensite, a2enmod, mod proxy http, customized document Root
  • JRuby on Rails
  • JQuery
  • Twitter Bootstrap: JS and CSS
  • Google Drive: Docs and Presentations
  • Atlassian: Jira and GreenHopper
  • Git
  • Google Analytics
  • WordPress

Below is the sequence of tasks that I took thus far

  • Setup an EC2 node
    • Purchase a reserved instance of m1.large type
    • choice of m1.large was made based on three factors
      • Considerable memory may be required for the large number of applications that will be setup
      • Higher CPU may not be required given that the load will not be high
      • micro, small, medium may be too small for the memory requirements and m2.xlarge would be too high
    • Purchase a reserved instance of shortest term possible (In this case I purchased a node from a 3rd party for 1 month term)
    • Consider a heavy utilization instance since, the node has to be up and running all the time and load cannot be predicted.
    • Take Ubuntu 13.04 AMI
  • Setup Apache
  • Setup MySQL
  • Purchase and Configure a domain name
  • Purchase Jira Starter license and Setup Jira with MySQL as backend
  • Setup WordPress with MySQL as backend
  • Setup a private repository on to host website project
  • Configure website with Twitter bootstrap starter template
  • Configure website to centralize all content


Big O comparison

logx < x < x*logx < x^2 < 2^x

  1. Each part of the above statement is true beyond a certain value of x and
  2. All of the above statement is also true for x > 2 (if log is taken at base 2)
  3. Now what’s missing above is how does 2 ^ logx compare to the above values? Hmmm how about an image?
  4. Here’s how, below is the diagram comparing 2 ^ logx to the rest.(Octave code). Remember a ^ log b is same as b ^ log a hence 2 ^ logx is same as x ^ log2. which is less than x since  log2 to the base e < 1.

logx  <  2 ^ logx  <  x  <  x*logx  <  x^2  <  2^x


Why is the Big O of merge sort not 2^logn?

Even after hearing. learning, programming merge sort several times in the past, suddenly I am caught in a logical knot today again. I know that Big O of Merge Sort is n*logn but

  • since there are atleast 1 comparisons at the last step and
  • 2 comparisons at the last but one step and
  • 4 comparisons in the last but second steps and so on..
  • isn’t 1 + 2 + 4 + ….n/2 =  (1 – 2^logn)/(1-2) = 2^logn
  • i.e. it seems the merge sort is O(2^logn).

Here’s how I untangled it by writing down a break down of all the steps involved.

  1. lets say values is an int array of size n.
  2. During merge sort we break it two halves of size n/2 and 
  3. continue to perform the same operation on each of the halves until we get arrays of sie 1
  4. at this point we merge two arrays of size 1 by comparing them and we get array of size 2
  5. Similarly 1 comparison is required at least n/2 times
  6. this is because we are breaking n into n/2 two size arrays recursively.
  7. From above it seems that we need to make n/2 1 comparisons and n copy statements in the worst case i.e. n/2 + n are the total number of operations required to come up with n/2 sorted 2 size arrays.
  8. Taking this further we need (n/4)*2 + n here there are 2 comparisons required and this continues
  9. given above the total number of steps seems to be = (n/2 + n) + ((n/4)*2 + n) + …
  10. i.e. nlogn + n/2 * logn
  11. i.e. 3n/2 logn

QED and my mind is restored, the logical mistake that I committed was assuming that 1 comparison was required only once and 2 comparisons were required only one etc. but in fact

  1. 1 comparisons was required n/2 times
  2. 2 comparison were required n/4 times and so on untill
  3. n/2 comparisons which was required only once
  4. i.e. n/2 + n2+ … = n/2 * logn comparisons in total