Compression and Analytics : Querying Compressed Files – Part 1

Efficient ways of searching compressed files, that are comparable to searching uncompressed files, would benefit several applications, analytical or otherwise.

Specifically if space savings can be achieved without adding significant costs in searching for content then cost savings can be achieved by reducing number of disks and thereby reducing number of nodes/servers needed to store the data.

Some relevant ideas (some can be applied in combination with others)

  1. Pre-processing the file by reading it once and extracting metadata that can be used to quickly match against predicates ex: partitioning, file-level indexes, min/max values of a column etc. ex: Netezza (compresses records after pre-processing and uses FPGA to decompress files), Hive(ex: using ORC serde to store data in compressed columnar form) use some of these approaches on compressed files.
  2. For files with small alphabets code compression would help. For example 2-bit code(similar to bitmaps) compression on a genome file containing A,C, T, G characters – gives 75% space saving (or 25% compression ratio) compared to an original ASCII file containing the characters themselves. Both compressed and uncompressed files can be read in approximately same time since the alphabet contains very few letters.
  3. Flip the search by compressing search string using compression algorithm’s specifics i.e. translate search string into a compressed binary format that can be quickly compared against the binary compressed content. (NOTE: cannot be used for regular expression searches yet could have wide applicability). For example
    • For code compression, code the search string itself.
    • For Huffman pre-process to read and build prefix-free codes trie and then code the search string.
    • For LZW pre-process once by decompressing the file completely to rebuild code table and then code the search string using the code table.
    • For LZ77(used by gzip) pre-process once per sliding window and generate a separate file (a serialized symbol table with block ids as key and a trie of repeated strings may be??) which can be later used to check a search string is present in a block.


For illustrative purposes below is an experiment, its implementation and some results. It does not apply above ideas instead tries to establish baseline by creating an analogy for grep and zgrep(decompress and search) tools.

  1. Implement GREP to search all matches of a regular expression on a stream reader.
  2. Unzip .gz files compressed files and retain a copy of original .gz file
  3. Read uncompressed file and GREP.
  4. Read compressed file, decompress and stream the content to GREP.
  5. Repeat and measure difference in time ( TO-DO NOTE: memory is another parameter to measure)

To ensure apples-to-apples comparison use one language for all of the above,


Source code for the implementation is here.

  • It is based on GREP implementation using NFAs (available here) based on Kleene’s theorem which states equivalency between DFA and Regular Expressions.


  • For 22K compressed file and 225K uncompressed file and 1000 trials – Average Time difference uncompressed – compressed : -0.002 seconds i.e. the additional cost of decompressing and searching is 2 ms
  • For 147K compressed file and 2.3M uncompressed file and 100 trials – Average Time difference between searching uncompressed vs compressed files is : -0.009 seconds. i.e. the additional cost of decompressing and searching is 9 ms.
  • For 34M compressed file and 229M uncompressed file and 10 trials – Average Time difference between searchinguncompressed – compressed : -1.773 seconds i.e. the additionalcost of decompressing and searching is 1.773 seconds

Next Steps

  • Attempt implementation of Flipped Search on LZW and LZ77 and compare results.

Java 8 Streams(Parallel), Lambda expressions and Amdahl’s law observations


Java has fantastic language features ex: Generics, Annotations and now Lambda expressions. These are complimented very well by several core capabilities packaged within a JDK ex: collections (also newer concurrency package), multi-threading.

“Streams” – a feature that Java made available in 8 in combination with lambda expressions can be extremely expressive, almost akin to functional programming (leverages Functional interfaces). It also makes writing code to leverage muti-processors, for parallel execution, much more simpler.

Purely for illustrative purposes I thought of putting together a scenario to understand these features better.

Below is the description of the scenario:

  • 100,000 products have monthly sales over one year period.
  • Requirement is to sort products by their sales of a specific month.
  • If we were to do this on a single processor – the filter, sort and collect operations will happen sequentially with sort being the slowest step.
  • If we were to do this in a multi-processor system – we have a choice to either do them in parallel or sequentially.

(Caveat all operations do not uniformly benefit with parallel execution ex: Sorting needs specialized techniques to improve performance while filtering should theoretically execute much faster straight away. Concurrency and List implementations esp. ArrayList is again a non-trivial consideration.In addition streams add a different dimension. NOTESorting in parallel, using streams or otherwise, is unstable Stable refers to the expectation that two equal valued entries in a list appear in their original order post sorting as well.)

To simplify things we could break this into 4 smaller test cases

  1. In stream execute filter and collect and then sort the list
  2. In parallel stream execute filter and collect and then sort the list
  3. In stream execute filter, sort and collect the list
  4. In parallel stream execute filter, sort and collect operations.(NOTE: don’t try this as noted above)

Timing 100 runs of each of the above 4 test cases and making observations against Amdahl’s law, which helps predict the theoretical speedup when using multiple processors, felt like a fun illustrative way to understand – streams, parallel streams, sorting, lambda expressions and multi-processor runs.

Here’s the code for it and below are results of one complete execution.

  • For case 2 vs case 1: Average Speedup = 1.2291615907722224 Amdahls percentage = 0.18190218936593688 and Speedup = 1.1000508269148064
  • For case 4 vs case 3:Average Speedup = 2.1987344172707037 Amdahls percentage = 0.8987062837645722 and Speedup = 1.8160459562383011

Distributed Database Design – Notes

  • Most distributed systems(ex: HBase, VoltDB) require coordination among-st its nodes for several/all of its tasks. Coordination can be either taken up by the system itself or the system can give this responsibility to a separate coordination system ex: Zookeeper.
  • The coordination system itself needs to be fault tolerant, preferably very high throughput, scalable (preferably horizontally scalable) system i.e. it is a distributed system on its own right.
  • Zookeeper Paper (Wait free coordination for large distributed systems) has these key ideas
    • Provide wait-free base features and let clients of zookeeper implement higher order primitives on-top of these simpler primitives ex: read/write locks, simple locks without herd effect, group membership, configuration management.
    • The base features are
      1. sequential/linearizable writes at leader node
      2. All reads are locally executed by the client
      3. FIFO execution of client requests
      4. Optimistic locking i.e. Version based operations
    • The underlying mechanisms for linearizables writes is
      • A protocol for Atomic broadcast — zookeeper defines  this as guaranteed delivery of a message to all nodes (ex: a write message from leader)
      • ZAB(Zookeeper Atomic Broadcast) seems to be essentially a variant of two phase commit protocol. An article explaining it further is here.
  • VoltDB is a in-memory, fault tolerant, ACID compliant, horizontal scale system. Some of its architecture choices are here. The key ideas are
    • Single thread operations
    • No client controlled transactions. instead each stored procedure runs in a transaction.
    • Partitioning and shared-nothing cluster (with replication for fault tolerancve)
    • leveraging/support deterministic operations. Instead of writing at master and propagating changes to all nodes containing replicas, each node which has a replica executes same transaction in parallel.
  • VoltDB is based on this paper which tries to analyze operations involved the OLTP systems and its key observations are
    • buffer management, locking, logging and latching are the costliest operations in OLTP….unless one strips out all of these components, the performance of a main memory-optimized database is unlikely to be much better than a conventional database where most of the data fit into RAM.
    • In a fully stripped down system — e.g., that is single threaded, implements recovery via copying state from other nodes in the network, fits in memory, and uses reduced functionality transactions — the performance is orders of magnitude better than an unmodified system.
  • VoltDB uses  Zookeeper for leader election. More here.
Some more notes on VoltDB:

Graph processing in query languages – Part III using Hive and HBase


  • HBase is great at large number of (on-demand) columns, random writes and reads and maintaining history
  • Against each node(row key), treat every neighbor as a column with a fixed (can be extended for weighted graphs) value of 1. Include self loop.
  • Pick node and least neighbor and insert into another hbase table with just node and component-id(least neighbor)
  • Use joins (based on GIMV) to refine component-ids. Iterate till convergence is achieved. (In several ways join is also similar to a BSP approach in Pregel/Giraph)
  • Use hive as the language for above steps

Solution is here


  • Illustrative log on a small graph is here. Its summary is here
  • Benchmark has not been done


  • Incremental processing of graphs is feasible
  • Should theoretically scale for web scale graphs – sparse or dense.
  • Though based on GIMV – Matrix Vector multiplication isnt needed. GIMV is a very useful base to model solutions like above for other graph problems that can be solved by it.



Gimv hbase

Clone all github and bitbucket repos in one go

Clone all your github repos in one command

curl -u <<username>> -s<<username>>/repos?per_page=200 | ruby -rubygems -e 'require "json"; JSON.load( { |repo| %x[git clone #{repo["ssh_url"]} ]}'

Clone all your bitbucket repos in one command

#script to get all repositories under a user from bitbucket
#Usage: [username]

curl -u ${1}${1} > repoinfo
for repo_name in `grep -oE '\"name\": "[^"]*"' repoinfo | cut -f4 -d\"`
 git clone${1}/$repo_name.git



Setting up an AWS EC2 node as jenkins slave to a master hosted locally

Setup a password less ssh authentication for <<user>> on EC2 node

  1. Login to AWS EC2 node
  2. su to the user where you would setup jenkins slave
  3. create a key value pair
  4. add pub part of the key value pair to the authorized_keys
  5. copy the private part of the key value pair into a local file. Name it as <<user>>.pem
  6. chmod 400 <<user>>.pem
  7. test ssh to the ec2 node as <<user>> from local
    1. ssh -i <<user>>.pem <<user>>@ec2-node

Add ec2 node as slave node in jenkins

  1. scp the pem file to the jenkins master server
  2. Login to jenkins UI as admin
  3. Go to Jenkins —> Manage Jenkins –> Manage Nodes —> New Node
  4. provide /home/<<user>> as remote FS root
  5. click on advanced button
  6. provide hostname and user name
  7. leave password blank
  8. specify absolute path to the pem file for this <<user>>
  9. save and launch slave

Configure jenkins jobs to run on ec2 slave node

  1. go to jenkins job that you want to run remotely
  2. click on “Restrict where this project can be run”
  3. specify the slave name as “Label Expression”
  4. save the job and build.

Measuring performance and understanding user navigation through javascript

Measuring performance and understanding user navigation is supported by most of the browsers now natively through their implementations of “Navigation & Timing” W3C spec. It doesn’t need any additional library.

For example: The below code snippets work correctly on both IE(tested on version is 11.x) and Chrome (tested on 40.x).

Measure page load time.

var perfData = window.performance.timing;
var pageLoadTime = perfData.loadEventEnd – perfData.navigationStart;

Measure request response time.

var connectTime = perfData.responseEnd – perfData.requestStart;

Several other metrics can be captured directly using the window.performance object ex: window.performance.memory returns information about memory consumption of the page, window.performance.navigation.type tells if a page load is triggered by a redirect, back/forward button or normal URL load.

Measure Ajax Requests

app.render = function(content){ 
myEl.innerHTML = content; window.performance.mark('end_render'); 
window.performance.measure('measure_render', 'start_xhr', 'end_render'); 
var req= new XMLHttpRequest();'GET', url, true); 
req.onload = function(e) { window.performance.mark('end_xhr'); window.performance.measure('measure_xhr', 'start_xhr', 'end_xhr'); app.render(e.responseText); } 

Measure Custom events

Very similar to measuring Ajax Requests using mark and measure.

More information Here: