Compression and Analytics : Querying Compressed Files – Part 1

Efficient ways of searching compressed files, comparable in cost to searching uncompressed files, would benefit several applications, analytical or otherwise.

Specifically, if space savings can be achieved without adding significant cost to searching for content, then costs can be reduced by using fewer disks and thereby fewer nodes/servers to store the data.

Some relevant ideas (some can be applied in combination with others)

  1. Pre-processing the file by reading it once and extracting metadata that can be used to quickly match against predicates, e.g. partitioning, file-level indexes, min/max values of a column. Netezza (which compresses records after pre-processing and uses FPGAs to decompress) and Hive (e.g. using the ORC SerDe to store data in compressed columnar form) apply some of these approaches to compressed files.
  2. For files with small alphabets, code compression helps. For example, a 2-bit code (similar to bitmaps) on a genome file containing only A, C, T and G gives a 75% space saving (or 25% compression ratio) compared to an ASCII file containing the characters themselves. Both compressed and uncompressed files can be read in approximately the same time since the alphabet contains very few letters (see the sketch after this list).
  3. Flip the search by compressing the search string using the compression algorithm’s specifics, i.e. translate the search string into a compressed binary format that can be quickly compared against the compressed binary content. (NOTE: this cannot be used for regular-expression searches, yet it could have wide applicability.) For example:
    • For code compression, code the search string itself.
    • For Huffman, pre-process once to read and rebuild the prefix-free code trie, then code the search string.
    • For LZW, pre-process once by decompressing the file completely to rebuild the code table, then code the search string using that table.
    • For LZ77 (used by gzip), pre-process once per sliding window and generate a separate file (perhaps a serialized symbol table keyed by block id, holding a trie of repeated strings) which can later be used to check whether a search string is present in a block.
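Below is a minimal sketch combining ideas 2 and 3, assuming a fixed 2-bit code for the A/C/G/T alphabet: the text is packed four characters per byte and the search string is itself coded, so matching happens directly against the packed bytes without decompressing them. The class and method names are made up for this illustration.

import java.util.Arrays;

public class TwoBitSearch {

    // A=00, C=01, G=10, T=11 -- one possible fixed 2-bit code.
    private static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("Not in alphabet: " + c);
        }
    }

    // Packs 4 characters per byte: ~75% space saving over 1-byte ASCII.
    static byte[] compress(String genome) {
        byte[] packed = new byte[(genome.length() + 3) / 4];
        for (int i = 0; i < genome.length(); i++) {
            int shift = (3 - (i % 4)) * 2;
            packed[i / 4] |= code(genome.charAt(i)) << shift;
        }
        return packed;
    }

    // Reads the 2-bit code of the i-th character without unpacking the array.
    private static int codeAt(byte[] packed, int i) {
        int shift = (3 - (i % 4)) * 2;
        return (packed[i / 4] >> shift) & 0x3;
    }

    // "Flipped" search: the pattern is coded once, then compared code-to-code.
    static int indexOf(byte[] packed, int textLength, String pattern) {
        int[] patternCodes = pattern.chars().map(c -> code((char) c)).toArray();
        for (int start = 0; start + patternCodes.length <= textLength; start++) {
            int j = 0;
            while (j < patternCodes.length && codeAt(packed, start + j) == patternCodes[j]) {
                j++;
            }
            if (j == patternCodes.length) return start;
        }
        return -1;
    }

    public static void main(String[] args) {
        String genome = "ACGTGGTCACGTT";
        byte[] packed = compress(genome);
        System.out.println("Packed bytes: " + Arrays.toString(packed));
        System.out.println("Index of GGTCA: " + indexOf(packed, genome.length(), "GGTCA"));
    }
}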

Experiment

For illustrative purposes, below is an experiment, its implementation and some results. It does not apply the above ideas; instead it tries to establish a baseline by creating an analogue of the grep and zgrep (decompress and search) tools.

  1. Implement GREP to search all matches of a regular expression on a stream reader.
  2. Unzip the .gz compressed files and retain a copy of the original .gz file.
  3. Read uncompressed file and GREP.
  4. Read compressed file, decompress and stream the content to GREP.
  5. Repeat and measure the difference in time. (TO-DO NOTE: memory is another parameter to measure.)

To ensure an apples-to-apples comparison, one language is used for all of the above.

Implementation

Source code for the implementation is here.

  • It is based on a GREP implementation using NFAs (available here), which relies on Kleene’s theorem establishing the equivalence between regular expressions and finite automata. A minimal analogue of the baseline (steps 3 and 4) is sketched below.
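As a rough analogue of steps 3 and 4 above (this is not the linked NFA GREP; java.util.regex stands in for the matcher and the file names are placeholders), the only difference between the two paths is whether a GZIPInputStream is layered over the file stream:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

public class GrepBaseline {

    // Counts matching lines on any stream reader, so both paths share the same "GREP".
    static long grep(InputStream in, Pattern pattern) throws IOException {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.lines().filter(line -> pattern.matcher(line).find()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        Pattern pattern = Pattern.compile(args.length > 0 ? args[0] : "error.*timeout");

        long t0 = System.nanoTime();
        long plainMatches = grep(new FileInputStream("data.txt"), pattern);                        // step 3
        long t1 = System.nanoTime();
        long gzMatches = grep(new GZIPInputStream(new FileInputStream("data.txt.gz")), pattern);   // step 4
        long t2 = System.nanoTime();

        System.out.printf("uncompressed: %d matches in %.3f ms%n", plainMatches, (t1 - t0) / 1e6);
        System.out.printf("compressed  : %d matches in %.3f ms%n", gzMatches, (t2 - t1) / 1e6);
    }
}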

Results

  • For a 22K compressed file and 225K uncompressed file over 1000 trials – average time difference (uncompressed – compressed): -0.002 seconds, i.e. the additional cost of decompressing and searching is about 2 ms.
  • For a 147K compressed file and 2.3M uncompressed file over 100 trials – average time difference (uncompressed – compressed): -0.009 seconds, i.e. the additional cost of decompressing and searching is about 9 ms.
  • For a 34M compressed file and 229M uncompressed file over 10 trials – average time difference (uncompressed – compressed): -1.773 seconds, i.e. the additional cost of decompressing and searching is about 1.773 seconds.

Next Steps

  • Attempt implementation of Flipped Search on LZW and LZ77 and compare results.

Java 8 Streams (Parallel), Lambda expressions and Amdahl’s law observations


Java has fantastic language features, e.g. Generics, Annotations and now Lambda expressions. These are complemented very well by several core capabilities packaged within the JDK, e.g. collections (and the newer concurrency package) and multi-threading.

“Streams” – a feature Java made available in 8 – in combination with lambda expressions can be extremely expressive, almost akin to functional programming (it leverages functional interfaces). Streams also make it much simpler to write code that leverages multiple processors for parallel execution.

Purely for illustrative purposes I thought of putting together a scenario to understand these features better.

Below is the description of the scenario:

  • 100,000 products have monthly sales over a one-year period.
  • The requirement is to sort products by their sales in a specific month.
  • If we were to do this on a single processor, the filter, sort and collect operations would happen sequentially, with sort being the slowest step.
  • If we were to do this on a multi-processor system, we have a choice of doing them in parallel or sequentially.

(Caveat: not all operations benefit uniformly from parallel execution, e.g. sorting needs specialized techniques to improve performance while filtering should theoretically run much faster straight away. Concurrency and the choice of List implementation, esp. ArrayList, is again a non-trivial consideration. In addition, streams add a different dimension. NOTE: Sorting in parallel, using streams or otherwise, is unstable. Stable refers to the expectation that two equal-valued entries in a list appear in their original order after sorting as well.)

To simplify things we could break this into 4 smaller test cases (a sketch of cases 1 and 2 follows the list):

  1. In stream execute filter and collect and then sort the list
  2. In parallel stream execute filter and collect and then sort the list
  3. In stream execute filter, sort and collect the list
  4. In parallel stream execute filter, sort and collect operations. (NOTE: don’t try this, as noted above.)
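Here is a minimal sketch of test cases 1 and 2, assuming a made-up Product type with twelve monthly sales figures and an arbitrary filter threshold; the linked code below is the actual implementation.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StreamsSketch {

    static class Product {
        final int id;
        final double[] monthlySales = new double[12];
        Product(int id, Random r) {
            this.id = id;
            for (int m = 0; m < 12; m++) monthlySales[m] = r.nextDouble() * 1000;
        }
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        List<Product> products = IntStream.range(0, 100_000)
                .mapToObj(i -> new Product(i, random))
                .collect(Collectors.toList());
        int month = 5; // sort by June sales, say

        // Case 1: sequential stream -> filter -> collect, then sort the resulting list.
        List<Product> case1 = products.stream()
                .filter(p -> p.monthlySales[month] > 100)
                .collect(Collectors.toCollection(ArrayList::new));
        case1.sort(Comparator.comparingDouble((Product p) -> p.monthlySales[month]));

        // Case 2: same pipeline on a parallel stream; only filter/collect run in parallel.
        List<Product> case2 = products.parallelStream()
                .filter(p -> p.monthlySales[month] > 100)
                .collect(Collectors.toCollection(ArrayList::new));
        case2.sort(Comparator.comparingDouble((Product p) -> p.monthlySales[month]));

        System.out.println(case1.size() + " products after filtering; first id = " + case1.get(0).id);
        System.out.println(case2.size() + " products in the parallel run");
    }
}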

Timing 100 runs of each of the above 4 test cases and making observations against Amdahl’s law, which helps predict the theoretical speedup when using multiple processors, felt like a fun, illustrative way to understand streams, parallel streams, sorting, lambda expressions and multi-processor runs (a sketch of the Amdahl arithmetic follows the results).

Here’s the code for it and below are results of one complete execution.

  • For case 2 vs case 1: Average Speedup = 1.2291615907722224, Amdahls percentage = 0.18190218936593688 and Speedup = 1.1000508269148064
  • For case 4 vs case 3: Average Speedup = 2.1987344172707037, Amdahls percentage = 0.8987062837645722 and Speedup = 1.8160459562383011
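As a sanity check, the quoted figures line up with Amdahl’s law, speedup(N) = 1 / ((1 - p) + p/N), where p is the parallel fraction (the “Amdahls percentage” above). The assumption that N = 2 processors were used for the prediction is mine, chosen only because it reproduces the quoted Speedup values:

public class AmdahlCheck {

    // Amdahl's law: speedup with N processors given parallel fraction p.
    static double predictedSpeedup(double parallelFraction, int processors) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / processors);
    }

    public static void main(String[] args) {
        // Case 2 vs 1: p = 0.1819 -> ~1.10; case 4 vs 3: p = 0.8987 -> ~1.82.
        System.out.println(predictedSpeedup(0.18190218936593688, 2));
        System.out.println(predictedSpeedup(0.8987062837645722, 2));
    }
}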

Distributed Database Design – Notes

  • Most distributed systems (e.g. HBase, VoltDB) require coordination amongst their nodes for several or all of their tasks. Coordination can either be taken up by the system itself, or the system can delegate this responsibility to a separate coordination system, e.g. Zookeeper.
  • The coordination system itself needs to be fault tolerant, preferably very high throughput, and scalable (preferably horizontally), i.e. it is a distributed system in its own right.
  • The Zookeeper paper (“ZooKeeper: Wait-free coordination for Internet-scale systems”) has these key ideas
    • Provide wait-free base features and let Zookeeper clients implement higher-order primitives on top of these simpler primitives, e.g. read/write locks, simple locks without herd effect, group membership, configuration management (a lock-recipe sketch follows these notes).
    • The base features are
      1. Sequential/linearizable writes at the leader node
      2. Reads served locally by the server a client is connected to (they do not go through the leader)
      3. FIFO execution of client requests
      4. Optimistic concurrency, i.e. version-based operations
    • The underlying mechanism for linearizable writes is
      • A protocol for atomic broadcast — Zookeeper defines this as guaranteed delivery of a message to all nodes (e.g. a write message from the leader)
      • ZAB (Zookeeper Atomic Broadcast) seems to be essentially a variant of the two-phase commit protocol. An article explaining it further is here.
  • VoltDB is an in-memory, fault-tolerant, ACID-compliant, horizontally scalable system. Some of its architecture choices are described here. The key ideas are
    • Single-threaded operations
    • No client-controlled transactions; instead, each stored procedure runs in a transaction.
    • Partitioning and a shared-nothing cluster (with replication for fault tolerance)
    • Leveraging/supporting deterministic operations: instead of writing at a master and propagating changes to all nodes containing replicas, each node that has a replica executes the same transaction in parallel.
  • VoltDB is based on this paper, which analyzes the operations involved in OLTP systems; its key observations are
    • buffer management, locking, logging and latching are the costliest operations in OLTP….unless one strips out all of these components, the performance of a main memory-optimized database is unlikely to be much better than a conventional database where most of the data fit into RAM.
    • In a fully stripped down system — e.g., that is single threaded, implements recovery via copying state from other nodes in the network, fits in memory, and uses reduced functionality transactions — the performance is orders of magnitude better than an unmodified system.
  • VoltDB uses Zookeeper for leader election. More here.
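To illustrate the first of the paper’s ideas, here is a rough sketch of the classic “lock without herd effect” recipe built on those base primitives using the ZooKeeper Java client; the lock path is illustrative and error handling is omitted, so treat it as a sketch rather than production code.

import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleLock {

    private final ZooKeeper zk;
    private final String lockRoot = "/locks/resource-1"; // assumed to already exist
    private String myNode;

    SimpleLock(ZooKeeper zk) {
        this.zk = zk;
    }

    void lock() throws KeeperException, InterruptedException {
        // 1. Create an ephemeral, sequential child: ordering comes from the server.
        myNode = zk.create(lockRoot + "/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        while (true) {
            List<String> children = zk.getChildren(lockRoot, false);
            Collections.sort(children);
            String mine = myNode.substring(lockRoot.length() + 1);
            int position = children.indexOf(mine);
            if (position == 0) {
                return; // lowest sequence number holds the lock
            }
            // 2. Watch only the immediate predecessor -> no herd effect on release.
            String predecessor = lockRoot + "/" + children.get(position - 1);
            CountDownLatch gone = new CountDownLatch(1);
            if (zk.exists(predecessor, event -> gone.countDown()) != null) {
                gone.await();
            }
        }
    }

    void unlock() throws KeeperException, InterruptedException {
        zk.delete(myNode, -1); // -1 matches any version; the ephemeral node also dies with the session
    }
}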
Some more notes on VoltDB:

Graph processing in query languages – Part III using Hive and HBase

Approach

  • HBase is great at handling a large number of (on-demand) columns, random reads and writes, and maintaining history.
  • Against each node (row key), treat every neighbor as a column with a fixed value of 1 (this can be extended for weighted graphs). Include a self loop.
  • For each node, pick the least neighbor and insert into another HBase table holding just the node and a component-id (the least neighbor); a seeding sketch follows this list.
  • Use joins (based on GIMV) to refine the component-ids and iterate till convergence is achieved. (In several ways the join is also similar to the BSP approach in Pregel/Giraph.)
  • Use Hive as the language for the above steps.
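Below is a rough sketch of the seeding step (bullets two and three) using the HBase Java client. The table names, the column family “n” and the “cid” qualifier are assumptions for this illustration; the iterative Hive joins are in the linked solution.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SeedComponents {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        byte[] family = Bytes.toBytes("n");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table graph = conn.getTable(TableName.valueOf("graph"));            // node -> neighbor columns (value 1)
             Table components = conn.getTable(TableName.valueOf("components"));  // node -> component-id
             ResultScanner scanner = graph.getScanner(new Scan().addFamily(family))) {

            for (Result row : scanner) {
                String node = Bytes.toString(row.getRow());
                // Neighbor columns include the self loop, so the minimum is well defined.
                String componentId = node;
                for (byte[] qualifier : row.getFamilyMap(family).keySet()) {
                    String neighbor = Bytes.toString(qualifier);
                    if (neighbor.compareTo(componentId) < 0) componentId = neighbor;
                }
                Put put = new Put(row.getRow());
                put.addColumn(family, Bytes.toBytes("cid"), Bytes.toBytes(componentId));
                components.put(put);
            }
        }
    }
}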

The solution is here.

Testing

  • An illustrative log on a small graph is here; its summary is here.
  • Benchmarking has not been done.

Features

  • Incremental processing of graphs is feasible
  • Should theoretically scale for web scale graphs – sparse or dense.
  • Though based on GIMV, matrix-vector multiplication isn’t needed. GIMV is a very useful base for modelling solutions like the above to other graph problems that it can solve.

References

  • Illustration: Gimv hbase

Clone all github and bitbucket repos in one go

Clone all your github repos in one command

curl -u <<username>> -s https://api.github.com/users/<<username>>/repos?per_page=200 | ruby -rubygems -e 'require "json"; JSON.load(STDIN.read).each { |repo| %x[git clone #{repo["ssh_url"]} ]}'

Clone all your bitbucket repos in one command

#!/bin/bash
#script to get all repositories under a user from bitbucket
#Usage: getAllRepos.sh [username]

curl -u ${1} https://api.bitbucket.org/1.0/users/${1} > repoinfo
for repo_name in `grep -oE '\"name\": "[^"]*"' repoinfo | cut -f4 -d\"`
do
 git clone git@bitbucket.org:${1}/$repo_name.git
done

References:

  1. http://haroldsoh.com/2011/10/07/clone-all-repos-from-a-bitbucket-source/
  2. https://gist.github.com/caniszczyk/3856584

Graph processing in query languages using UDAFs

Introduction

Processing graphs is a fundamental problem area that has applications in several domains, e.g. networks. One of the building-block problems in graph processing is “finding connected components” (sets of nodes that are all reachable from one another).

It is not straightforward to express graph problems as structured queries unless the database itself exposes primitives for graph processing; graph-oriented databases like Neo4j might do this.

In this article I will attempt to show that graph processing problems can be expressed as queries as long as the database (or data warehouse) allows extending its core functionality through a UDAF (User Defined Aggregation Function).

I will use Hadoop/Hive as the data warehouse system and “finding connected components” as the illustrative graph problem. I will try to build code, examine results, make observations for future work and list a few optimizations.

Problem Definition

Given the following edges in the graph

N1,N3
N1,N5
N2,N1
N2,N6
N7,N8
N10,N11
N20,N4
N9,N4

Find connected components as below in SQL

[["N1","N3","N5","N2","N6"],["N7","N8"],["N10","N11"],["N20","N4","N9"]]

Related Topics

  1. Union Find and Percolation
  2. Transitive closures in SQL
  3. WITH RECURSIVE in SQL

Solution Strategy

  1. Hadoop-Hive ecosystem allows for distributed processing of data using SQL-like language – HQL.
  2. Transitive closures of unlimited depth aren’t possible in SQL directly, but
  3. UDAFs allow edges to be processed row by row, accumulating partial results and merging them later to generate the final result (a skeleton sketch follows this list).
  4. QuickUnion, WeightedQuickUnion and WeightedQuickUnion with path compression are algorithms that can help identify connected components faster.
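For orientation, here is a skeleton sketch of this strategy assuming Hive’s older UDAF/UDAFEvaluator bridge and plain Strings in place of Text/ObjectInspector handling; the linked ConnectedComponents.java below is the actual implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class ComponentsSketch extends UDAF {

    public static class Evaluator implements UDAFEvaluator {

        // node -> members of the component it currently belongs to
        private Map<String, List<String>> components = new HashMap<>();

        public void init() {
            components = new HashMap<>();
        }

        // Called once per edge row: merge the components of the two endpoints.
        public boolean iterate(String a, String b) {
            if (a != null && b != null) union(a, b);
            return true;
        }

        // Partial result shipped from the map side to the reduce side.
        public Map<String, List<String>> terminatePartial() {
            return components;
        }

        // Replays every membership of the other partial result as an edge.
        public boolean merge(Map<String, List<String>> other) {
            if (other != null) {
                for (Map.Entry<String, List<String>> e : other.entrySet()) {
                    for (String member : e.getValue()) union(e.getKey(), member);
                }
            }
            return true;
        }

        // Final answer: each distinct component listed once.
        public List<List<String>> terminate() {
            return new ArrayList<>(new LinkedHashSet<>(components.values()));
        }

        private void union(String a, String b) {
            List<String> ca = component(a);
            List<String> cb = component(b);
            if (ca == cb) return;                 // already connected
            ca.addAll(cb);                        // naive merge; see Optimizations below
            for (String member : cb) components.put(member, ca);
        }

        private List<String> component(String node) {
            List<String> c = components.get(node);
            if (c == null) {
                c = new ArrayList<>();
                c.add(node);
                components.put(node, c);
            }
            return c;
        }
    }
}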

Solution and Result

UDAF : List<List<Text>> components(column1, column2)

Query: "Select components(node1, node2) from edges"

Code and log

  1. ConnectedComponents.java
  2. results.log

Code and log with optimization 1 (use Maps instead of Lists to speed up “contains” checks and avoid nested loops)

  1. ConnectedComponents.java
  2. results.log

Observations for future work

Related problems that can be solved using the above UDAF

Using UDAFs for graph processing as above allows SQL to be leveraged for generic graph problems, e.g.:

  1. Finding whether any two nodes, say N3 and N6 in the above case, are connected
  2. Finding whether the graph is fully connected
  3. Finding paths from one node to another, say N3 to N6
  4. Finding the component of largest size
  5. Incrementally solving any of the above problems, i.e. as new edges get added, solving them by looking only at the new edges – say the edges table is partitioned by day and the above problems need to be solved on a daily basis
  6. Build a library of graph processing UDAFs

Other possibilities

  1. Transitive closure problems in SQL can be solved using this approach, e.g. finding the top manager for a given employee, or finding the oldest ancestor.

Optimizations

  1. Use Maps instead of Lists to speed up “contains” checks and avoid nested loops. This would give a result like this:
    {"N9":["N20","N4","N9"],"N5":["N1","N3","N5","N2","N6"],"N6":["N1","N3","N5","N2","N6"],"N7":["N7","N8"],"N8":["N7","N8"],"N1":["N1","N3","N5","N2","N6"],"N2":["N1","N3","N5","N2","N6"],"N3":["N1","N3","N5","N2","N6"],"N20":["N20","N4","N9"],"N4":["N20","N4","N9"],"N10":["N10","N11"],"N11":["N10","N11"]}
    The final result can still be presented as a List<List<Text>> as above, or in some other form, by modifying the final terminate() method.
  2. Use WeightedQuickUnion with path compression until it is time to terminate partial processing, and then convert to a Map<Text, List<Text>> (a sketch follows this list).
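A compact sketch of optimization 2, with illustrative names (String stands in for Text):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WeightedQuickUnionSketch {

    private final Map<String, String> parent = new HashMap<>();
    private final Map<String, Integer> size = new HashMap<>();

    // Finds the root of a node's tree, compressing the path along the way.
    public String find(String node) {
        parent.putIfAbsent(node, node);
        size.putIfAbsent(node, 1);
        String root = node;
        while (!parent.get(root).equals(root)) root = parent.get(root);
        String current = node;
        while (!current.equals(root)) {          // path compression
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // Weighted union: attach the smaller tree under the larger one.
    public void union(String a, String b) {
        String ra = find(a), rb = find(b);
        if (ra.equals(rb)) return;
        if (size.get(ra) < size.get(rb)) { String t = ra; ra = rb; rb = t; }
        parent.put(rb, ra);
        size.put(ra, size.get(ra) + size.get(rb));
    }

    // Expansion to component lists, done only when partial processing terminates
    // (i.e. in terminatePartial()/terminate()).
    public Map<String, List<String>> components() {
        Map<String, List<String>> byRoot = new HashMap<>();
        for (String node : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(node), k -> new ArrayList<>()).add(node);
        }
        return byRoot; // keyed by each component's root; re-key by every member if needed
    }
}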

Conclusion

UDAFs can be leveraged to build graph processing primitives in any database or data warehouse that supports UDAFs.

Some databases that support UDAFs are

  1. MS SQL Server 2014
  2. Oracle 11g
  3. PostgreSQL

References

  1. Union Find algorithms – http://algs4.cs.princeton.edu/15uf/
  2. Transitive Closure – http://en.wikipedia.org/wiki/Transitive_closure
  3. Aggregation functions in Oracle – http://docs.oracle.com/cd/B28359_01/appdev.111/b28425/aggr_functions.htm
  4. Aggregation functions in MSSQL – http://msdn.microsoft.com/en-us/library/ms131051.aspx
  5. Aggregation functions in PostgreSQL – http://www.postgresql.org/docs/9.1/static/xaggr.html
  6. Neo4J DFS – StackOverflow question – http://stackoverflow.com/questions/18112240/searching-for-strongly-connected-components-using-neo4j

Data aware Data Management System.

[IDEA2]: Data aware Data Management Systems – borrowing and building on ideas implemented in OLTP, OLAP and Big Data systems and a wide variety of other systems, including distributed systems (e.g. git) – developing a specification for a uniform client interface to “DATA” irrespective of its size, nature and structure.

For example, providing a uniform SQL interface to data irrespective of whether the data is 1 KB, 1 GB or 1 TB, whether it is structured, semi-structured or unstructured, and whether it is string, number, XML or binary content.

Requirements

  1. The system should be aware of the data that it holds, e.g. it should be able to adjust its physical schema and/or file system over a period of time as the size, nature and structure change, while ensuring that the client interface and the client’s performance and scalability remain unaffected.
  2. The system should be able to work both in an auto-magic mode, which would need minimal administration, and in a recommendation-engine mode, which would generate and send out notifications at appropriate milestone events and allow the administrator or user to take action.
  3. The system should follow convention over configuration, yet be fully configurable.

Design Notes

  1. Cloud would be one of the important strategies to address the problem.
  2. A set of advanced tools which can monitor and scale up/down the cloud infrastructure automatically or programmatically would be required.
  3. A software platform layer can be built above these two layers (cloud and automation tools) which can automatically instruct the tools to scale the cloud infrastructure up/down, then migrate/set up the physical schema, and later choose the appropriate category of algorithms to work with the current form of the physical schema.

OpenSourceIdeas.com

OpenSourceIdeas.com – similar to stackoverflow.com, a website to capture ideas succinctly, with features to share, rate and discuss them. There would be only two guidelines: genuine content, and freedom for everyone to use, expand, implement and/or do anything they want with a published idea. Patents, if any, that flow from such work need not be GPL-licensed, but it would be great if they were.

Open Source Ideas

  1. Open Source Ideas.com
  2. Data Aware Data Management System
  3. Future Forward – How about a reality show where we bring together aspiring politicians. They showcase their daily lives as activists. The winner will be sponsored to run for elections.
  4. JVM Simulator – Similar to GnuSim for 8085 microprocessor, create a Simulator for Java Virtual Machine for its instruction set.
  5. JVM Hardware – Building hardware – a microprocessor – whose instruction set is the same as the JVM instruction set, thus allowing a JVM OS.
  6. JVM OS – OS which works with JVM Hardware
  7. Jad Line Navigation – Jad-decompiled sources cannot be used to locate the exceptions reported in a stack trace, since the line numbers in the stack trace are from the original source files while the decompiled sources are not in sync with the originals with respect to line numbers. If the stack trace can find the original source location (line number) from class files alone, then Jad should also be able to do that.
  8. Jad Program Conversion Suite – Once compiled into classes – byte code – Jad can be used to decompile to any other JVM based language and thus convert a java program into any other JVM language. http://en.wikipedia.org/wiki/List_of_JVM_languages
  9. Ozone Predictor – Using statistical modelling based on a number of factors, visually predict when the ozone hole can be replenished.
  10. Account App – A mobile phone app to allow access to your basic account information along with bill details. It should suggest best plans looking at usage. A platform to cross sell/upsell services.
  11. Requirements are mapped to APIs and that’s how specifications are developed
  12. VideoDiary – Daily one video at the end of the day.
  13. Hudson how tos.
  14. Maven how tos.
  15. SVN how tos