My latest project has been to import data into HBase. The data profile is very simple: 6 integers, serialized using Thrift, with a natural key made up of 3 of those 6 integers.
The table design is simple: a single column family, with a row key of the form "%s:%s:%s", the stringified version of the business key integers. The stored value is a 6-integer Thrift struct serialized with the binary protocol. The data source is flat files.
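As a minimal sketch of the key format described above, the row key can be built like this. The function name make_row_key and the sample values are made up for illustration. One thing worth keeping in mind, assuming keys are compared as strings the way HBase compares row keys as bytes: stringified integers sort lexicographically, not numerically.

```python
def make_row_key(a, b, c):
    """Build the row key from the three business-key integers."""
    return "%s:%s:%s" % (a, b, c)


# Hypothetical sample values, only to show the shape of the key.
print(make_row_key(12, 7, 2009))

# String ordering differs from numeric ordering: "10:..." sorts before "2:...".
print(sorted([make_row_key(2, 0, 0), make_row_key(10, 0, 0)]))
```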
Loading data is not entirely trivial since there are a number of options. The native HBase client is written in Java and is fairly thick and intelligent. Since our data is spread across multiple servers, the client has to work out which server holds a given row and then talk to that server directly.
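To give a rough idea of what "finding which server" involves, here is a toy sketch in plain Python. It assumes the table is split into contiguous, sorted key ranges, each served by one server, and looks up the range containing a key with a binary search. The region boundaries and server names are invented, and this is not how the real client is implemented, only an illustration of the lookup idea.

```python
import bisect

# Hypothetical region map: each region starts at a key and lives on one server.
# The first region starts at the empty string, so every key falls somewhere.
REGION_START_KEYS = ["", "3:", "6:"]
REGION_SERVERS = ["server-a", "server-b", "server-c"]


def locate_server(row_key):
    # bisect_right returns the position of the first start key greater than
    # row_key; the region just before that position contains row_key.
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]


for key in ["1:2:3", "3:0:0", "9:9:9"]:
    print(key, "->", locate_server(key))
```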
If you are not interested in using the Java client, the other option is a Thrift gateway. This is a Java Thrift server that exposes a Thrift API and uses the HBase client internally to talk to the HBase cluster.
So my initial attempt was a Python client that talks via the Thrift gateway. This allowed me to load about 20 million records in 6 hours.
My next step was to interface directly with the HBase API using Jython. Jython is currently only compatible with CPython 2.2, but a 2.5 release is in the works. The roadmap talks about 3.0, which is nice.
My first attempt at a Jython/HBase/JVM program netted me a 5x speedup, allowing me to load 20 million records in 70 minutes. I was getting about 4 inserts per millisecond. This scales with multiple writers as well.
Thanks to a hint by Stack on IRC I added these lines of code:
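The closing remark about buffering settings suggests the change was client-side write buffering: instead of one round trip per row, rows are collected in memory and sent in batches once a size limit is reached. The HBase client of that period exposed setAutoFlush and setWriteBufferSize on HTable for this, though whether those were the exact lines is an assumption. The sketch below is plain Python that only illustrates the idea and does not touch the HBase API; BufferedWriter, send_batch and buffer_limit_bytes are hypothetical names.

```python
class BufferedWriter(object):
    """Collects rows client-side and hands them off in batches."""

    def __init__(self, send_batch, buffer_limit_bytes):
        self.send_batch = send_batch            # callable taking a list of rows
        self.buffer_limit_bytes = buffer_limit_bytes
        self.pending = []
        self.pending_bytes = 0

    def put(self, row_key, value):
        self.pending.append((row_key, value))
        self.pending_bytes += len(row_key) + len(value)
        # Only go to the "server" once enough data has accumulated.
        if self.pending_bytes >= self.buffer_limit_bytes:
            self.flush()

    def flush(self):
        if self.pending:
            self.send_batch(self.pending)
        self.pending = []
        self.pending_bytes = 0


# Stand-in for the network: record each batch that would have been sent.
batches = []
writer = BufferedWriter(batches.append, buffer_limit_bytes=64)
for i in range(10):
    writer.put("%s:%s:%s" % (i, i, i), b"x" * 24)
writer.flush()  # push out whatever is left in the buffer

print("rows written:", sum(len(batch) for batch in batches))
print("round trips: ", len(batches))
```

The final flush matters: without it, any rows still sitting in the buffer when the loader exits are never sent.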
And with that I got another speedup. With these lines I can import between 30 and 50 rows per ms, which lets me import 20 million records in 12 minutes! This also scales to at least 6 parallel writers, meaning I can import 180-300 rows per ms. That is per millisecond, i.e. 1/1000th of a second.
This represents a net 30x speedup, going from the Thrift gateway to the native HBase API with tuned buffering settings.
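The arithmetic behind these figures can be checked quickly. The snippet below uses only the numbers quoted in this post (20 million records; 6 hours, 70 minutes and 12 minutes) and computes average rows per millisecond and the resulting speedups. The helper name rows_per_ms is made up for the example.

```python
RECORDS = 20 * 1000 * 1000


def rows_per_ms(records, minutes):
    """Average throughput over a whole run, in rows per millisecond."""
    return records / (minutes * 60 * 1000.0)


thrift = rows_per_ms(RECORDS, 6 * 60)   # Python client via the Thrift gateway
jython = rows_per_ms(RECORDS, 70)       # Jython on the native HBase API
buffered = rows_per_ms(RECORDS, 12)     # native API plus buffering tweaks

print("Thrift gateway: %.2f rows/ms" % thrift)
print("Jython/native:  %.2f rows/ms" % jython)
print("Buffered:       %.2f rows/ms" % buffered)
print("Jython vs Thrift:   %.1fx" % (jython / thrift))
print("Buffered vs Thrift: %.1fx" % (buffered / thrift))
```

Averaged over the full 12-minute run this comes to roughly 28 rows per ms, a little under the 30-50 range seen while the import is running, and the 6 hour to 12 minute ratio gives the 30x figure.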