As a major contributor to HBase, I have had the privilege of seeing some of the new features of the next release in action before anyone else, in part because I developed them. I can say that 0.20 is shaping up to be an amazing release that will wow many people (I hope!).
In my last blog post I went into the history of how a new file format came about. A few weeks later, history is now fact, and HBase 0.20 is based on HFile (H for HBase, they rejected RFile). With a minimal integration, HFile provides a 5x baseline speedup compared to HBase 0.19.x. This is the simplest integration - not changing how HBase stores keys, or doing anything else. Additional performance improvements must come at the cost of more intrusive fixes.
We estimate that there is another 5x speed improvement to be had from more intrusive fixes, the first of which is changing the way HBase stores keys in the store. Doing that paves the way to avoiding object allocation and copying bytes around. It also involves changing lots of functions from taking byte[] to taking byte[],int,int. Instead of copying bytes in and out of new objects, often just to compare 2 keys, we create and use 0-copy pointers into existing arrays.
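To make the array, offset, length idea concrete, here is a minimal sketch of comparing two keys that live inside one larger backing array without copying either of them out. This is not the actual HBase code: the class name KeyRanges and the method compare are made up for illustration, and the unsigned lexicographic ordering is my assumption about how keys should sort.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only, not an HBase class.
final class KeyRanges {
    private KeyRanges() {}

    /**
     * Compares two byte ranges lexicographically, treating bytes as unsigned.
     * Nothing is allocated or copied: each key is just (array, offset, length).
     */
    static int compare(byte[] a, int aOff, int aLen,
                       byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[aOff + i] & 0xff; // mask so 0x80..0xff sort after 0x7f
            int y = b[bOff + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal, so the shorter range sorts first.
        return aLen - bLen;
    }

    public static void main(String[] args) {
        // Two keys packed back to back in one array: [0,5) and [5,10).
        byte[] block = "row-arow-b".getBytes(StandardCharsets.US_ASCII);
        int cmp = compare(block, 0, 5, block, 5, 5);
        System.out.println(cmp < 0 ? "row-a sorts first" : "unexpected order");
    }
}
```

The point of the signature is that the caller never has to build a new byte[] per key just to ask which one sorts first. The offsets act as the 0-copy pointers into an array that already exists, such as a block read from disk.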
This also preps HBase for a new RPC that pushes the serialization out to the client. The client will eventually serialize keys and values into a format that will allow the server to shuffle bytes directly in and out of HFiles.
This exercise has also been about how to make Java CPU and memory efficient. There have been many lessons on exactly how Java uses memory, what the per-object overheads are, hidden APIs (look at java.lang.instrument) and just general bad-ass Java programming. I encourage anyone who has an interest in fast and efficient Java programming to visit us on irc.freenode.net channel #hbase.