Wednesday, May 6, 2009

HBase update and 0.20

Thanks to a long list of contributions at many levels in HBase, the Hadoop Project Management Committee has voted to approve my committer status. This reflects important contributions to making HBase 0.20 stable and performant.

In the meantime, HBase 0.20 is on track. Recent changes include porting HBase to Hadoop 0.20 and introducing the real-time LZO compression library. Initial tests indicate that using LZO instead of GZ can nearly double read and write speed.

But why use compression? Surely the extra time spent compressing and decompressing every record would make things too slow? With the LZO "real-time" compression library, one can achieve reasonable compression at a very low latency and CPU cost. The benefits include more efficient use of bandwidth and disk. Instead of reading a 64k block, you might read 30k of compressed data that represents the same block (about the 2.1:1 compression ratio I see in my data with LZO). Reads from HDFS then move roughly half as many bytes, so they can be up to about twice as fast, and you can store about double the amount of data on the same disks.
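The arithmetic behind those numbers can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration, not HBase code: the function names and the 100 MB/s disk figure are made up for the example, and the estimate assumes reads are I/O bound and ignores decompression CPU time. Python's standard library has no LZO binding, so `zlib` at its fastest level stands in for a "fast" codec on a hypothetical, highly repetitive 64k block.

```python
import zlib

BLOCK = 64 * 1024  # uncompressed block size from the post (64k)


def compression_ratio(raw_bytes: int, compressed_bytes: int) -> float:
    """Uncompressed size divided by compressed size."""
    return raw_bytes / compressed_bytes


def effective_read_mb_per_s(disk_mb_per_s: float, ratio: float) -> float:
    """Logical data delivered per second if reads are purely I/O bound.

    Assumption: decompression cost is ignored, so this is an upper bound.
    """
    return disk_mb_per_s * ratio


# The figures from the post: a 64k block stored as 30k.
ratio = compression_ratio(BLOCK, 30 * 1024)
print(f"ratio: {ratio:.2f}:1")

# Hypothetical disk that streams 100 MB/s of compressed bytes.
print(f"effective read rate: {effective_read_mb_per_s(100.0, ratio):.0f} MB/s")

# zlib level 1 as a stand-in for a fast codec (NOT LZO) on made-up data.
sample = (b"row-0001/family:qualifier=value;" * 4096)[:BLOCK]
packed = zlib.compress(sample, 1)
print(f"zlib level 1: {len(sample)} -> {len(packed)} bytes")
```

Real table data will compress far less than the repetitive sample here, so the ratio worth trusting is the one measured on your own data.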

The next steps in HBase 0.20 are a series of fundamental changes to how deletes are stored in the store files (HFile), and to how reads and scanners work. These should improve performance through better algorithms and more efficient processing of insert and delete entries across multiple HFiles.

In the meantime, HBase trunk works and is stable. If your data set consists of small values or you need more speed, give it a shot. It's not release quality and future commits may invalidate your entire installation, but I have loaded hundreds of gigs of data and successfully read it out again. Maybe it will work for you?