Wednesday, June 24, 2009

HBase 0.20

My HBase colleagues and I have been working hard on the next generation of HBase - the upcoming 0.20 release! We recently published an alpha release and are working on stabilizing and testing for the final 0.20 release candidate.

The release sports many new and cool features:
- New file format offers improved efficiency and features
- New read paths are faster and less memory intensive
- ZooKeeper integration - allows multiple masters
- New API (a quick sketch below), better delete semantics, improved caching, and more!
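
For a taste of the new client API, here is a minimal sketch in Java of a put followed by a get, written against the 0.20 alpha as I recall it (the table and column names are made up, and you should double-check method names against the release you are actually running):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class NewApiSketch {
    public static void main(String[] args) throws Exception {
        // Assumes "mytable" with a column family "f" already exists.
        HTable table = new HTable(new HBaseConfiguration(), "mytable");

        // Write one cell.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        table.put(put);

        // Read it back.
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("f"), Bytes.toBytes("q"))));
    }
}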

In addition to code, I've also taken a bit of time out to speak at a recent conference here in SF. There were talks from some other great projects as well, all videotaped and uploaded. Check it out - lots of smart people talking about some fascinating subjects.

Wednesday, May 6, 2009

HBase update and 0.20

Thanks to a long list of contributions at many levels in HBase, the Hadoop Project Management Committee has voted to approve my committer status. This reflects important contributions to making HBase 0.20 stable and performant.

In the meantime, HBase 0.20 is on track. Recent changes involved porting HBase to use Hadoop 0.20 and introducing the real-time LZO compression library. Initial tests indicate that using LZO instead of GZ can nearly double read and write speed.

But why use compression? Surely the extra time to actually compress and decompress all records would make things too slow? With the "real time" LZO compression library, one can achieve reasonable compression at a very low latency/CPU cost. The benefits include more efficient use of bandwidth and disk. Instead of reading a 64k block, the compressed version of that block might be only 30k for the same data (roughly the 2.1:1 compression ratio I see in my own data with LZO). Now your reads from HDFS are twice as fast, and you can store double the amount of data.
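
Compression is chosen per column family. A rough sketch of what that looks like with the 0.20-style descriptor API follows - treat the exact class and method names as approximate, since this API has been shifting during 0.20 development, and note that the LZO native libraries have to be installed on every region server:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class LzoTableSketch {
    public static void main(String[] args) throws Exception {
        HTableDescriptor table = new HTableDescriptor("mytable");
        HColumnDescriptor family = new HColumnDescriptor("data");
        // LZO is set per family; every block of this family's store files is compressed.
        family.setCompressionType(Compression.Algorithm.LZO);
        table.addFamily(family);
        new HBaseAdmin(new HBaseConfiguration()).createTable(table);
    }
}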

The next steps in HBase 0.20 are a series of fundamental changes to how deletes are stored in the store files (HFile) and to how reads and scanners work. These should improve performance via better algorithms and more efficient processing of insert and delete entries across multiple HFiles.
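
To make the delete-handling problem concrete, here is a tiny conceptual illustration (my own sketch, not HBase code) of what a scanner has to do once puts and delete markers for the same key can live in different HFiles: walk the merged, sorted stream in order, and let a delete "tombstone" mask any older puts for that key.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TombstoneScanSketch {
    // One entry from the merged scan: either a put or a delete marker.
    static class Entry {
        final String key; final long ts; final boolean delete; final String value;
        Entry(String key, long ts, boolean delete, String value) {
            this.key = key; this.ts = ts; this.delete = delete; this.value = value;
        }
    }

    // Assumes the merged input is sorted by key ascending, then timestamp descending,
    // so a delete marker is seen before any older puts it should mask.
    static List<String> visible(List<Entry> merged) {
        Map<String, Long> tombstones = new HashMap<String, Long>();
        List<String> out = new ArrayList<String>();
        for (Entry e : merged) {
            if (e.delete) {
                // The newest delete for this key comes first thanks to the sort order.
                if (!tombstones.containsKey(e.key)) tombstones.put(e.key, e.ts);
            } else {
                Long del = tombstones.get(e.key);
                if (del == null || e.ts > del) out.add(e.key + "=" + e.value); // not masked
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> merged = new ArrayList<Entry>();
        merged.add(new Entry("row1", 200, true, null));    // delete marker from a newer HFile
        merged.add(new Entry("row1", 100, false, "old"));  // masked - this put is older than the delete
        merged.add(new Entry("row2", 150, false, "kept")); // no tombstone, stays visible
        System.out.println(visible(merged));               // prints [row2=kept]
    }
}

The hard part is doing this kind of merge cheaply when the entries are spread across several HFiles at once - that is where the better algorithms come in.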

In the meantime, HBase trunk works and is stable. If you have a data set of small values, or you need more speed, give it a shot. It's not release quality, and future commits may invalidate your entire installation, but I have loaded hundreds of gigs of data and successfully read it out again. Maybe it will work for you?

Friday, March 27, 2009

C - A New Golden Age

I've been noticing a new trend lately - more people seem to be eschewing C++ and moving to C. There seem to be several major reasons why this is happening, and it's an interesting story in the evolution of programming.

One major reason is simplicity - C is substantially simpler than C++. Furthermore, given C++'s level of complexity, including its ABI issues, there are very real arguments as to whether that complexity is worth the benefits. At least one start-up I know has made it a policy to use C over C++ for their real-time systems. Given the need for memory-intensive, CPU-bound processes, one is often forced to use non-VM languages like C or C++. Given a practical choice between the two, the former can be the better option!

The other reason is interoperability - C has a standardized application binary interface (ABI) that is the system-default binary interface on Unices. Furthermore, C's ABI does not change, and will not change. Compare and contrast with C++, which does not have cross-vendor compatibility (though that is apparently coming soon), and may not even be compatible across point releases (GCC 4.1 is not compatible with GCC 4.2, for example).

As a corollary to the last note, many people are opting to use more 'scripting' or higher-level languages - such as Python, Ruby, Lisp or OCaml. All of these languages natively interface directly with C, but not always with C++. If they do support C++, it is only a subset - no exceptions, for example - thus forcing one to wrap C++ libraries to play nice. One can write the memory-intensive, CPU-bound parts in C, and write the higher-level logic in Python, thus getting the benefits of both worlds when necessary.

It seems like now is the time for a new Renaissance of C - there are many libraries out there which provide highly advanced functionality wrapped up in C (e.g.: gtk+, gnome, libapr). Basing your next performance-sensitive project on C may be the seeming anachronism your team needs to escape C++ compile hell.

Friday, March 13, 2009

HBase 0.20 Dev Preview

As a major contributor to HBase, I have had the privilege of seeing some of the new features of the next release in action before anyone else, in part because I developed them. I can say that 0.20 is shaping up to be an amazing release that will wow many people (I hope!).

In my last blog post I went into the history of how a new file format came about. Well, as we stand a few weeks later, history is now fact, and HBase 0.20 is now based on HFile (H for HBase - they rejected RFile). With a minimal integration, HFile provides a 5x baseline speedup compared to HBase 0.19.x. This is the simplest possible integration - not changing how HBase stores keys, or anything else. Additional performance improvements must come at the cost of more intrusive fixes.

We estimate there is another 5x speed improvement to be had by doing more intrusive fixes - the first of which is to change the way HBase stores keys in the store. While doing that we pave the way to avoiding object allocation and copying bytes around. This also involves turning lots of functions from taking byte[] to taking byte[],int,int. It also means that instead of copying bytes in and out of new objects, often just to compare two keys, we create and use zero-copy pointers into the existing arrays.
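
As an illustration of the byte[],int,int style (my own sketch, not the actual HBase comparator), comparing two keys in place looks like this - no objects allocated, no bytes copied:

public class KeyCompareSketch {
    /** Compare two keys directly inside their backing arrays, as unsigned bytes. */
    public static int compareKeys(byte[] left, int loffset, int llength,
                                  byte[] right, int roffset, int rlength) {
        int len = Math.min(llength, rlength);
        for (int i = 0; i < len; i++) {
            int a = left[loffset + i] & 0xff;
            int b = right[roffset + i] & 0xff;
            if (a != b) return a - b;
        }
        return llength - rlength;  // when one key is a prefix of the other, the shorter sorts first
    }
}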

This also preps HBase for a new RPC that pushes the serialization out to the client. The client will eventually serialize keys and values into a format that will allow the server to shuffle bytes directly in and out of HFiles.

This exercise has also been about how to make Java CPU- and memory-efficient. There have been many lessons on how exactly Java uses memory, what the per-object overheads are, hidden APIs (look at java.lang.instrument), and just general bad-ass Java programming. I encourage anyone who has an interest in fast and efficient Java programming to visit us on irc.freenode.net in channel #hbase.
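
As one example of the java.lang.instrument trick: with a tiny agent you can ask the JVM directly how big an object really is. This is my own minimal sketch (the agent jar needs a "Premain-Class: SizeOfAgent" manifest entry and is attached with -javaagent; the names are mine):

import java.lang.instrument.Instrumentation;

public class SizeOfAgent {
    private static volatile Instrumentation inst;

    // Called by the JVM before main() when the jar is attached with -javaagent.
    public static void premain(String agentArgs, Instrumentation instrumentation) {
        inst = instrumentation;
    }

    /** Shallow size of a single object in bytes (header + fields + padding). */
    public static long sizeOf(Object o) {
        if (inst == null) throw new IllegalStateException("agent not attached");
        return inst.getObjectSize(o);
    }
}

Pointing this at a bare Object or an empty byte[] is a quick way to see the per-object header overhead you pay before you have stored a single byte of actual data.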

Friday, February 13, 2009

Scalability: What are YOU doing about it?

I've been working on HBase lately. The system is cool, integrates nicely with Hadoop and HDFS and has an awesome community. But there is one elephant in the room: raw seek and read speed of HBase.

Due to the speed issue, HBase has been ruled "interesting, but too slow" in some blog posts. Since serving a website directly out of HBase is on the to-do list for many users (including me), I took some time out to do something about it.

At the recent HBase LA Hackathon, a number of core system implementers and committers met and hung out with existing and new users. On the agenda was planning the next 0.20 release.

It became apparent that there were two massively needed features:

  • ZooKeeper integration for reliability

  • Faster HBase for website serving



While both of these are very interesting to me, a faster HBase was the more critical need for me. I sat with Stack, one of the lead developers, to do some performance testing and profiling on a new file format.

When retrieving data from disk, the file format is a performance linchpin. It provides the mechanisms for turning blobs of data on disk back into the structured data the clients expect.

We were evaluating a newly proposed file format (known as TFile), and while the raw numbers on the file format seemed good, profiling exposed some worrisome trends. The details are too involved to go into here, but my intuition is that the layered-stream micro-architecture of the new file format was not taking advantage of explicit block reads, and was instead relying on the layers of streams to do the block reading. The problem is that this becomes hard to control, and that loss of control reduces performance in this case.

Furthermore, our tests were being done on local disk, which has the benefit of OS block caching. When you move to HDFS, things change - reads become expensive network RPCs, and you want to cache as much as possible as well as retrieve as much as possible at once.
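
The "retrieve as much as possible at once" part amounts to this kind of read pattern against HDFS - one positioned read per block instead of many small stream reads. This is my own sketch of the idea, not HFile's actual read path (a real reader would keep the stream open rather than reopening it per block):

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReadSketch {
    // Pull one whole on-disk block across the network in a single positioned read.
    public static ByteBuffer readBlock(FileSystem fs, Path file,
                                       long blockOffset, int blockSize) throws IOException {
        byte[] buf = new byte[blockSize];
        FSDataInputStream in = fs.open(file);
        try {
            in.readFully(blockOffset, buf, 0, blockSize);
        } finally {
            in.close();
        }
        return ByteBuffer.wrap(buf);  // later lookups slice into this buffer without re-copying
    }
}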

So we worked in parallel: Stack went forward and ripped the stream layers out of TFile, and I decided to start anew and write a new block-oriented file format. Once both reached a sufficient state, Stack ran performance tests, and the results were conclusive: the new format, dubbed HFile, is vastly superior to the existing format (MapFile).

My new file format, HFile (previously RFile), is block oriented (you read a block-sized chunk at a time) and has an explicit block cache behind an interface. On the read path, all data is stored in and referred to via ByteBuffers, reducing unnecessary duplication. The streaming features of other formats were removed, since with HBase it is necessary to hold the key/value in memory at least once during the write anyway (in the so-called "memcache"). Reducing that complexity while explicitly managing blocks has resulted in a file format that is faster, reduces memory overhead (see: ByteBuffer), and improves in-memory block caching all in one go.
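
The "explicit block cache behind an interface" bit boils down to something along these lines - again my own sketch of the shape, not the interface as it actually landed:

import java.nio.ByteBuffer;

// The reader asks the cache first; only on a miss does it go to HDFS, and the
// freshly read block is handed back to the cache for next time.
public interface BlockCache {
    ByteBuffer getBlock(String blockName);             // null if the block is not resident
    void cacheBlock(String blockName, ByteBuffer block);
}

Every hit on that cache is a network round trip to HDFS that never has to happen.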

The current status is that HFile is on track to be the new official file format for HBase 0.20. Development is active, and while there is nothing for people to use yet, you can see the glorious details at github: http://github.com/ryanobjc/hbase-rfile/commits/rfile

Join us on IRC as well - #hbase on freenode.

Friday, January 9, 2009

Performance of HBase importing

My latest project has been to import data into HBase. The data profile is very simple: 6 integers, serialized using Thrift, with a natural key made up of 3 of those 6 integers.

The table design is simple: a single column family, with a row key of "%s:%s:%s" - the stringified version of the three business-key integers. The stored data is a simple 6-integer Thrift struct serialized using the binary serialization. The data source is flat files.

Loading data is not entirely trivial, since there are a number of options. The native HBase client is in Java and is fairly thick and intelligent. Since our data is stored across multiple servers, the client has to do the work of finding the right server and then talking to it.

If you are not interested in using the Java client, the alternative is to use the Thrift gateway. This is a Java Thrift server that provides a Thrift API and uses the HBase client under the hood to talk to the HBase cluster and get things done.

So my initial attempt was a Python client that talks via the Thrift gateway. This allowed me to load about 20 million records in 6 hours.

My next step was to interface directly with the HBase API using Jython. Jython is only up to Python 2.2 compatibility, but a 2.5 release is in the works. The roadmap talks about 3.0, which is nice.

My first attempt at a Jython/HBase/JVM program netted me a 5x speedup, allowing me to load 20 million records in 70 minutes. I was getting about 4 inserts per millisecond. This scales with multiple writers as well.

Thanks to a hint from Stack on IRC, I added these lines of code:

table.setAutoFlush(False)               # buffer writes client-side instead of sending one RPC per row
table.setWriteBufferSize(1024*1024*12)  # flush the buffer to the servers every 12 MB of queued writes

And now I've achieved another speedup. Currently, with these lines, I can import between 30-50 rows per ms. This allows me to import 20 million records in 12 minutes! It also scales to at least 6 parallel writers - meaning I can import 180-300 rows per ms. That is per millisecond, aka 1/1000th of a second.

This represents a net 30x speedup from using the Thrift Gateway to using the native HBase API and tweaking buffering settings.
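
For the curious, the equivalent loop written directly in Java looks roughly like this. It is a sketch from memory against the 0.19-era client (BatchUpdate is the write call of that API generation, and Record/readRecords are stand-ins for my flat-file/Thrift plumbing), so double-check method names against your version:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.util.Bytes;

public class ImportSketch {
    // Stand-ins for the real importer's flat-file/Thrift plumbing.
    static class Record {
        String rowKey() { return "1:2:3"; }
        byte[] thriftBytes() { return new byte[0]; }
    }
    static Iterable<Record> readRecords(String path) { return new java.util.ArrayList<Record>(); }

    public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "mytable");
        table.setAutoFlush(false);                  // queue writes client-side
        table.setWriteBufferSize(1024 * 1024 * 12); // send a batch roughly every 12 MB

        for (Record r : readRecords(args[0])) {
            BatchUpdate update = new BatchUpdate(r.rowKey());
            update.put(Bytes.toBytes("data:serialized"), r.thriftBytes());
            table.commit(update);                   // buffered locally, not an RPC per row
        }
        table.flushCommits();                       // push whatever is still sitting in the buffer
    }
}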