Friday, March 27, 2009

C - A New Golden Age

I've been noticing a new trend lately - more people seem to be eschewing C++ and moving to C. There seem to be several major reasons why this is happening, and it's an interesting story in the evolution of programming.

One major reason is simplicity - C is substantially simpler than C++. Furthermore, given that level of complexity, including ABI issues, there are very real arguments as to whether the complexity is worth the benefits. At least one start-up I know has made it a policy to use C over C++ for their real-time systems. For memory-intensive, CPU-bound processes, one is often forced into non-VM languages like C and C++. Given the practical choice between the two, the former can be the better option!

The other reason is interoperability - C has a standardized application binary interface (ABI) that is the system-default binary interface on Unices. Furthermore, C's ABI does not change, and will not change. Compare and contrast with C++, which does not have cross-vendor ABI compatibility (coming soon, apparently) and may not even be compatible across point releases (GCC 4.1 is not compatible with GCC 4.2, for example).

As a corollary to the last point, many people are opting for 'scripting' or higher-level languages - such as Python, Ruby, Lisp or OCaml. All of these languages interface natively with C, but not always with C++. If they do support C++, they support only a subset - no exceptions, for example - forcing one to wrap C++ libraries to play nice. One can write the memory-intensive, CPU-bound parts in C and the higher-level logic in Python, getting the best of both worlds where it matters.

It seems like now is the time for a new Renaissance of C - there are many libraries out there which provide highly advanced functionality wrapped up in C (e.g. gtk+, gnome, libapr). Basing your next performance-sensitive project on C may be the seeming anachronism your team needs to escape C++ compile hell.

Friday, March 13, 2009

HBase 0.20 Dev Preview

As a major contributor to HBase, I have had the privilege of seeing some of the new features of the next release in action before anyone else, in part because I developed them. I can say that 0.20 is shaping up to be an amazing release that will wow many people (I hope!).

In my last blog post I went into the history of how a new file format came about. Well, as we stand a few weeks later, history is now fact, and HBase 0.20 is based on HFile (H for HBase; they rejected RFile). With a minimal integration, HFile provides a 5x baseline speedup compared to HBase 0.19.x. This is the simplest possible integration - it does not change how HBase stores keys or anything else. Additional performance improvements must come at the cost of more intrusive fixes.

We estimate there is another 5x speed improvement to be had from more intrusive fixes - the first of which is to change the way HBase stores keys in the store. While doing that, we pave the way to avoiding object allocation and copying bytes around. This involves turning lots of functions from taking byte[] to taking byte[], int, int. It also means that instead of copying bytes in and out of new objects, often just to compare two keys, we create and use zero-copy pointers into existing arrays.
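
To make the idea concrete, here is a minimal sketch of that style - not the actual HBase comparator, and the class name is mine - comparing two keys as zero-copy ranges over existing arrays:

    // Illustrative sketch only - not HBase code. Compares two key ranges
    // lexicographically without allocating objects or copying bytes.
    public final class ByteRangeComparator {
        public static int compare(byte[] b1, int off1, int len1,
                                  byte[] b2, int off2, int len2) {
            int n = Math.min(len1, len2);
            for (int i = 0; i < n; i++) {
                // Mask to 0xff so bytes compare as unsigned values.
                int a = b1[off1 + i] & 0xff;
                int b = b2[off2 + i] & 0xff;
                if (a != b) {
                    return a - b;
                }
            }
            // When one key is a prefix of the other, the shorter sorts first.
            return len1 - len2;
        }
    }

The old style would wrap each key in a fresh object (or copy it into a new byte[]) just to call compareTo(); the byte[], int, int style leaves the bytes where they already are.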

This also preps HBase for a new RPC that pushes the serialization out to the client. The client will eventually serialize keys and values into a format that will allow the server to shuffle bytes directly in and out of HFiles.
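
As a rough illustration of the direction - the real wire format is still being worked out, and this layout is purely hypothetical - the client could write each key/value pair length-prefixed so the server can pass the bytes through untouched:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    // Hypothetical sketch - not the HBase 0.20 wire format.
    public class ClientSideEncoder {
        // Encode one pair as [keyLen][key][valueLen][value] so the server
        // can shuffle the bytes straight into an HFile without re-parsing.
        public static byte[] encode(byte[] key, byte[] value) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(key.length);
            out.write(key);
            out.writeInt(value.length);
            out.write(value);
            return bos.toByteArray();
        }
    }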

This exercise has also been about how to make Java CPU and memory efficient. There have been many lessons on how exactly Java uses memory, what the per-object overheads are, hidden APIs (look at java.lang.instrument), and just general bad-ass Java programming. I encourage anyone with an interest in fast and efficient Java programming to visit us on irc.freenode.net in channel #hbase.
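
For the curious, here is a small sketch of the kind of thing java.lang.instrument lets you do - measuring the shallow size of objects through a premain agent. The class and jar names are my own, and the jar's manifest must declare the Premain-Class:

    import java.lang.instrument.Instrumentation;

    // Sketch of a javaagent exposing Instrumentation.getObjectSize().
    // Run with: java -javaagent:objectsizer.jar ...
    public class ObjectSizer {
        private static volatile Instrumentation inst;

        // Invoked by the JVM before main() because of the -javaagent flag.
        public static void premain(String args, Instrumentation instrumentation) {
            inst = instrumentation;
        }

        // Shallow size of one object, headers and padding included.
        public static long sizeOf(Object o) {
            return inst.getObjectSize(o);
        }

        public static void main(String[] args) {
            // A byte[10] typically reports well over 10 bytes once the object
            // header, length field and alignment padding are counted.
            System.out.println("byte[10]: " + sizeOf(new byte[10]) + " bytes");
            System.out.println("Integer:  " + sizeOf(Integer.valueOf(42)) + " bytes");
        }
    }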