As a java programmer, did you ever wonder if all the fuss about those mechanical sympathy principles is justified? De-structuring your objects and collections, use sun.misc.Unsafe to allocate memory off heap or to instantiate objects inside a byte array, pack related data items next to each other all to achieve spatial and temporal cache locality, is all this worth the effort?
When the requirement is to calculate statistical indicators over a metric in your data set, one effective optimization is to use a columnar store. So what what does this mean?
Suppose your data is modeled in the code snipped bellow:
As you ingest your data items, you store them in an ArrayList and if you need to calculate an average, you can do something like this:
Clean and efficient but let’s reason about what’s going on under the hood and more important, can we do better?
As you read your data from the external source your DataItem objects will be allocated wherever the JVM finds a place to fit them, in most cases, they will not all fit in a continuous memory chunk. In order to get to the metric1 double when calculating the average, the CPU has to read the pointer form the array list, figure out where it points, bring that cache line from main memory into the cache, then read the double value. Chances are the next DataItem instance you need will not be in the same cache line, so the CPU will go through the whole cycle again.
Intuitively, this does not seem very efficient. First, we have that indirection, because the array list contains pointers, not the actual data. Then the data is dispersed all over the heap, so there is no uniform access to it. Instead of getting the majority of it from L1 cache, the CPU has to go to main memory a lot.
Even if your DataItem objects would all be sequentially allocated in a continuous chunk of memory, when the CPU needs metric1, it will pull into the cache an entire set of DataItem instances, so the CPU will be able to only effectively use 1 every N doubles in the cache, so the information you need for your statistical indicator is still not very efficiently packed and you’ll still be hitting main memory a lot.
Main memory access can be 30 times more expensive than a L1 cache access in terms of CPU cycles. During all those wasted cycles spent moving data from cache to main memory, your CPU is not processing your data.
Can we do better? How about arranging the data in memory to make it more cache friendly? For example, we can try this:
Instead of packaging all elements of a record together into an object, we’re doing away with the DataItem class and instead store all values of a particular metric into a double array. And we have as many primitive arrays as we have metrics. It’s the most basic in-memory columnar store.
If we want to calculate the same average as before, we’d do something like this:
We notice that in this case all the data we need to perform the calculation on is stored in a continuous array of doubles — that is, a continuous memory chunk — and we access it in uniform strides, in a very predictable way.
When iterating over the data, the processor will fill in the cache, access a maximum of data by spending a minimum number of CPU cycles and only go to main memory again when all the data in the cache was consumed, so a lot less often than the previous case. All those CPU cycles spent moving data from main memory to cache, are now spent on calculations.
Plus, because we consistently access our data in uniform strides, the CPU will be able to efficiently pre-fetch the next data segment so that there’s a minimum waiting time when your algorithm hits the next cache miss.
Intuitively, the columnar approach seems be more efficient than the classical approach, but as an engineer, you should make decisions based on measurements not on wild guesses and intuition. So let’s measure.
Running the couple of JMH benchmarks at the end of this article will output the following results:
We notice the columnar approach measured in DoubleStreamAverageBenchmark yields almost double the sustained throughput than the traditional approach, measured ObjectStreamAverageBenchmark: 186 vs. 104 operations per second.
Naturally average execution time also come out in favor of the columnar approach: 0.005 operations per second compared to 0.013 operations per second for the traditional approach.
So there is a clear winner here: the columnar store. Now you can reproduce the benchmarks — see code bellow — go see your boss and tell him that cloud bill is going to be cut in half if they let you do that refactoring :-)
Another advantage of the columnar approach is that you’re not allocating N million objects any more, you’re allocating N arrays. That’s N million-N allocations the garbage collector does not have to track.
Of course it’s very unlikely to match these benchmark figures, it depends on your computer, on what else it’s going on in there, etc. But you’ll see the difference between the traditional approach and the new one.