Don’t write your own persistence layer: why we chose RocksDB
Our use case
In a nutshell, Quasar helps you anticipate the unpredicted.
The idea is that if you want to go beyond trivial alerting, you need to work on massive amounts of data to look for precursors, hidden in weak signals. That means you must record absolutely everything going on in your business environment and keep it for as long as possible. That quickly goes to the petabyte scale!
Quasar is the software platform that gives you both volume (like a data warehouse) and low latency (like an operational database) thanks to significant innovations in indexing and compression.
Technically, that means Quasar is optimized for write-heavy scenarios and must be able to feed models at low latency while providing analytical capabilities to do trend analysis. From a data management point of view, this is squaring the circle.
As you can guess, since our #1 job is to write data as efficiently as possible, the persistence layer – the piece of software that writes the data to disk – of Quasar is an essential piece of the equation.
Yet, we opted not to write our own. Coming from a team that wrote its own serialization library using C++ template metaprogramming, that’s telling.
Surprised? Read on to find out why.
In the beginning…
When I wrote the prototype of Quasar, I hacked together a mundane persistence layer. Data was written into files of fixed size and never sorted. There was a crude lookup table to find the relevant piece of data, and that was it. Write performance was acceptable but read performance was not great, and there was no collection of dead items.
I happen to have written file system drivers and virtual memory managers, so I had an idea of what writing a persistence layer meant and what you could ask of the file system before killing performance. A file system is not a magic piece of code; it’s software with features, properties, and limits.
For example, you could decide to memory map tables and write them individually in separate files. However, you will quickly run into reliability and performance problems because the operating system’s paging algorithms are unsuitable for database workloads. (And the thing you think is stored on disk? Hope you like gambling.)
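To make the gambling remark concrete, here is a minimal sketch in Python (not Quasar code; the function name `write_mapped` and the 4096-byte table size are made up for illustration). Storing into a memory-mapped file only modifies pages in memory. The operating system decides when they reach the disk, unless you force the issue with an explicit flush.

```python
import mmap
import os
import tempfile


def write_mapped(path, payload, size=4096):
    """Write payload at the start of a fixed-size, memory-mapped file.

    Hypothetical helper for illustration only.
    """
    if len(payload) > size:
        raise ValueError("payload does not fit in the mapped table")
    # Create (or resize) the backing file to the fixed table size.
    with open(path, "wb") as f:
        f.truncate(size)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size) as m:
            # This store only dirties pages in memory; nothing guarantees
            # the bytes are on disk yet.
            m[:len(payload)] = payload
            # Ask the OS to write the dirty pages back (msync on POSIX).
            m.flush()
        # Also flush file data and metadata through the regular file API.
        os.fsync(f.fileno())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        table = os.path.join(d, "table.bin")
        write_mapped(table, b"hello")
        with open(table, "rb") as f:
            print(f.read(5))
```

Without the explicit flush, a crash between the store and the kernel's write-back can lose the update. Flushing after every write gives up most of the convenience that made memory mapping attractive in the first place.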
The other major difficulty with persistence is that you can’t YOLO your way around the corner cases. And they are legion.
Once I had most of the skeleton of the first alpha of Quasar, I realized that I would need to go beyond a trivial persistence layer. I was not super happy about it because most of the existing implementations I found were not great. And I didn’t want to write my own.
And then came LevelDB
As I was contemplating the absurdity of existence and considering a career in bakery, LevelDB popped out of nowhere. The guys at Google were, it seems, facing a problem similar to mine.
And it was written in C++, like Quasar! Integration was very smooth. It just lacked Windows support. Other than that, it was exactly what I was looking for: an embeddable key/value store to manage persistence. Very quickly, I hacked together a compatibility layer using Boost and submitted that upstream.
But wait. I didn’t tell you what LevelDB is! LevelDB is an embeddable key-value store that uses an LSM tree to store data.
There are two big families of data structures when it comes to disk persistence: the Log-Structured Merge tree (LSM) and the B-Tree. If you really want to make a caricature out of it, B-Trees are better at reading and not so great at writing. LSMs give you faster writes at the cost of more CPU usage and write amplification. In theory, an LSM is slower at lookup, but there are many mitigation techniques.
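The trade-off is easier to see in code. What follows is a toy, in-memory sketch of the LSM idea in Python. It is not how LevelDB or RocksDB are implemented, and the names `ToyLSM`, `memtable_limit` and `TOMBSTONE` are made up for illustration. Writes land in a small unsorted buffer, a full buffer is sorted once and frozen as an immutable run, and reads check the buffer and then every run from newest to oldest.

```python
import bisect

# Marker stored in place of a value to record a deletion (hypothetical name).
TOMBSTONE = object()


class ToyLSM:
    """Toy LSM: an in-memory buffer plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}             # recent writes, unsorted
        self.memtable_limit = memtable_limit
        self.runs = []                 # sorted runs, newest first

    def put(self, key, value):
        # A write only touches the in-memory buffer: this is the cheap part.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # Deletes are writes too: they shadow older values until compaction.
        self.put(key, TOMBSTONE)

    def flush(self):
        # Sort once, then freeze the buffer as an immutable run.
        if not self.memtable:
            return
        keys = sorted(self.memtable)
        values = [self.memtable[k] for k in keys]
        self.runs.insert(0, (keys, values))
        self.memtable = {}

    def get(self, key):
        # Reads pay for the cheap writes: several places to look.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for keys, values in self.runs:
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                value = values[i]
                return None if value is TOMBSTONE else value
        return None


if __name__ == "__main__":
    db = ToyLSM(memtable_limit=2)
    db.put("b", 1)
    db.put("a", 2)   # buffer is full: flushed as a sorted run
    db.put("a", 3)   # newer value shadows the one in the run
    print(db.get("a"), db.get("b"))
```

The more runs pile up, the more places a read has to check, which is why real implementations add mitigations such as Bloom filters and merge runs together in the background (compaction).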
Feature-wise, LevelDB was minimalistic, but it got the job done and fit nicely into the architecture. There was just one big problem: performance. It was much better than the pathetic excuse for a persistence layer that I wrote, but it was not great.
At first, I didn’t pay attention to it too much because we have our own caching layer that limited the impact, but for writing it was becoming a big problem. And Quasar is write-heavy. So performance must be excellent.
Fortunately, I got lucky a second time: as I was looking into that bottleneck, a new contender appeared…
In the nick of time: RocksDB
It seems that at the same time I was playing with LevelDB, a team at Facebook was doing the same, and they reached the same conclusions: “cool car bro, but could use a spoiler and nitro”.
The team at Facebook added more parallelism and more features. I ran a couple of quick benchmarks and the difference was drastic. And it even incorporated some of my suggestions for cross-platform support! So we made the switch to RocksDB without looking back.
So, happy ending, everything is now perfect?
Is RocksDB reliable?
You know, it really sucks when a customer has a cluster that crashes because of a library you use. They really don’t care that it’s some “open-source library you didn’t write”; it’s your job to make sure the software works.
RocksDB delighted us with a fair share of bugs, some of them very nasty (like this one), some milder, and some minor annoyances related to documentation that can sometimes be contradictory. As stated above, some of these bugs resulted in very difficult calls with customers.
If you followed this story from CockroachDB, you could believe RocksDB is a bug-ridden mess. The reality is more subtle than that. RocksDB is a complex piece of software packed with features, and using it properly takes time and knowledge.
Despite the pain we have endured, we understand that all software has bugs, and the speed at which the RocksDB team accepted our fixes gave us confidence.
As of 2021, despite server crashes, arbitrary kills, VM disappearances (true story), volumes getting dismounted, and whatever devops problem you can think of, the only data loss we have experienced was in-flight data.
And every time, the cluster could restart and access the data.
Should we write our own?
Sometimes I wonder if we could squeeze more performance or features if we had our own persistence stack.
I think it is possible to write a measurably faster embedded key-value store than RocksDB. If you model against the newest NVM disks and you bypass the filesystem altogether, I could see a speed boost. The other added benefit is that you could even increase the resilience further (no file system caching or any OS shenanigans).
This is however a massive project. A company in itself. And it’s uncertain it would make a significant difference from a user point of view. Very often we realize we can do a better job at performance by tuning RocksDB or using it better.
As of today, we know we have some performance headroom if we make better use of RocksDB’s compaction algorithms. And when we run out of options, nothing prevents us, thanks to RocksDB’s modular architecture, from writing our own compaction strategy, cache, or even block format.
If anything, I think any major breakthrough will come from data structures. I don’t think LSM is the ultimate data structure for disk persistence. I’m pretty sure we’ll come up with something better before the decade ends. Maybe a better collaboration between software and hardware?
We explored that with a company named Levyx that had made some interesting breakthroughs in that area, and I’m sure there’s more to do.
How to use RocksDB properly
My advice would be: use the defaults, use the basic features, and make sure you read the documentation (including the fine print). I repeat: don’t fiddle with the options unless you understand what they do.
If you don’t plan on storing a lot of data (the definition of “a lot of data” will be left as an exercise to the reader) in RocksDB, you don’t need to dig too deep into the documentation. However, if your usage gets intensive, it’s imperative you understand how an LSM and compaction work.
Most of the rants I have seen around RocksDB come from unrealistic expectations of what an embedded key-value store can do. At some point, you need to spend CPU to sort the data so it can be found quickly later. And yes, it’s slower than raw disk writes.
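To give an idea of where that CPU goes, here is a minimal Python sketch of the core of a compaction step. It is not RocksDB's actual algorithm, and the names `compact` and `TOMBSTONE` are made up for illustration. Several sorted runs are merged into one, only the newest value for each key survives, and deleted keys are dropped. The sketch assumes every existing run takes part in the merge, which is what makes dropping deletion markers safe.

```python
import heapq

# Marker stored in place of a value to record a deletion (hypothetical name).
TOMBSTONE = object()


def compact(runs):
    """Merge sorted runs into a single sorted run.

    `runs` is ordered newest first; each run is a list of (key, value)
    pairs sorted by key, with unique keys inside a run.
    """
    # Tag each entry with the age of its run (0 = newest) so that, for
    # equal keys, the newest entry comes out of the merge first.
    tagged = [
        [(key, age, value) for key, value in run]
        for age, run in enumerate(runs)
    ]
    merged = []
    last_key = object()  # sentinel that compares equal to no real key
    for key, _age, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # older version of a key already handled: discard
        last_key = key
        # All runs are merged here, so no older data can hide behind a
        # deletion marker and the marker itself can be dropped.
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged


if __name__ == "__main__":
    newest = [("a", 3), ("b", TOMBSTONE)]
    oldest = [("a", 2), ("b", 1), ("c", 9)]
    print(compact([newest, oldest]))  # [('a', 3), ('c', 9)]
```

Every surviving entry is read and written again each time it takes part in a merge. That is the write amplification an LSM trades for fast ingestion, and it is the work most tuning options end up shifting around.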
Should you use RocksDB?
If you’re in need of an embedded key-value store, RocksDB should be the first thing you try. It’s fast, it’s feature-packed, and it almost has documentation!
But if there’s one thing I’d like you to take away from this post it is the following: we, software engineers, have a tendency to underestimate the complexity of a task while overestimating our abilities.
Persistence is a very difficult problem. It’s where software meets the laws of physics.
Don’t be a fool.
Don’t write your own.