Beyond Unit Tests: Building Confidence in Complex Systems


Quasar is a distributed, real-time data warehousing management system optimized for numerical data. Users rely on Quasar to stay up and serve queries accurately so they can make critical decisions.


How do we ensure that Quasar is reliable?

The difficulty with systems that have a huge contact surface, like a DBMS, is that so much can happen that you can’t reasonably anticipate it all. Even with a high level of development discipline, you’ll run into hard-to-resolve issues that only occur for a specific user within a particular environment.

In other words, the problem here is that you have two compounding effects feeding into each other:

  • A Turing-complete input
  • Many layers stacked on top of each other that are out of your control and can fail

Users don’t care that you crashed due to a kernel bug or configuration error; all they see is that you crashed.


Key Takeaways

  • There is no silver bullet in software engineering. Accept it, and you’ll suffer less.
  • Complex systems fail in messy, unpredictable ways: concurrency, networking retries, allocator quirks, OS knobs, you name it.
  • Quality is a budget. Spend too little and you drown in bugs later, spend too much and you stall progress.
  • Catching bugs at compile time is cheap; catching them in prod is expensive.
  • Unit tests won’t save you, but you still need an industrial quantity of them.
  • Integration, chaos, performance, and “test runner” suites close the gap where unit tests can’t.
  • Keep the build green, always. Flaky tests kill discipline faster than bugs do.
  • Reliability comes from process, not gimmicks.


So, how do you ensure the engine is as reliable as possible?

In this series, which we’ve called “Engineering Reliability at Scale” (we couldn’t find anything more pretentious; if you have a better idea, mail it to us, the best answer will win nothing), we’ll give you a glimpse of our development process here at Quasar and how, over the years, we invested in quality to bring a reliable system to the world.

In this installment, we’ll be discussing general considerations and how we approach testing at Quasar. In future posts, we’ll zoom in on specifics with concrete examples.


Getting a sense of the depth

Let’s assume you wrote a bug-free database engine. Is your bug-free database engine resistant to:

  • Hardware failures? And yes, the cloud exacerbates the issue.
  • A full disk? Nothing good happens then; many components (disks, file systems, operating systems) have “undefined behaviors” when that’s the case.
  • Networking errors? What happens during congestion, if users launch a denial of service against your server, or if you lose connectivity?
  • Running out of memory? Hello, you’ve run out of memory. Figure it out!

However, your engine won’t be bug-free. Here are some examples of delicacies you’ll face:

  • Concurrency: transactions, locks, false sharing, and race conditions. Did I mention that Quasar is heavily multithreaded? We play this game at the nightmare difficulty level. Good luck debugging the problem that only happens “under heavy load”.
  • Networking errors: bugs that only happen on retry after a failed connection attempt, or when you lose the connection and need to reconnect.
  • Invalid memory or undefined behavior that impacts the engine in very subtle ways, and of course, far from its origin.
  • Subtlety in system functions: Are you sure you read all the documentation for sys_call_magic()? Did you know that MAGIC_FLAG1 has changed meaning across glibc versions? Oh, by the way, there’s a bug in the kernel version from five years ago, you know, the one your customer is running: it just ignores the flag.
  • Memory allocator fun galore: fragmentation, undocumented behavior, performance challenges under pressure, and subtle interactions with the operating system’s paging… The gift that keeps on giving!
  • Configuration surprises: On most operating systems, numerous knobs can impact performance or reliability. Ignoring a single knob is not an option.

Process over gimmicks

There’s no silver bullet in software engineering; the sooner you accept it, the less you will suffer.

Anyone telling you that if you follow this method, use this tool, “open source it!”, switch to a different language, or “do this thing” then you won’t have issues, doesn’t understand why there are bugs in the first place.

Bugs arise from a gap between the problem and its solution. The gap can be due to a lack of technical prowess, but often it’s caused by misunderstanding or inaccurate information (which means LLMs won’t make software bug-free).

The way to improve quality is a defense-in-depth approach, with a process in place to catch errors as early as possible. For every problem that occurs, investigate it thoroughly and implement a solution that prevents the same issue from happening again.

This creates a virtuous circle of improving reliability over time.


The quality budget

Code needs to be shipped in a reasonable amount of time, and it’s a constant balance between how much you invest in quality and how fast you want to go.

If you invest nothing in quality, you’ll eventually end up going slower because any change will be challenging to validate.

If you invest too much, you lose speed on iteration and make the feedback loop longer, which paradoxically harms quality.

Ideally, you want to strike the balance that best fits your use case.

And that’s another vital parameter: how expensive is failure? Is a rocket going to explode, or is it just going to be the score of a game not being accurate? We can all agree that the latter is unacceptable, while rockets explode all the time, so who cares?

The follow-up question is, is your end-user willing to pay for the quality investment? And remember, you always pay, either in time (waiting for a release) or hard currency.

In other terms, what’s your quality budget?

At Quasar, we have both the blessing and the curse of a high budget. It’s a blessing because we have the time to deliver quality, but it’s also a curse because our users expect the software never to break, and when it does, they will call in the middle of our vacation. Just kidding, vacation is canceled.


Catching bugs at compile time

The earlier you catch a bug, the less expensive it is, which is why seeing a bug at compile time is ideal.

One of the most underappreciated testing tools we use is the C++20 compiler itself.

  • Concepts enable us to constrain templates, rejecting invalid query plans or unsafe types before the code even runs.
  • Static assertions encode invariants directly into the code: the build fails if they’re violated.
  • Metaprogramming enables us to generate entire families of type checks and tests automatically.

This turns the compiler into a first line of defense: many classes of bugs never even reach runtime. Consistently applying these principles throughout the codebase has spared us entire categories of undefined behavior and forced us to question our assumptions.
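
To make this concrete, here is a minimal, hypothetical sketch of the idea; the names column_buffer and NumericColumnValue are invented for illustration and are not Quasar’s actual code:

```cpp
#include <concepts>
#include <cstdint>
#include <type_traits>

// A concept constraining what a column may hold: unsupported types are
// rejected at compile time, never at runtime.
template <typename T>
concept NumericColumnValue = std::integral<T> || std::floating_point<T>;

template <NumericColumnValue T>
class column_buffer
{
public:
    void append(T /* value */) { /* store the value in a page */ }
};

// Invariants encoded as static assertions: the build fails if they break.
static_assert(sizeof(std::int64_t) == 8, "timestamps must be 64-bit");
static_assert(std::is_trivially_copyable_v<double>,
              "values must be memcpy-able into pages");

int main()
{
    column_buffer<double> ok;         // satisfies the concept, compiles
    ok.append(42.0);
    // column_buffer<const char*> bad; // rejected by the concept at compile time
}
```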


Let’s do some unit testing

In the context of Quasar, unit tests mainly help with adding features with confidence, but they rarely catch bugs or issues. This is because they can’t explore the combinatorial space of database execution paths.

That’s not an excuse not to write an industrial quantity of unit tests. If anything, you probably have fewer tests than you need. I have yet to see a project with “too many tests”. If you think you’ve seen too much testing, mail us an example for a chance to win nothing.

At Quasar, here is how we do it:

  • We use Boost Test as a framework, and unit tests are executed at every commit to the master and release branches. Don’t overthink which framework you use for unit testing; the impact isn’t that big.
  • Write lots of unit tests, and at the design phase, make sure your code has some introspection capacity so you can test internal states. Don’t rely solely on external state for testing; you’ll miss a lot. And don’t obsess over encapsulation or design purity: your user doesn’t care that your class is a proper implementation of a design pattern if the shell crashes on them.
  • You know you’re doing great if, as you write tests, they raise questions that need to be discussed with the rest of the team.
  • Don’t try to catch in unit tests what later stages will identify: performance regressions belong to the performance suite, load behavior to stress tests, and cross-component behavior to integration tests.
  • When a bug is caught in the wild, can you write a unit test that reproduces it? If yes, add it (see the sketch below).
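
As an illustration of such a regression test with Boost Test, here is a minimal sketch; the function dedup_sorted and the bug it encodes are hypothetical, not Quasar code:

```cpp
#define BOOST_TEST_MODULE regression_examples
#include <boost/test/included/unit_test.hpp>

#include <algorithm>
#include <vector>

// Hypothetical helper under test: sort the input and drop duplicates.
static std::vector<double> dedup_sorted(std::vector<double> v)
{
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

// A bug found in the wild gets a named test that reproduces it,
// so it can never silently come back.
BOOST_AUTO_TEST_CASE(regression_duplicate_values)
{
    const auto result = dedup_sorted({1.0, 1.0, 2.0, 3.0, 3.0});
    BOOST_TEST(result.size() == 3u);
}
```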

For a database, “it compiles and the tests pass” is just the beginning.


Let’s add even more testing

We, of course, don’t just write unit tests, and you didn’t check the interwebz today to be told that unit tests increase code quality (O’RLY?).

Quasar has a modular design, meaning we can replace a layer (for example, the network layer) with a test layer that generates errors. For instance, we have clustering and networking tests that run without the full database engine (just elementary key/value logic).

This allows us to write precise integration tests and catch complex issues.
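
As a sketch of the idea (the interface and class names are hypothetical, not Quasar’s actual API), a fault-injecting replacement for the network layer might look like this:

```cpp
#include <cstddef>
#include <random>
#include <stdexcept>

// The engine talks to the network through an interface,
// so tests can swap in a layer that injects failures.
struct network_layer
{
    virtual ~network_layer() = default;
    virtual void send(const void* data, std::size_t size) = 0;
};

// Test double that fails a configurable fraction of sends, letting
// integration tests exercise retry and reconnection paths.
class faulty_network_layer final : public network_layer
{
public:
    explicit faulty_network_layer(double failure_rate)
        : failure_rate_{failure_rate} {}

    void send(const void*, std::size_t) override
    {
        if (dist_(rng_) < failure_rate_)
            throw std::runtime_error{"injected network failure"};
        // otherwise: pretend the bytes went out
    }

private:
    double failure_rate_;
    std::mt19937 rng_{42}; // fixed seed keeps failures reproducible
    std::uniform_real_distribution<double> dist_{0.0, 1.0};
};

int main()
{
    faulty_network_layer net{0.3}; // roughly 30% of sends fail
    int failures = 0;
    for (int i = 0; i < 100; ++i)
        try { net.send("ping", 4); }
        catch (const std::exception&) { ++failures; }
    // failures is deterministic thanks to the fixed seed
}
```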

Every test suite runs on multiple operating systems (FreeBSD, Linux, Windows, OSX), using numerous compilers (MSVC, GCC, and Clang), and we have both 32-bit and 64-bit builds. Why so many platforms? Partly because our users demand it, but also because it helps us catch heisenbugs.

The other important piece is the performance suite.

We have developed a self-benchmarking tool that runs weekly, allowing us to identify performance regressions. The suite spawns a remote instance (to ensure data is sent over the network), imports data, queries it, and measures the query response time.

This investment paid off quickly, preventing us from shipping significant performance regressions, while also catching bugs (as the benchmark suite can sometimes crash or return errors).
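
The measurement core of such a suite can be very small. A hedged sketch using std::chrono, where the lambda stands in for whatever client call the real suite makes:

```cpp
#include <chrono>
#include <cstdio>

// Run a query callable and return wall-clock response time in milliseconds.
template <typename QueryFn>
double time_query_ms(QueryFn&& run_query)
{
    const auto start = std::chrono::steady_clock::now();
    run_query(); // a crash or error here is itself a caught bug
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    const double ms = time_query_ms([] { /* send query, wait for result */ });
    std::printf("query took %.2f ms\n", ms);
}
```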


The “test runner”

A critical aspect of testing is that writing tests needs to be inexpensive, which is why we developed a scriptable test engine that we call the “test runner” (if you have a better name idea, feel free to submit it for a chance to win nothing). This test engine takes a series of queries with expected results.

This test runner can be run against various configuration layouts, including in-memory-only instances, multiple nodes, and more. You write the test sequence once, and it’s automatically tested across multiple nodes.

The test runner also takes parameters about how tables are created. For example, you can use different shard sizes, validate TTL, or any parameter you can imagine. Again, write once, run many.
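
To illustrate the “write once, run many” idea, here is a hypothetical miniature of such a runner; execute() is a stub standing in for the real client call, and all names are invented for illustration:

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct test_step { std::string query; std::string expected; };
struct config    { std::string name; int nodes; bool in_memory_only; };

// Stub standing in for the real call that runs a query against a cluster
// built from `cfg` and returns the textual result.
static std::string execute(const config&, const std::string& query)
{
    return query == "SELECT 1" ? "1" : "?";
}

static bool run_suite(const std::vector<test_step>& steps,
                      const std::vector<config>& configs)
{
    bool ok = true;
    for (const auto& cfg : configs)     // write the sequence once...
        for (const auto& step : steps)  // ...replay it on every layout
            if (execute(cfg, step.query) != step.expected)
            {
                std::printf("[%s] FAILED: %s\n",
                            cfg.name.c_str(), step.query.c_str());
                ok = false;
            }
    return ok;
}

int main()
{
    const std::vector<test_step> steps{{"SELECT 1", "1"}};
    const std::vector<config> configs{
        {"in-memory single node", 1, true},
        {"three-node cluster",    3, false},
    };
    return run_suite(steps, configs) ? 0 : 1;
}
```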


Keeping the build green

The build should always be green. And by ‘green’, we don’t just mean that it compiles, but that all tests pass. The problem when tests are flaky, or the build is regularly red, is that people stop paying attention.

There is, again, no silver bullet; you have to keep enforcing that the build stays green, and when it eventually turns red, everything stops until it’s fixed. That includes hunting down flaky tests, ensuring the build process itself is reliable, and moving longer-running tests into a weekly process.

Trust me, it’s much cheaper to do this.


From Tests to Confidence

In the next post of this series, we’ll go even deeper: how we intentionally break Quasar with data corruption and chaos testing, and why that makes the system stronger.
