Quasar is a distributed, real-time data warehousing management system optimized for numerical data. Our users expect it to stay up, stay fast, and keep their data safe while they make decisions that actually matter.


In Part 1 of this series, we looked at how we think about testing and why unit tests alone are not enough for a system with a huge attack surface.

In this instalment, we zoom in on something a bit more uncomfortable. We intentionally break Quasar to make it more reliable.


Key Takeaways

  • Correctness is not enough. Inputs can be hostile, malformed, or resource-exhausting, and you must design for that.

  • Fuzzing catches real bugs, but it is only one layer. It complements unit, integration, and performance tests rather than replacing them.

  • Storage lies occasionally. Cloud or on premises, you cannot assume persistence is always correct.

  • Rare failures must be tested intentionally. Corrupting data on purpose is the only practical way to validate repair mechanisms and failure paths.

  • Assertions are powerful. Encoding assumptions directly in code exposes logic errors early and helps diagnose elusive production issues.

  • Reliability comes from layering. No single method provides safety, but many overlapping techniques do.

  • Building confidence requires breaking your own software regularly and ensuring it fails safely instead of catastrophically.

⏱️ 10 min read. Worth it if you care about systems that do not fall apart the first time hardware lies or input goes rogue.


Beyond “Correct”: Security and Reliability

Getting the correct answer is table stakes. For a system like Quasar, correctness is necessary but not sufficient.

We also have to worry about what happens when the input is hostile, malformed, or simply unlucky.

Here are a few classes of problems we want to avoid that are not strictly about “result equals expected value”:

  • An input that causes excessive resource usage.
    For example, you send 20 bytes and the server allocates gigabytes of RAM. Congratulations, you just invented a single-packet denial of service.
  • An invalid input that triggers a fault.
    Invalid memory access, stack overflow, integer overflow, and similar issues can crash the server or, in the worst case, allow arbitrary code execution.
  • An input that leaks information.
    For example, a bug that makes the server return credentials or internal memory content because a buffer was not properly cleared between replies.

Operating systems, runtimes, and compilers help, but the application is still responsible for a large part of the safety story.

To make things more interesting, Quasar’s input language is Turing complete. That means the space of possible inputs is effectively infinite. You cannot “just” test all of it.

So you need structure. And you need to assume that everything around you will eventually misbehave.


Why So Much Custom Code?

Quasar has multiple layers, each with its own data formats and protocols. That includes:

  • Custom serialization routines for network traffic
  • A bespoke persistence format
  • Our own compression stack
  • High-performance binary protocols that predate many modern libraries

If you are wondering why we did not simply “use library X,” the short answer is that when some of this was built, the landscape was not what it is today. For example, BoringSSL did not exist when we started more than a decade ago.

Some of that bespoke code absolutely brings value to the user. For example:

  • Compression tuned for numerical telemetry
  • SIMD implementations of aggregation functions
  • Optimized layouts for high-frequency time series

Other parts, in hindsight, may have been overkill for user-visible value. For example, implementing our own binary over-the-wire protocol with custom perfect forward secrecy on top of elliptic curve cryptography might impress other engineers, but real users mostly care about reliability, speed, and cost.

The important part is this. Custom low-level code means you cannot rely solely on the safety properties of third-party libraries. You have to test aggressively against malformed and adversarial inputs from day one.

Which brings us to fuzzing.


Fuzzing: Random Data with a Purpose

Fuzzing is the practice of feeding random or semi-random data into your code and verifying that it fails gracefully.

That does not mean “never fails.” It means:

  • The process does not crash.
  • It does not leak memory or secrets.
  • It returns an understandable error or at least preserves integrity.

We fuzz Quasar at several levels.

1. Component-level fuzzing

We inject random data into:

  • Decompression routines
  • Serialization and deserialization functions
  • Dispatchers that route messages and queries

The goal is to validate how these building blocks behave when the input is nonsense.
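
To make that concrete, here is the shape of a component-level harness in the libFuzzer style. The `deserialize_message` function is a placeholder stand-in for one of our routines, not Quasar’s actual API; only the shape of the harness matters.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for one of Quasar's deserialization routines. The real entry
// points and signatures differ; this placeholder just does a "magic byte" check.
static bool deserialize_message(const uint8_t* data, size_t size) {
    return size >= 4 && data[0] == 0x51;
}

// libFuzzer entry point: the fuzzer calls this with arbitrary byte strings
// and watches for crashes, hangs, and sanitizer reports.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
    // We do not care whether parsing succeeds, only that it fails gracefully:
    // no crash, no out-of-bounds access, no unbounded allocation.
    (void)deserialize_message(data, size);
    return 0;  // Non-zero return values are reserved by libFuzzer.
}
```

Built with `clang++ -fsanitize=fuzzer,address`, a harness like this explores the input space on its own and reports memory errors the moment they occur.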

2. Over-the-wire fuzzing

We have tools that generate random or perturbed payloads and send them to the server over the network using the actual protocol. That exercises:

  • Network stack
  • Parsing logic
  • Authentication and session handling
  • Error paths that are rarely hit in normal operation
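
A minimal sketch of the perturbation side, assuming a hypothetical host, port, and a captured valid payload. Starting from a valid message keeps the fuzzer exploring deep code paths instead of being rejected by the first length check; the real tool is protocol-aware and also checks the server’s response and health afterwards.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>
#include <random>
#include <vector>

// Flip a handful of random bytes in a copy of a known-valid payload.
std::vector<uint8_t> perturb(std::vector<uint8_t> payload, std::mt19937& rng) {
    if (payload.empty()) return payload;
    std::uniform_int_distribution<std::size_t> pos(0, payload.size() - 1);
    std::uniform_int_distribution<int> byte(0, 255);
    const int flips = 1 + static_cast<int>(rng() % 8);
    for (int i = 0; i < flips; ++i) {
        payload[pos(rng)] = static_cast<uint8_t>(byte(rng));
    }
    return payload;
}

// Send one perturbed payload to the server and ignore the reply.
// Host and port are illustrative parameters, not actual defaults.
void send_once(const std::vector<uint8_t>& payload, const char* host, uint16_t port) {
    const int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    ::inet_pton(AF_INET, host, &addr.sin_addr);

    if (::connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0) {
        ::send(fd, payload.data(), payload.size(), 0);
    }
    ::close(fd);
}
```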

3. On-disk fuzzing

We also randomize data on disk and then ask Quasar to load it.

This helps us understand:

  • How robust the persistence layer is
  • Whether metadata corruption can crash the server
  • How the system behaves when internal structures are inconsistent
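
A simplified sketch of that kind of tooling: copy a data file, overwrite a few random bytes in the copy, then point the server at it. The paths and the notion of a “segment” here are illustrative, not Quasar’s actual persistence layout.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <random>

// Copy an on-disk segment and flip a few random bytes in the copy.
// The real tooling then restarts the daemon against the corrupted copy.
void fuzz_file(const std::filesystem::path& src, const std::filesystem::path& dst,
               unsigned seed, int flips) {
    std::filesystem::copy_file(src, dst,
                               std::filesystem::copy_options::overwrite_existing);

    const auto size = std::filesystem::file_size(dst);
    if (size == 0) return;

    std::fstream f(dst, std::ios::in | std::ios::out | std::ios::binary);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::uintmax_t> pos(0, size - 1);

    for (int i = 0; i < flips; ++i) {
        char byte = static_cast<char>(rng() & 0xff);
        f.seekp(static_cast<std::streamoff>(pos(rng)));
        f.write(&byte, 1);  // Overwrite one byte at a random offset.
    }
}
```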

Fuzzing works. It catches real bugs. But it is not a magic wand.

  • You cannot find all bugs this way.
  • You may need to run fuzzers for a very long time to see rare issues.
  • You often need to constrain or partially structure the random input so that you explore meaningful paths.

So we treat fuzzing as one more layer in a wider test strategy, not an excuse to relax on unit tests, integration tests, or performance suites.


Storage Hardware: How Much Can You Trust It?

Many people implicitly assume that storage is reliable.

Cloud object storage is “eleven nines,” so surely if you wrote a file and got a success status, the data is there forever, right?

Reality is less comforting.

For high-usage clusters, we observe roughly this pattern:

  • About once per year, an upload will be reported as successful by a major cloud object store, but later it turns out the data is not there.
  • This is not covered by typical durability guarantees, because those guarantees apply once the data is durably stored. A silent failure before that is a different category.

This is not a dunk on cloud providers. Services like S3 offer incredible scale at very attractive price points, and Quasar integrates tightly with them while still delivering strong performance. The point is that even “industrial-grade” storage lies occasionally.

On-premises hardware is not magically better. Expensive storage arrays and hot-plug racks can also misbehave, especially under heavy load or when components fail in subtle ways.

You should not assume that data read from disk is always correct.

Here are a few failure modes we design for, along with rough frequencies we have observed in large deployments.

Data is corrupted

The persistence layer returns different bits from what we wrote.

We defend against this with:

  • Checksums
  • Validation during decompression
  • Tight deserialization logic

The odds that random bit flips pass checksums, decompression, and structural validation without being caught are extremely low. That means Quasar can detect corrupted data and ask you to repair it.

Approximate frequency: once in a lifetime per large cluster.
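
The idea, in simplified form, is to verify integrity before anything else touches the payload. The block layout and checksum function below are illustrative stand-ins, not the actual storage format.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in checksum (FNV-1a); the real format uses its own checksum.
uint32_t checksum32(const uint8_t* data, size_t size) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < size; ++i) { h ^= data[i]; h *= 16777619u; }
    return h;
}

// Hypothetical layout: each block stores the checksum of its payload.
struct block {
    uint32_t stored_checksum;
    std::vector<uint8_t> payload;
};

// Verify integrity before handing the payload to decompression and
// deserialization. Corruption is reported, never silently ignored.
const std::vector<uint8_t>& verified_payload(const block& b) {
    const uint32_t actual = checksum32(b.payload.data(), b.payload.size());
    if (actual != b.stored_checksum) {
        // A corrupted block is caught here, long before it can poison query
        // results; the caller surfaces this as a repairable error.
        throw std::runtime_error("block checksum mismatch: storage corruption");
    }
    return b.payload;
}
```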

A file is missing

Quasar uses an LSM (Log-Structured Merge) tree organized in multiple files. If one file disappears, the LSM structure becomes inconsistent during compaction.

Effects:

  • Performance degrades gradually.
  • Data in the missing segment is lost.

Mitigations:

  • “Paranoid mode” that re-verifies uploads at the cost of some performance (sketched below).
  • Monitoring patterns that detect early signs of missing segments.
  • Low-level repair tools to rebuild affected parts.

Approximate frequency on a busy 1 PB cloud-based cluster, paranoid mode disabled: about once per year.
On-premises, with well-managed storage, considerably less frequent.
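
Conceptually, paranoid mode boils down to a write-then-read-back verification. The `object_store` interface below is a hypothetical stand-in for the real cloud SDK integration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical minimal object-store interface.
struct object_store {
    virtual bool put(const std::string& key, const std::vector<uint8_t>& data) = 0;
    virtual bool get(const std::string& key, std::vector<uint8_t>& out) = 0;
    virtual ~object_store() = default;
};

// "Paranoid" upload: after a successful put, read the object back and compare
// it to what we wrote. Slower, but it catches uploads that were acknowledged
// and then silently lost or mangled.
bool paranoid_put(object_store& store, const std::string& key,
                  const std::vector<uint8_t>& data) {
    if (!store.put(key, data)) return false;

    std::vector<uint8_t> readback;
    if (!store.get(key, readback)) return false;  // Acknowledged but missing.

    return readback == data;  // Acknowledged but corrupted if this fails.
}
```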

Internal data is inconsistent

For example:

  • A column descriptor is missing.
  • Time values are out of expected order.
  • Metadata does not match the physical layout.

This can be triggered by:

  • Bugs
  • The previous types of corruption
  • Partial writes

Frequency is similar to the previous category, but these issues can remain invisible until a specific query accesses the broken path.

For all of these cases, the behavior we aim for is:

  1. Do not crash.
  2. Do not introduce further corruption while handling the error.
  3. Provide as much diagnostic information as possible.
  4. Fail gracefully when needed. If a query cannot be processed safely, return an error instead of pretending everything is fine.
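
As a sketch of what “fail gracefully” means in code, here is a hypothetical load path that turns a validation failure into a diagnostic the caller can act on, instead of crashing or guessing. The types and field names are illustrative.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: either the loaded column or a diagnostic message.
struct load_result {
    bool ok = false;
    std::string diagnostic;       // What is broken and where, for repair tooling.
    std::vector<double> values;   // Only meaningful when ok is true.
};

load_result load_column(const std::vector<double>& raw,
                        std::size_t expected_count) {
    load_result r;
    if (raw.size() != expected_count) {
        r.diagnostic = "column length " + std::to_string(raw.size()) +
                       " does not match metadata count " +
                       std::to_string(expected_count);
        return r;  // Fail gracefully: report, do not guess.
    }
    r.ok = true;
    r.values = raw;
    return r;
}
```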

The problem is clear. How do you test behaviors that are supposed to happen “once per year” or “never”?


Intentional Data Corruption

The first time we saw a corruption-related bug in the wild, it took us a long time to even suspect the storage layer. Everyone’s default assumption was “this must be our bug.”

Once we finally identified the issue, debugging was still difficult:

  • You cannot always extract data from a production system.
  • Even if you can, reproducing the exact hardware or timing conditions is often impossible.
  • You cannot attach a debugger to a client’s system in the middle of operations, especially when it is isolated or in a sensitive environment.

The answer was to build a corruption generator.

We created a tool that intentionally corrupts data in as many ways as we could think of:

  • Remove columns
  • Rename or mangle aliases
  • Corrupt data blocks
  • Randomize time indexes
  • Inject inconsistent metadata
  • Partially truncate files

Then we observe what Quasar does:

  • Does the server survive?
  • Do queries fail cleanly?
  • Is the error message actionable?
  • Does repair tooling do the right thing?

Every time we encounter a new failure mode in the wild, we add it to the corruption scenario library and run it regularly. That way, we make sure:

  • The system no longer crashes in that scenario.
  • The error path still works after future changes.
  • The repair tools behave correctly.

This is also the only realistic way to test REPAIR functionality. You need broken data to know whether repair works.
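
In spirit, the scenario library looks something like the sketch below. The two scenarios shown are simplified: the real ones are format-aware (dropping column descriptors, mangling aliases, shuffling time indexes), and the scenario names are illustrative.

```cpp
#include <filesystem>
#include <functional>
#include <map>
#include <string>

namespace fs = std::filesystem;

// Each named scenario damages a copy of a table's on-disk directory in a
// specific, reproducible way, so the same breakage can be replayed forever.
using scenario = std::function<void(const fs::path& table_dir)>;

std::map<std::string, scenario> corruption_scenarios() {
    return {
        {"truncate_segment", [](const fs::path& dir) {
             // Cut the first segment file in half to simulate a partial write.
             for (const auto& entry : fs::directory_iterator(dir)) {
                 const auto size = fs::file_size(entry.path());
                 fs::resize_file(entry.path(), size / 2);
                 break;
             }
         }},
        {"remove_segment", [](const fs::path& dir) {
             // Delete one file to simulate a missing LSM segment.
             for (const auto& entry : fs::directory_iterator(dir)) {
                 fs::remove(entry.path());
                 break;
             }
         }},
    };
}
```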

One interesting engineering challenge was to bypass integrity checks during write. Normally, when we write data to disk, we perform multiple verifications to avoid writing malformed data in the first place.

For the corruption tool, we had to be able to:

  • Turn off or bypass some of those guards selectively.
  • Do this without duplicating large swaths of code.
  • Keep the corruption logic well isolated from production paths.
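
One way to satisfy those constraints is a narrowly scoped guard that only the corruption tool ever instantiates. The sketch below illustrates the pattern; it is not Quasar’s actual mechanism, and the names are invented for the example.

```cpp
#include <atomic>

// Hypothetical process-wide switch that the write path consults before
// running its pre-write integrity checks. Production code never touches it.
inline std::atomic<bool> g_skip_write_checks{false};

// RAII guard: checks are bypassed only for the lifetime of the guard, so a
// forgotten reset cannot leave the process permanently unprotected.
class write_check_bypass {
public:
    write_check_bypass()  { g_skip_write_checks.store(true); }
    ~write_check_bypass() { g_skip_write_checks.store(false); }

    write_check_bypass(const write_check_bypass&) = delete;
    write_check_bypass& operator=(const write_check_bypass&) = delete;
};

// In the write path (sketch):
//   if (!g_skip_write_checks.load()) { run_integrity_checks(data); }
```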

The end result is that we can corrupt any part of a table, on any instance, in a controlled way. That gives us a very powerful lens into how Quasar behaves when the world goes wrong.


Assertions in the Wild

We like assertions. A lot.

Assertions encode assumptions directly into the code:

  • “This pointer must not be null.”
  • “This value must be positive.”
  • “This time column must be sorted.”
  • “The flux capacitor must receive 1.21 gigawatts.”

When an assertion fails, you immediately learn that your mental model is wrong.

One example from Quasar: for a table that represents a time series, we assert that the time column is chronologically ordered. If this invariant breaks, you know there is either:

  • A bug in the ingestion path, or
  • A hardware or storage issue that reordered or corrupted data
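
In code, that invariant is essentially a one-liner. The function below shows the shape of it, not the actual Quasar source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative invariant check: a time-series table's time column must be
// chronologically ordered.
void check_time_column(const std::vector<int64_t>& timestamps) {
    assert(std::is_sorted(timestamps.begin(), timestamps.end()) &&
           "time column must be chronologically ordered");
}
```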

We compile and run tests in both debug and release modes, so we catch invariant violations during automated testing.

But what about production, where you cannot simply crash on an assertion?

We maintain a special “instrumented” build that replaces assertions with log messages instead of aborting. This build:

  • Runs slower, because it performs more checks
  • Emits detailed logs when assumptions are violated
  • Is used in collaboration with specific customers to diagnose elusive issues
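
The mechanism can be as simple as an assertion macro whose failure behavior is chosen at compile time. The macro and flag names below are illustrative, not our actual build configuration.

```cpp
#include <cstdio>
#include <cstdlib>

// Same check everywhere; the instrumented build logs and keeps going,
// other builds report and abort.
#if defined(QSR_INSTRUMENTED_BUILD)
#define QSR_ASSERT(cond)                                              \
    do {                                                              \
        if (!(cond)) {                                                \
            std::fprintf(stderr, "assumption violated: %s (%s:%d)\n", \
                         #cond, __FILE__, __LINE__);                  \
        }                                                             \
    } while (0)
#else
#define QSR_ASSERT(cond)                                              \
    do {                                                              \
        if (!(cond)) {                                                \
            std::fprintf(stderr, "assertion failed: %s (%s:%d)\n",    \
                         #cond, __FILE__, __LINE__);                  \
            std::abort();                                             \
        }                                                             \
    } while (0)
#endif
```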

This has been invaluable for tracking down “heisenbugs” that we could never reproduce in our own lab, even with heavy stress testing and fuzzing.


From Chaos to Confidence

None of the techniques above are magic.

  • Fuzzing does not find every bug.
  • Corruption generators do not cover every possible failure.
  • Assertions cannot express every assumption.
  • Hardware will still surprise you when it fails in new ways.

The point is layering.

We stack multiple strategies:

  • Compile-time constraints and static assertions
  • A large suite of unit and integration tests
  • Performance and regression benchmarks
  • Fuzzers at different layers of the system
  • Tools that intentionally corrupt data and metadata
  • Assertion-heavy builds that can be deployed in the field

Each layer catches a different class of issue. Over time, this compound effect builds confidence.

You do not get reliability from any single best practice, library, or slogan. You get it from deliberately assuming that things will go wrong, then designing processes and tools that let you observe, resist, and repair those failures.

If you are building or operating complex systems, breaking your own software on purpose is something you should probably add to your toolbox.

In the next part of this series, we will go deeper into how we use monitoring, logging, and long-running chaos setups to keep Quasar honest. Or we will share a cake recipe. The world is unpredictable.
