Quasar is a distributed, real-time data warehousing management system optimized for numerical data. Our users expect it to stay up, stay fast, and keep their data safe while they make decisions that actually matter.
In the previous articles, we explored testing philosophy and why intentionally breaking systems is essential for building confidence. In this instalment, we zoom out. Reliability does not come from a single component behaving perfectly. It comes from the entire ecosystem behaving predictably when things fail.
Key Takeaways
- Perfect software is not enough. Even if Quasar had zero bugs, the systems around it introduce failure modes you must plan for.
- Reliability is about economics. The right level depends on the cost of failure, not abstract ideals.
- RPO and RTO are tradeoffs. Lowering them increases cost and complexity exponentially.
- Performance and reliability interact. Faster is not always safer. Safer is not always faster.
- Reliability is systemic. You must evaluate stream engines, storage layers, backup strategies, hardware, and cloud infrastructure.
- Failures will happen. Survival depends on preparation, not optimism.
⏱️ 10 min read. Worth it if you care about making your worst day tolerable rather than catastrophic.
A bold hypothesis
No software is bug-free. Not PostgreSQL, not Linux, not anything. Scan the latest patch notes of any large system and you will find bugs that look unsettling. This is normal. Complexity guarantees imperfections.
We have written at length about minimizing bugs inside Quasar. This article focuses on something else: everything around Quasar that can break, misbehave, or surprise you. Because even if Quasar were flawless, you could still run into serious problems:
• Misunderstanding an API contract
• Exceeding server capacity
• An operating system or hardware fault
If your mental model of reliability assumes a magic box that absorbs problems for you, reality will eventually correct that.
What do you really need?
When people talk about reliability, the instinctive answer is always “as much as possible.” That answer is useless until you define the cost of failure.
The right place to start is simply: What does a failure actually cost you?
Concrete framing helps keep the discussion grounded:
• In pulp and paper, damaging a Yankee roll because vibration alerts failed can cost $2 million and heavy downtime.
• In finance, missing a daily Value-at-Risk report triggers regulator escalation and can halt trading activity.
• In research, losing experimental data means rerunning the experiment and losing time you cannot buy back.
When reliability is anchored to objective cost, people stop asking for “infinite nines” and start asking for what they actually need.
Understanding reliability
Reliability is a balance between cost, performance, and recoverability. The goal is not "never fail." The goal is to fail in a controlled, diagnosable, recoverable way.
Two numbers dominate the conversation:
• RPO (Recovery Point Objective): how much data loss you can tolerate
• RTO (Recovery Time Objective): how quickly you must restore service
Driving either number downward increases cost non-linearly. More replicas. More orchestration. More hardware. More operator risk.
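To make the two objectives concrete, here is a back-of-the-envelope sketch. The numbers and the simple model (worst-case RPO equals the backup interval, RTO equals detection plus restore time) are illustrative assumptions, not Quasar measurements.

```python
# Back-of-the-envelope RPO/RTO estimate (illustrative numbers, not Quasar defaults).
# Worst-case RPO ~= time since the last backup; RTO ~= detection + failover + restore time.

def worst_case_rpo_minutes(backup_interval_minutes: float) -> float:
    """Data written since the last backup is what you stand to lose."""
    return backup_interval_minutes

def estimated_rto_minutes(dataset_gib: float,
                          restore_throughput_gib_per_min: float,
                          detection_and_failover_minutes: float) -> float:
    """Time to notice the failure, fail over, and restore the data."""
    return detection_and_failover_minutes + dataset_gib / restore_throughput_gib_per_min

if __name__ == "__main__":
    print("worst-case RPO:", worst_case_rpo_minutes(60), "min")           # hourly snapshots
    print("estimated RTO:", estimated_rto_minutes(2048, 10, 15), "min")   # 2 TiB at ~10 GiB/min
```

Halving either number usually means more frequent snapshots, more replicas, or faster (more expensive) restore paths, which is where the non-linear cost comes from.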
Performance interacts with these choices. Pushing for maximum throughput often removes the buffers that make systems resilient. Adding defensive checks improves safety but slows things down. The craft lies in finding the intersection that serves your workload rather than punishes it.
A simple example in Quasar: when writing to object storage, you choose between no verification, hash verification, or a full paranoid cycle where you write, checksum, download, and verify again. Some workloads genuinely require the paranoid option.
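A minimal sketch of those three verification levels is below. The `put_object`/`get_object` comments, the `write_block` helper, and the `VerificationLevel` names are illustrative stand-ins, not Quasar's actual API.

```python
# Illustrative write-verification levels for an object store.
# The in-memory dict stands in for whatever object-store client your deployment uses;
# none of these names come from Quasar's API.
import hashlib
from enum import Enum

class VerificationLevel(Enum):
    NONE = "none"          # fastest: trust the store's acknowledgement
    HASH = "hash"          # checksum the payload and store the digest alongside it
    PARANOID = "paranoid"  # write, then read back and re-verify the checksum

def write_block(store: dict, key: str, payload: bytes, level: VerificationLevel) -> None:
    digest = hashlib.sha256(payload).hexdigest()
    store[key] = payload                      # put_object(key, payload) in a real client
    if level is VerificationLevel.NONE:
        return
    if level is VerificationLevel.HASH:
        store[key + ".sha256"] = digest.encode()
        return
    # PARANOID: download what was just written and compare checksums end to end.
    echoed = store[key]                       # get_object(key) in a real client
    if hashlib.sha256(echoed).hexdigest() != digest:
        raise IOError(f"verification failed for {key}: stored object does not match")

# Example: the paranoid cycle costs an extra round trip but catches silent corruption.
bucket: dict = {}
write_block(bucket, "ticks/2024-01-01.bin", b"\x00" * 1024, VerificationLevel.PARANOID)
```

The point is not the code itself but the tradeoff it encodes: each step up the ladder buys confidence at the price of latency and bandwidth.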
Reliability is emergent. It is the product of choices across the entire stack: hardware, replication topology, consistency models, deployment discipline, and organizational habits.
Looking at everything
This is why zooming out matters. Assume Quasar behaves perfectly. Then ask:
• Does your stream engine have meaningful guarantees?
• How are your backups validated?
• Do you have redundancy on the storage layer?
• How quickly can you replace a failed server?
• What happens if your cloud provider loses an entire region?
Reliability planning is simple in principle: estimate the likelihood of each failure, multiply by the cost, and compare that with what you spend to mitigate it.
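As a sketch of that arithmetic, with entirely made-up probabilities and costs:

```python
# Expected annual loss vs. mitigation cost (all figures are illustrative).
failures = [
    # (scenario, probability per year, cost of one occurrence in USD)
    ("region outage",         0.05, 500_000),
    ("bad backup discovered", 0.10, 200_000),
    ("single server loss",    0.50,  20_000),
]

for scenario, probability, cost in failures:
    expected_loss = probability * cost
    print(f"{scenario:<22} expected annual loss: ${expected_loss:>10,.0f}")

# Compare each expected loss with the yearly cost of mitigating it:
# a $150,000/year multi-region setup only pays off if it removes more than
# $150,000/year of expected loss. For many workloads it does not.
```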
Most workloads never need multi-region or multi-cloud strategies. For others, it is the bare minimum.
Acceptance
The final point is never pleasant.
Something in your environment will break, and you will have a bad day. You will discover a missing piece in your risk analysis. You will underestimate recovery time. Diagnosis will take longer than expected.
That day is survivable if you planned for degraded operation rather than perfect operation. Reliability is not the absence of failure. It is the confidence that when failure happens, you still have a path forward.
In the next part of this series, we will go deeper into how we use monitoring, diagnostics, and long-running stress setups to keep ourselves honest. Or we will share a cake recipe. Reality will decide.
