Cache replacement policy (caching) is one of the hardest problems in computer science.

Given a set of elements n and a storage capacity m, usually much smaller than n, your job is to keep in storage the elements most likely to be used in the future. In other words, make the best decision possible about which entry to remove when the cache is full.

The staple of eviction algorithms is Least Recently Used (LRU), which removes the least-recently used item from the cache when the cache is full. It’s a simple and efficient algorithm that works well for most use cases.

However, LRU isn’t perfect.

LRU lacks a broader view of the workload. It only tracks the most recently accessed items, so it can confuse accidental recency with real importance. For example, if your browser cache holds pages you use daily and you click through 200 search results once, a plain LRU cache may evict the pages you rely on and keep the random results simply because they were accessed last. That data was cold, but LRU cannot tell; it only sees that it is recent.

Still, LRU is very frequently the best choice. Why is that?

Why is everyone using LRU, when we know there are better algorithms?

The short answer is that LRU is simple, and simple matters a lot here.

LRU is simple, and simple matters a lot here.

A typical LRU implementation is a hash table plus a list. The hash table gives you lookup. The list gives you the recency order. On access, you move the item to the front. On eviction, you remove from the back. Why does it matter? It’s O(1): constant complexity.

Complexity matters because eviction is often on the wrong side of a performance-sensitive path. If the cache is full and a write needs memory, the system may need to evict something before the write can continue. Depending on the architecture, that decision can block many other writes.

In other words, the cost of deciding what to evict is not a detail. It becomes part of the write path.

This is where discussions of cache policy can become misleading. It is easy to compare algorithms by hit ratio, but we often forget the cost of maintaining the metadata needed to get that hit ratio. A better eviction decision is only better if the cost of making it is low enough.

This is the main reason LRU keeps showing up. It is not because people do not know about better algorithms. It is because LRU gives a reasonable answer quickly, with little state, little complexity, and failure modes that are easy to understand.

This is why Quasar chose LRU as the core of its caching strategy: fast, predictable, and easy to maintain. Nobody wants to work until 2 am to fix a complex cache eviction bug.

So why did we take the risk to change the central piece of code?

The problem was scans.

With plain LRU, a large query that accesses a lot of data in a single pass can pollute the cache. Every item touched by the query becomes recent. The cache cannot tell whether the item is relevant or whether it was accessed once because it happened to be part of a scan.

The result is that a one-time query can evict data that was genuinely hot and replace it with data that may never be touched again.

This is not a corner case in a database. Users run large queries. Background jobs touch large ranges. Dashboards sometimes ask for more than they need. The cache should not become worse every time someone does something reasonable.

How we evaluated alternatives

Once you decide LRU is not enough, you’re faced with a new problem: the infinite amount of literature on caching strategies. By the time you’re done reading what has been written, there’s more stuff to read. The number one job was to conduct intensive pruning of the literature.

Here is what we needed from the next algorithm:

Scan-resistance
Very fast decision making, can be made (almost) lock-free, or at least very small contention
Reasonable complexity

I’m going to save you the time and discuss the most serious contenders we’ve looked at.

CLOCK is the first natural candidate after LRU. It is a smarter implementation design: instead of constantly maintaining a perfectly ordered list, it uses a reference bit and a circular sweep. That can make updates cheaper and may scale better in a multithreaded system. But CLOCK is not scan-resistant. It improves LRU’s mechanics more than its eviction decision, so a large one-time scan can still push cold data into the cache and disturb useful entries.

2Q is one of the classic scan-resistant policies. The idea is to avoid promoting data too aggressively after a single access. Data that has only been seen once is treated differently from data that has demonstrated reuse.

LRU-K takes a related approach. Instead of remembering only the last access, it remembers the last K accesses. This gives the cache more information about whether an item is genuinely being reused.

ARC and CAR go further. They try to balance recency and frequency, which is attractive because different workloads require different behavior. CAR in particular was interesting to us because it has good properties on paper and avoids some of the implementation costs of a fully list-based adaptive policy.

But this is where the paper version of the problem and the database version diverge.

We were not trying to implement the most impressive eviction policy. We were trying to improve Quasar’s cache behavior without slowing down the hot path or making it harder to reason about.

CAR, or something equivalent, looked like more complexity than we wanted to take on. The issue was not only the implementation time. It was also the number of moving parts, the extra metadata, and the synchronization questions in a multithreaded system.

Quasar is very much multithread land. An algorithm that looks clean in a single-threaded description can become much less attractive once you ask how often threads need to coordinate, how metadata is updated, and whether the policy introduces a new contention point.

We also considered random eviction, mostly because it is almost unbeatable on update cost. That sounds crude, but it is not a stupid baseline. If the cache policy becomes expensive enough, a “worse” policy that makes decisions quickly can win in practice.

Random eviction was not right for us because it ignores too much useful information. Recency and repeated access do tell you something. We did not want to throw that away. We wanted to keep most of LRU’s simplicity while addressing the specific weakness that hurt us.

That pushed us toward LRU-2.

Picking LRU-2, a variant of LRU-K

We picked LRU-2 because it is scan-resistant and still maintains a reasonable level of complexity.

LRU-2 is the K = 2 version of LRU-K. Plain LRU looks at the only timestamp it has. LRU-2 makes the decision based on the second timestamp.

What happens when the entry has been seen only once? This is where LRU-2 leaves a few practical choices open. You can keep those entries in a separate probationary list, prioritize them, or down-prioritize them. For scan resistance, down-prioritizing them is usually the point. A single access should not be enough to compete with data that has already shown reuse.

We picked LRU-2 because it is scan-resistant and still maintains a reasonable level of complexity.

However, LRU-2 is not just LRU with one extra timestamp. With plain LRU, the eviction order is simple: update a recency list on access and evict from the tail. With LRU-2, entries are ordered by their second-most-recent access, and some entries do not have one yet. Those one-time entries need a policy of their own: keep them separate, prioritize them, or make them easier to evict.

That makes the implementation materially more complex. Lookup is still straightforward with a hash table, but eviction requires maintaining sufficient ordering to quickly find a candidate. Otherwise, the cache has to scan entries at eviction time. Access updates also have more cases: first access, second access, and later accesses all update the metadata differently.

Quasar Implementation specifics

In this section, we’ll go over some optimizations specific to Quasar and explain the rationale behind them.

Single timestamp use case

Since our priority was to be scan-resistant and simple, we treat the single-timestamp use case as a higher-priority eviction, meaning an entry with a single timestamp (seen once) will be evicted before any entry with two timestamps (seen at least twice).

Ghost list

The above approach creates the problem that new entries may never be promoted to the cache, which is why every entry evicted is added to a ghost LRU. The ghost list preserves the timestamp at which the entry was seen. When a single timestamp entry is evicted, it’s moved to the ghost list. If that entry is seen again, it’s stored with two timestamps: the current timestamp and the one in the ghost list.

Data structure

We maintain a hash table for immediate lookup of an entry by name (updates and insertions) and a B-Tree to keep entries ordered by timestamp. A node on a large cluster (a multi-petabyte cluster) rarely exceeds 10M entries: the performance of a B-Tree at that cardinality is excellent.

To further improve performance, we encoded the two access times into a single 64-bit value. This gives us a compact representation of the LRU-2 state and lets us update and compare it efficiently and also improves the performance of the B-Tree, which is very “size of the entry” sensitive.

Unfortunately, that means we need to hold a lock for most update operations, which can create contention. To work around this, we shard the cache by entry name, greatly reducing the contention.

Aftermath

The new cache strategy has been made the default since our 3.14.2 release. All our significant deployments are running on that version, and here are the takeaways we’ve seen in production (think hundreds of billions of rows per day, petabytes of data, intense heterogeneous querying):

Eviction performance is identical
No change in insertion performance
Scan resistance, which resulted in lower eviction activity and overall improved performance

Factually, LRU-2 does “more work” at insertion, update, and eviction, but since you make better cache decisions, you evict less, and I/O operations are several orders of magnitude slower than memory operations.

The other part is that while LRU-2 is more complex, we’re still at a level of complexity that’s negligible compared to all the operations happening every time you run queries.