Monday July 18, 2022
Maps, dictionaries, associative containers… Whatever you call them, maps are omnipresent, but do you know how much the implementation matters?
This article will go over what we, at Quasar, learned about maps and hash tables after years of squeezing every little bit of power out of our CPUs.
Before we dive in, I just want to mention that we have an unlimited number of positions open for C++ software engineers. If you love C++ and working on big problems, why not come work for us?
At Quasar, we work on large-scale problems where performance matters. Small changes in the code can result in massive speed gains because when you work at the petascale, every little detail matters.
The Quasar database – a central piece of our offering – is a distributed, column-oriented database optimized for timeseries with a SQL-like interface.
Maps are used all over the place. From sorting input before writing, to building aggregations, maps significantly affect performance.
Over the years, the meticulous analysis of map structures and implementations paid huge dividends.
However, the context of Quasar may not be your context.
For example, hash maps with hundreds of keys are a sweet spot when aggregating data. Additionally, we can often accurately predict how large a map will be, which lets us optimize memory allocations. Another example: in many instances, we load already sorted data into a map.
Now that that’s out of the way, let’s dig in!
You could split maps into two families, which C++ developers will be familiar with: ordered and unordered. An ordered map guarantees that the entries will be sorted by key; the unordered map makes no such guarantee.
As you can imagine, the sorted guarantee comes with a price tag in terms of performance, but there’s more to it.
In C++, the sorted map (std::map) is usually implemented as a binary tree, and the unsorted map (std::unordered_map) is a hash table with closed addressing.
A hash table can deliver O(1) average lookup time, whereas a binary tree has O(log n) lookup. This means the number of elements in a hash table doesn’t, on average, influence the lookup speed.
Hash tables are fantastic, but they have many caveats and are thus more challenging to use well. In particular, they require an excellent hashing function (for integers and strings, the standard ones are okay and will avoid degenerate cases).
Hash tables also come in two families: open and closed addressing. Open addressing offers higher performance (unless there is high churn). Still, it doesn’t offer any guarantee in terms of reference validity, which is why STL implementations use closed addressing (see §23.2.5/13 and 14 of the C++11 standard).
Over the years, we haven’t limited ourselves to what the standard libraries offer us. Not only did we look at different hash table implementations, but we also worked with nonstandard containers, in particular sorted vectors.
Sorted vectors are much faster than binary trees for lookups, with potentially much more expensive insertion times. Used well, however, they deliver outstanding performance. For example, inserting already sorted data has amortized O(1) complexity per element.
Because there are hundreds of C++ libraries that offer data structures, we are not even remotely trying to be exhaustive.
We will discuss the following structures: std::map, std::unordered_map, Boost’s flat_map (a sorted vector), Boost’s unordered_map, and robin_hood’s hash map.
Before I share with you the benchmark result, let’s get things straight.
Benchmarks exist to give you a ballpark idea of how something behaves. They don’t represent real life, and you may get a very different result in your specific case.
Let me give you an example. Imagine that I give you a – correct – benchmark where I show you a hash table that inserts twice as fast as the standard implementation. You rush to use it in your code, and one month later, QA comes back saying they observe a substantial performance regression.
In your case, this hash table appears to perform poorly. Maybe it uses more RAM? Maybe it makes the code bigger, so your hot loop no longer fits in the instruction cache? Maybe the standard library’s hash function behaved better than the new hash map’s?
There are so many variables that your guess will almost always be wrong. A benchmark is an experiment whose results may or may not apply to your case. If there’s anything to take home from this article, it’s that.
For the PRNG, we used a custom implementation of xorshift. It does not need to be cryptographically strong but good enough for dispersion and, most importantly, faster than the map functions we bench. The seed is hardcoded so every map run is identical.
Ints are the raw output from the generator, while strings are a stringification of the integer using libfmt.
While data generation is probably not negligible compared to the fastest map implementations in the insertion benchmarks, all implementations suffer the same penalty, and thus the test remains fair.
We were interested in mappings between integers and strings (such as int to int and string to string) because those are the most frequent mappings we encounter in Quasar.
We benchmark for sizes between 8 and 4,194,304 as the overwhelming majority of our use cases fit in this range (larger maps in Quasar are concurrent).
We have two insertion tests.
One “full entropy”, one “low entropy”.
The full entropy will use the whole 64-bit range of data generation, whereas the low entropy reduces it to 10,000 possibilities. The low entropy has more collisions as the map grows, which is a more realistic representation of real workloads.
We insert the data using “try_emplace” to give as many optimization opportunities to the implementation as possible.
The lookup benchmark inserts random data into the map and then looks up eight (8) random keys. The keys are chosen before the benchmark runs; thus, the PRNG does not influence the measured speed.
For the hashing functions, we use a custom implementation of Murmur3 for integers and XXHash3 for strings.
For the allocator, every map used the TBB allocator unless it was impossible to change the allocator.
Setting the same hash function and allocator for every container levels the field, and we highly recommend doing so when benchmarking containers.
Google Benchmark was used to write the tests; it has the advantage of running each test multiple times for accurate speed measurements.
The benchmarks were run on Windows 11. The CPU was an Intel Core i9-10900KF.
The benchmark was compiled using Visual Studio 2022 17.2.3 using the x64 target platform with speed optimizations and AVX2 instructions enabled. The STL implementation is Dinkumware’s, included with MSVC.
For the full entropy, flat_map performs so poorly we had to remove it from the results. This is the tradeoff of sorted vectors.
Robin_hood is the best performer when the pair size is small enough to fit in the table, and the entropy is high. As soon as the pair is bigger (string to string), the STL implementation takes the lead by a small margin.
For low entropy, the difference between the STL implementation and robin_hood is smaller, which hints that the lookup performance difference between Robin Hood and the STL isn’t significant.
You can notice that for low entropy, the flat map outperforms the map quickly, except for larger objects. That’s because lookup is faster on a flat map, but when the cost of moving objects increases, the lookup benefit is eaten up.
std::unordered_map beats robin_hood in several benchmarks because Robin Hood doesn’t support allocator overloading, and whatever benefits open addressing offers are offset by TBB’s superior allocation strategy.
Boost’s implementation does better than std::map but is the slowest hash table implementation. This problem will be resolved in Boost 1.80.
In conclusion, if you’re only considering speed: for small pairs, Robin Hood is the fastest; otherwise, std::unordered_map wins.
The difference between robin_hood and Dinkumware’s implementation is minimal for lookup performance. If we had picked the STL’s default hashing function, robin_hood would have crushed the STL implementation, highlighting the importance of the hashing function.
When doing string-to-string mapping, the difference between all hash table implementations is almost invisible because the cost of hashing the key is higher than that of the actual lookup in the table. You can also see that the flat map implementation dominates the map, except for strings: there, comparing keys (string comparison) becomes non-negligible compared to jumping to the next element.
We use hash tables for aggregations, joins, or any time we need a fast lookup. Thus, hash table speed matters to us.
At Quasar, we used Robin Hood for a while because the high insertion performance was excellent for aggregations. However, Robin Hood has a huge implementation limitation: the allocator cannot be changed, and the allocation strategy is sub-optimal.
We noticed that it resulted in memory fragmentation or excessive memory usage for long-lived objects because the table doesn’t release any memory until it’s destroyed. Also, it calls malloc() and free() directly to reserve blocks and uses non-cache-friendly alignments. If you force a global allocator overload, it cooperates extremely poorly with an optimized allocation strategy.
Since 3.13.4, we reverted to the default STL hash table (on every platform), using our custom-designed hash functions and the TBB allocator. We are pleased with the current performance.
As mentioned above, we’re assessing Abseil and will re-evaluate the Boost implementation once 1.80 is available.
By looking at the benchmarks, you could conclude that flat maps aren’t great.
That’s forgetting an important point: when input data is sorted, insertion is amortized constant time: you’re just appending to the back of a vector. If you reserve the vector and use the moving initialization of Boost’s flat map, it is the fastest map you can initialize, even faster than an unordered map.
On top of that, allocation pressure is much lower. With a flat map you can reserve the structure and thus reduce the number of allocations. Long-lived programs like Quasar must minimize memory fragmentation.
At Quasar, we primarily use sorted vectors unless insertions and deletions are frequent, in which case binary trees are a much better fit.
We are currently reviewing B-Trees, which can be seen as an intermediate between binary trees and sorted vectors.
If you need an associative container and don’t know what to pick, use std::map. It has the advantage of excellent worst-case behavior.
If you are sure you don’t care about the order of entries, use std::unordered_map. Not caring about the order of the entries is not as obvious as it sounds. For example, std::unordered_map gives no guarantee that two identical maps will have their entries in the same order.
If you are unhappy with the performance of std::unordered_map, try first to change the hashing function and the allocator. If that doesn’t suffice, look for a good hash table with open addressing, Abseil is, to us, a good one. If the memory management of robin_hood doesn’t bother you, it can also be a good choice, especially if the pair is small.
With our review of Abseil and the impending release of Boost 1.80, we are bound to refresh our benchmarks and probably update our internal structures after a thorough impact evaluation.
If this article got you curious about Quasar, grab our free-of-charge community edition!
By the way, did I mention we are hiring?