Scaling with C++ 11
My “Scaling with C++ 11” talk is online.
It’s really great to see yourself in video because everything bad about your presentation becomes obvious. I already have a lot of ideas to improve this one and I hope I can deliver something much better at Meeting C++. I hope to see you there!
While I’m at it, there’s a couple of things I’d like to add.
I say that you have a hidden shared write with smart pointers, and I realize that this requires a more detailed explanation.
The hidden shared write is not because you have many threads accessing the same object, although this can be a problem, this is also an issue with raw pointers.
The performance penalty induced by shared pointers is because of the barriers added by the compiler when manipulating the reference counter. In itself, manipulating the reference counter has got a negligible cost, it’s because this counter manipulation needs to be atomic and because multiple threads are writing to the same value that you have a potential performance degradation in a multithreaded context.
If you look at Boost’s sharedptr implementation, and especially the reference counter one (in boost/smartptr/detail/shared_count.hpp) you will notice that the reference counter is heap allocated and shared by all instances.
This is why using makesharedptr is more efficient as it reduces the allocations count to one while increasing locality.
I have an old blog entry about this (However it seems that I don’t write too much about the shared write problem).
As you can see I obviously struggle with the read-copy update. A combination of over inflated ego and naivety made me believe I could explain this technique in one slide. Why don’t you read an in-depth explanation?
Oh you want a TL;DR, right?
Read-copy update creates new version at each modification, permitting an excellent scalability as concurrent reads scale perfectly and there are no writes to the same memory location. As I said in the talk, the problem is reclaiming old entries. To do that, access to RCU structures are done within read-side critical sections. You might be tempted to say, “so you’re implementing a lock-free structure, with a lock?”.
I will confuse you even more by saying that the kernel will not only be able to wait for all readers to finish, but it will be able to do so in a safe way, without tracking every reader and at the same speed whether you have 1 reader or 100,000.
To do that, you have a constraint, within a read-side critical section you are not allowed to sleep or block (that is, you cannot call any kind of kernel synchronization primitive such as a spin lock or relinquish your quantum). This means that if your CPU has switched context, it has finished working on the read-side critical section (if you set aside for a while the problem of interrupts and preemption).
The process is the following:
- Update the value
- New readers will access the new value and are therefore not an issue
- Force/wait for all CPU to switch context
- No CPU is accessing the old value, it is safe to delete it.
A way to understand the benefits of RCU is to see it as a way to transform a synchronization problem depending on the number of readers into a synchronization problem depending on the number of CPU. Not only you win on the number, but since it’s the kernel’s job to know what each CPU is doing, you have a lot of optimization opportunities.
I was asked why using a concurrent structure can induce problems you may not have with non-concurrent structures. I realize my explanation isn’t very satisfactory, to say the least.
Basically, the idea is that when you switch to a concurrent structure, you are most likely trying to remove a mutex from your code. This mutex might be protecting more than the access to the structure and its removal may induce a race condition.
Additionally, concurrent containers generally support concurrent addition and traversal, but not concurrent removal. Or they might support concurrent addition and removal, but not concurrent traversal. This is generally the moment when you stop reading this blog post and go back to your code, just after the sudden realization your code is packed with race conditions hits you.
Concurrent structures are very powerful, but not magical and generally cannot replace a regular structure without serious code auditing. I really encourage you to use them, but be careful.
I don’t know for you, but when I buy a Blu-Ray, I expect to get bonuses, and the least we can say is that this C++ Now talk doesn’t have a lot of bonuses. Fortunately for you, Santa is early this year.
The Star Trek episode I’m talking about is Déjà Q and I managed to find the exact passage:
Make our motto yours: nothing is impossible.