Memory allocation pitfalls on multi-core CPUs

Although they are less and less common nowadays, “thread-safe memory allocators” are still in use. What do I mean by this? A standard allocator designed for a single core, with a simple locking mechanism layered on top to avoid race conditions.
I am usually a big fan of “The simplest solution”(tm), but this one unfortunately leads to two big problems on multi-core architectures and therefore doesn’t really qualify as a ‘solution’ at all.
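To make the idea concrete, here is a minimal sketch of such an allocator. The class name LockedAllocator and its live-block counter are made up for illustration, and std::malloc / std::free merely stand in for whatever single-threaded allocator sits underneath (on real platforms they are usually thread-safe already).

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Hypothetical "thread-safe" allocator: a single-threaded allocator
// with one global lock bolted on top.
class LockedAllocator {
public:
    void* allocate(std::size_t size) {
        std::lock_guard<std::mutex> guard(lock_);  // every caller serializes here
        void* ptr = std::malloc(size);             // stand-in for the real allocator
        if (ptr != nullptr) {
            ++live_blocks_;
        }
        return ptr;
    }

    void release(void* ptr) {
        if (ptr == nullptr) {
            return;
        }
        std::lock_guard<std::mutex> guard(lock_);  // same lock for frees
        std::free(ptr);
        --live_blocks_;
    }

    // Number of blocks currently handed out (illustration only).
    std::size_t live_blocks() const {
        std::lock_guard<std::mutex> guard(lock_);
        return live_blocks_;
    }

private:
    mutable std::mutex lock_;
    std::size_t live_blocks_ = 0;
};
```

Every allocate and release call from every thread funnels through the same mutex, which is exactly where the two problems described below come from.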

Thread contention

I think it is pretty obvious that thread contention is bound to happen. While one thread is accessing the allocator (allocating or releasing memory), all other threads that are trying to do the same are blocked. It does not matter how fast the allocator is, as it will never be fast enough to avoid contention and blocked threads. This hurts performance especially in standard high-level gameplay code. As high-level gameplay code tends to use the allocator a lot (creating/destroying objects, growing/shrinking dynamic arrays, etc.), this is a recipe for just throwing away clock cycles. For no gain at all. And I am not talking about a few nanoseconds here: depending on the number of runtime allocations, this can add up faster than one might expect.
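One rough way to see this on your own machine is to count how often a thread finds the allocator lock already taken. The sketch below is an assumption-laden illustration, not a benchmark: measure_contention and ContentionStats are hypothetical names, std::malloc / std::free stand in for the allocator, and since try_lock is allowed to fail spuriously the contended count is only approximate.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

struct ContentionStats {
    std::size_t allocations;  // allocate/release pairs performed in total
    std::size_t contended;    // times a thread found the lock already held
};

// Hypothetical helper: `threads` workers each perform `per_thread`
// allocate/release pairs through one shared lock.
ContentionStats measure_contention(unsigned threads, std::size_t per_thread) {
    std::mutex allocator_lock;
    std::atomic<std::size_t> allocations{0};
    std::atomic<std::size_t> contended{0};

    auto worker = [&] {
        for (std::size_t i = 0; i < per_thread; ++i) {
            if (!allocator_lock.try_lock()) {
                // Somebody else is inside the allocator: count it, then
                // block exactly like a plain locked allocator would.
                contended.fetch_add(1, std::memory_order_relaxed);
                allocator_lock.lock();
            }
            void* block = std::malloc(64);  // stand-in for the real allocator
            std::free(block);
            allocator_lock.unlock();
            allocations.fetch_add(1, std::memory_order_relaxed);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back(worker);
    }
    for (auto& th : pool) {
        th.join();
    }
    return ContentionStats{allocations.load(), contended.load()};
}
```

Calling measure_contention(4, 100000) and comparing contended against allocations gives a feel for how often threads were waiting on each other; the actual numbers depend heavily on core count and scheduling.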

False Cache-Sharing

This is the more serious issue, and a less obvious one. Two threads work on unrelated data that happens to live in memory areas mapped to the same cache line. This is not a theoretical problem, but a situation that is not that unlikely to happen, and the probability of running into it increases with the amount of allocator contention. There is a good chance that a non-thread-aware allocator returns consecutive memory areas for consecutive allocations. If these allocation requests come from different threads, false cache sharing is waiting to happen.
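As a small illustration of how consecutive allocations end up on one cache line, consider the sketch below. BumpAllocator and same_cache_line are hypothetical names, the 16-byte rounding is an arbitrary choice, and the 64-byte line size is an assumption that holds on many x86 CPUs but should be checked for your target.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineSize = 64;  // assumption: typical, not universal

// Hypothetical non-thread-aware allocator: hands out consecutive chunks
// from one buffer, regardless of which thread is asking.
class BumpAllocator {
public:
    void* allocate(std::size_t size) {
        // Round each request up to a multiple of 16 bytes.
        const std::size_t rounded = (size + 15) & ~std::size_t{15};
        if (offset_ + rounded > sizeof(buffer_)) {
            return nullptr;  // out of space
        }
        void* chunk = buffer_ + offset_;
        offset_ += rounded;
        return chunk;
    }

private:
    alignas(kCacheLineSize) unsigned char buffer_[4096];
    std::size_t offset_ = 0;
};

// True if both addresses fall into the same cache line.
inline bool same_cache_line(const void* a, const void* b) {
    const auto line_of = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / kCacheLineSize;
    };
    return line_of(a) == line_of(b);
}
```

With this allocator, two back-to-back 16-byte requests land at offsets 0 and 16 of the same 64-byte line. If they came from two different threads, those threads now share a cache line without sharing any data.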

Example

Thread_A resides on CPU0.
Thread_B resides on CPU1.

Both threads are doing totally unrelated calculations and both of them are allocating some memory.
Let’s assume the two chunks they get fall within the same cache line.

This situation is called ‘false sharing’ or – what is even more fitting – ‘cache line ping-pong’. We have now created the biggest nightmare ( at least performance-wise ) for the cache-coherency protocol.

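- Thread_A writes to its memory.
- This invalidates the copy of the cache line held by Thread_B's CPU.
- The modified line in Thread_A's cache must be written back to memory (or handed over cache to cache, depending on the coherency protocol) …
- … and read back again into the cache of Thread_B's CPU before Thread_B can touch its own data.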

The same applies if Thread_B is modifying its memory area.
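The ping-pong described above can be provoked with two counters, as in the following sketch. All names (SharedLine, SeparateLines, run_ms) are made up for illustration, and the 64-byte cache line size is an assumption; C++17 also offers std::hardware_destructive_interference_size in <new>, where the standard library provides it.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

constexpr std::size_t kCacheLineSize = 64;  // assumption: check your target CPU

// Both counters sit in one cache line: every write by one thread
// invalidates the line in the other core's cache.
struct alignas(kCacheLineSize) SharedLine {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter gets a cache line of its own.
struct SeparateLines {
    alignas(kCacheLineSize) std::atomic<std::uint64_t> a{0};
    alignas(kCacheLineSize) std::atomic<std::uint64_t> b{0};
};

static_assert(sizeof(SeparateLines) >= 2 * kCacheLineSize,
              "counters must not share a cache line");

// Runs two threads, one hammering c.a and one hammering c.b, and
// returns the elapsed wall-clock time in milliseconds.
template <typename Counters>
double run_ms(Counters& c, std::uint64_t iterations) {
    auto bump = [iterations](std::atomic<std::uint64_t>& counter) {
        for (std::uint64_t i = 0; i < iterations; ++i) {
            counter.fetch_add(1, std::memory_order_relaxed);
        }
    };
    const auto start = std::chrono::steady_clock::now();
    std::thread thread_a(bump, std::ref(c.a));
    std::thread thread_b(bump, std::ref(c.b));
    thread_a.join();
    thread_b.join();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Comparing run_ms on a SharedLine against run_ms on a SeparateLines with a few hundred million iterations typically shows the padded layout finishing noticeably faster on a multi-core machine, although the size of the gap depends on the hardware and on where the scheduler places the two threads.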

If you are interested in more details and also some performance impact measurements, check out ‘Analysis of False Cache Line Sharing Effects on Multicore CPUs’.

#AltDevBlogADay

Martin Zielinski
@DarthCoder