So we all know that cache misses kill performance, since reading from memory is slower than tectonic plate movement. Reasonably recently, Data Oriented Design has gained a lot of traction for improving cache friendliness; but how do caches actually work?
This post will uncover some of their inner workings. It's going to be reasonably theoretical, so other than a few small tips near the end, this is not really a "do this and your code will be faster" kind of post. It's more aimed at fitting another piece into the puzzle of grokking the system.
First up we’ll cover some of the basics of caches, then we’ll get into how cache coherency protocols work.
Associativity
Obviously the idea behind a cache is that if a memory location is accessed once, it or nearby locations are likely to be accessed again soon after. The granularity of the blocks of memory a cache works with is the cache line size. One cache design is that when memory is accessed, the CPU rounds the address down to the nearest cache line boundary, and then looks throughout the entire cache to see if that line is currently there. If it is not currently in the cache, then something else is evicted to make room for it. This design, where the entire cache is checked and memory may be loaded into any cache entry, is called a fully associative cache.
A more common design than a fully associative cache is a set associative cache. Each memory location is associated with a set of cache entries, or "ways". When a memory address is checked to see if it is in the cache, the address is used to form an index to the correct set, and then only the ways in that set need to be checked. The set index is formed by masking the address with the way size minus one (i.e. taking the address modulo the size of a way), then dividing by the size of a line.
For example, let's say we have a 32KB, 4-way set associative cache, with a 128-byte cache line size. The total cache size is 32KB, so each of the 4 ways must be 8KB (0x2000). So if we accessed location 0xdeadbeef, we'd use the set index ( 0xdeadbeef & 0x1fff ) >> 7 = 0x3d.
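To make the arithmetic concrete, here is a minimal sketch of that set index calculation, using the hypothetical cache parameters from the example above (real hardware also keeps a tag per line and typically indexes on physical addresses, which is ignored here):

```c
/* A small sketch of the set index calculation above, using the same
 * hypothetical 32KB, 4-way, 128-byte-line cache (so each way is 0x2000
 * bytes). Real hardware also stores a tag per line; that is ignored here. */
#include <stdint.h>
#include <stdio.h>

enum { WAY_SIZE = 0x2000, LINE_SIZE = 0x80 };

static uint32_t set_index( uint32_t address )
{
    return ( address & ( WAY_SIZE - 1 ) ) / LINE_SIZE;
}

int main( void )
{
    printf( "set index of 0xdeadbeef = 0x%x\n",
            (unsigned)set_index( 0xdeadbeefu ) );  /* prints 0x3d */
    return 0;
}
```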
A fully associative cache can be seen as a special case of a set associative cache, where the number of ways is the total number of cache entries. The other extreme, a 1-way set associative cache, is known as a direct mapped cache.
Generally you don't need to worry too much about how many ways there are in a cache, but it can be important to keep in the back of your mind when working with multiple data streams at the same time. The following pathological case shows how bad it can get when things go wrong:
```c
float sum_floats( const float* __restrict__ stream0,
                  const float* __restrict__ stream1,
                  const float* __restrict__ stream2,
                  unsigned count )
{
    float total = 0.f;
    for( unsigned i = 0; i < count; ++i )
    {
        total += stream0[i];
        total += stream1[i];
        total += stream2[i];
    }
    return total;
}
```
Say this code was run on a CPU with a 64KB, 2-way set associative cache, and the input pointers were stream0=0x00008000, stream1=0x00010000, stream2=0x00018000. Here stream0[i], stream1[i] and stream2[i] all map to the same set. If the cache line size was 128 bytes, then each miss would fetch 32 floats from memory. But with three streams competing for two ways, each cache line will be evicted before we get to read the next float from the same stream, so we will get a cache miss on every access!
Of course, even once the cache aliasing problem is fixed, that code example is still nowhere near optimal : ) And if you want accuracy, then a Kahan Summation is much better.
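For reference, here is a minimal sketch of Kahan summation; the compensation term carries the low-order bits lost by each addition. Note that aggressive floating point optimisation settings (e.g. -ffast-math) can optimise the compensation away:

```c
/* A minimal sketch of Kahan (compensated) summation. */
float kahan_sum( const float* values, unsigned count )
{
    float total        = 0.f;
    float compensation = 0.f;
    for( unsigned i = 0; i < count; ++i )
    {
        float y = values[i] - compensation;
        float t = total + y;
        compensation = ( t - total ) - y;  /* the bits of y that didn't make it into t */
        total = t;
    }
    return total;
}
```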
The PPE Cache Hierarchy

As a concrete example, here is the cache hierarchy of the Cell's PPE.

| | L1 icache | L1 dcache | L2 |
| --- | --- | --- | --- |
| Size | 32KB | 32KB | 512KB |
| Associativity Ways | 2 | 4 | 8 |
| Write Miss Allocate | N/A (read only) | No | Yes |
| Write Through/Back | N/A (read only) | Through | Back |

The L2 is inclusive of the L1 dcache, but not the L1 icache. Also, as well as the 8-way set associative mode, the L2 can be configured to be direct mapped (L2_ModeSetup1.RMT_Mode=11b).
CPUID

On x86 processors, the CPUID instruction can be used to determine the cache configuration. This is a fairly complex instruction, and there are actually several ways to call it and get cache information. The older way, EAX=2, is quite painful, as it requires software to keep a very large lookup table of values to decode the output. Newer CPUs support EAX=4, which gives much easier to parse output. AMD and some other non-Intel CPUs also return cache info for EAX=0x80000005 and EAX=0x80000006, though even Intel CPUs that support larger EAX=0x800000xx values do not support returning cache information this way (at least not my Nehalem Core i7).
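As a rough sketch of the EAX=4 path, something along these lines can be used with GCC/Clang's <cpuid.h> (assuming a CPU that supports the deterministic cache parameters leaf; the EAX=2 and AMD EAX=0x800000xx paths are not shown):

```c
/* A rough sketch of enumerating caches via CPUID leaf 4 (EAX=4), using
 * GCC/Clang's <cpuid.h>. Field layout is from the Intel SDM; the older
 * EAX=2 descriptor tables and the AMD EAX=0x800000xx leaves are omitted. */
#include <cpuid.h>
#include <stdio.h>

int main( void )
{
    for( unsigned subleaf = 0; ; ++subleaf )
    {
        unsigned eax, ebx, ecx, edx;
        __cpuid_count( 4, subleaf, eax, ebx, ecx, edx );

        unsigned type = eax & 0x1f;                 /* 0 = no more caches */
        if( type == 0 )
            break;

        unsigned level      = ( eax >> 5 ) & 0x7;
        unsigned ways       = ( ( ebx >> 22 ) & 0x3ff ) + 1;
        unsigned partitions = ( ( ebx >> 12 ) & 0x3ff ) + 1;
        unsigned line_size  = (   ebx         & 0xfff ) + 1;
        unsigned sets       = ecx + 1;

        printf( "L%u %s: %u bytes, %u ways, %u byte lines\n",
                level,
                type == 1 ? "dcache" : type == 2 ? "icache" : "unified",
                ways * partitions * line_size * sets, ways, line_size );
    }
    return 0;
}
```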
Coherency

When there are multiple processors in a system, their caches need to communicate with each other in order to keep the whole system consistent. Two different caches should never disagree on the contents of a line. The mechanism by which the caches talk to each other is known as a cache coherency protocol.

There are two main categories of coherency protocols, snoopy (*) and directory based. Snoopy protocols work by every cache being able to see every memory transaction on a bus. Caches watch/snoop other bus transactions to see if any of them affect any data they currently have cached. Snoopy protocols are fast for small numbers of bus agents, but they do not scale so well. Directory based protocols handle large scale systems better, but tend to be slower for small systems. Snoopy protocols are what we'll cover here. There are actually systems where no automatic consistency is guaranteed, but we'll ignore them here.

(*) I really wanted to insert an image here of Snoopy the cartoon dog; the detective ones with him and a magnifying glass would have been perfect, but the copyright status of all the images floating around the net is very unclear : (
MESI

A common snoopy protocol is MESI. MESI has been used by Intel in its x86 processors since the original Pentium. It's an acronym for the four states a cache line may be in in this protocol: Modified, Exclusive, Shared or Invalid. The following table, shamelessly stolen from the "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A", summarises the states.

| Cache Line State | M (Modified) | E (Exclusive) | S (Shared) | I (Invalid) |
| --- | --- | --- | --- | --- |
| This cache line is valid? | Yes | Yes | Yes | No |
| The memory copy is… | Out of date | Valid | Valid | - |
| Copies exist in caches of other processors? | No | No | Maybe | Maybe |
| A write to this line… | Does not go to the system bus. | Does not go to the system bus. | Causes the processor to gain exclusive ownership of the line. | Goes directly to the system bus. |

There are two basic ways a cache line can change state: an action by the processor who owns the cache, or an external bus event. The following diagram shows the basic CPU initiated ways that a cache line can change state. When the CPU performs a load and the cache line was currently invalid, then if no other cache currently held a copy, it becomes exclusive, else shared. When the CPU performs a store, the cache line will become modified. If the line was previously exclusive, then the transition to modified can be done without sending any commands to the bus. If the line was previously shared, then a "data claim" (DCLAIM) command needs to be sent to gain ownership. And if the line was previously invalid, then a "read with intent to modify" (RWITM) command both reads the data and gains ownership.

Now the previous diagram was just to show the basic ideas; it is not comprehensive. Some CPUs have additional cache control instructions that can cause other state transitions, for example an instruction to flush the cache could change a line from modified to invalid.

The next diagram shows the state transitions triggered by snooped bus commands. When a cache has a line in either the modified or exclusive state, and it snoops a read of this line, it intervenes on the bus and gives a copy of the data to the reader. If the line was in the modified state, it is also then written out to memory. The line then transitions to the shared state. When a cache has a line in any of the valid states (modified, exclusive or shared) and it snoops a command to invalidate the line, the state changes to invalid. If the line was modified, then the contents are written out to memory. Here "invalidate" isn't a single bus command, rather it is one of the effects of several different commands. The RWITM and DCLAIM commands mentioned earlier are two examples that fall under the "invalidate" banner in the previous diagram.
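The CPU-initiated transitions can be summarised as a little toy state machine. This is only a model of the protocol as described above, not how any real hardware implements it:

```c
/* A toy model of the CPU-initiated MESI transitions described above.
 * This just summarises the protocol; it is nothing like a real
 * hardware implementation. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

/* CPU load: an invalid line becomes exclusive if no other cache holds a
 * copy, otherwise shared. Already-valid lines keep their state. */
mesi_state on_cpu_load( mesi_state s, int other_cache_has_copy )
{
    if( s == INVALID )
        return other_cache_has_copy ? SHARED : EXCLUSIVE;
    return s;
}

/* CPU store: the line always ends up modified. From exclusive, no bus
 * command is needed; from shared, a DCLAIM gains ownership; from
 * invalid, a RWITM both reads the data and gains ownership. */
mesi_state on_cpu_store( mesi_state s )
{
    switch( s )
    {
        case SHARED:  /* send DCLAIM on the bus */ break;
        case INVALID: /* send RWITM on the bus  */ break;
        default:      /* exclusive/modified: no bus command needed */ break;
    }
    return MODIFIED;
}
```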
MERSI

There are many similar cache coherency protocols like MSI, MOSI, MOESI, MERSI, etc. Again these are acronyms for the possible cache line states. To confuse matters, there are actually at least two different MERSI protocols.
Cache Coherency In The Cell

The Cell extends (the "recent" version of) MERSI further still, using MERSI plus unsolicited modified (Mu) and tagged (T) states. IBM didn't seem to want to follow the previous acronym trend, probably because there are so many these days, and as we have seen, they are starting to clash. Though I do like the sound of MERSIMuT : )

If at this point you're thinking "but in the Cell, only the PPE has a cache, not the SPEs, so why all this?", the answer is that the SPEs do actually each have a cache. The atomic unit in each SPE is in fact a small cache. And this cache coherency protocol allows for very fast access to shared data on the Element Interconnect Bus (EIB).

Again we need to do some investigative work to find out what these unsolicited modified and tagged states are. The CellBE Handbook only gives their names; the "IBM (e)server POWER4 System Microarchitecture" paper describes the equivalent POWER4 states. The following table summarises all the states.
| Cache Line State | M (Modified) | E (Exclusive) | R (Recent) | S (Shared) | I (Invalid) | Mu (Unsolicited Modified) | T (Tagged) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| This cache line is valid? | Yes | Yes | Yes | Yes | No | Yes | Yes |
| The memory copy is… | Out of date | Valid | - | - | - | Out of date | Out of date |
| Copies exist in caches of other processors? | No | No | Maybe | Maybe | Maybe | No | Maybe |
| A write to this line… | Does not go to the system bus. | Does not go to the system bus. | Causes the processor to gain exclusive ownership of the line. | Causes the processor to gain exclusive ownership of the line. | Causes the processor to gain exclusive ownership of the line. | Does not go to the system bus. | Does not go to the system bus. |

BTW, if you've ever looked at some of the Cell performance counters and thought "WTF do these mean?", then understanding the cache states will have cleared some of them up.
The PPE Cache Hierarchy Again

If we now look back at the PPE cache hierarchy, we can see the reason behind some of the design choices. Here's the table from earlier:
| | L1 icache | L1 dcache | L2 |
| --- | --- | --- | --- |
| Size | 32KB | 32KB | 512KB |
| Associativity Ways | 2 | 4 | 8 |
| Write Miss Allocate | N/A (read only) | No | Yes |
| Write Through/Back | N/A (read only) | Through | Back |
And remember we said the L2 is inclusive of the L1 dcache, but not the L1 icache.
Since the L2 is inclusive of the L1 dcache, only the L2 needs to snoop the EIB. Fewer agents on the bus means faster combined snoop responses.
The L2 is write back to reduce traffic on the EIB, but the L1 dcache is write through. Because the L1 is write through, the L2 does not need to check whether the L1 dcache has modified a line when the L2 copy is exclusive, allowing faster transitions to shared.
As the L2 is inclusive of the L1 dcache, if a line that is in both needs to be evicted from the L2, then it must also be evicted from the L1 dcache. Since the L2 is a unified cache, requests from the L1 icache can therefore cause L1 dcache evictions. The L1 icache and dcache have 2 and 4 ways respectively; by having 8 ways, the L2 greatly reduces the occurrence of evictions like this.
Tips
Organise Your Data
This is by far the best thing you can do performance-wise. If you can get the same data in fewer memory accesses, that is a Good Thing. With competing constraints, this can't always be achieved perfectly, but it is worth aiming for.
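As a contrived sketch of the idea (the struct layout here is entirely made up), pulling the one field you actually need into its own contiguous array means every cache line fetched is full of useful data:

```c
/* A contrived illustration: summing one field of a large struct touches
 * a whole cache line per element, most of it wasted. Storing that field
 * in its own contiguous array gets the same data in far fewer lines.
 * The Entity layout is made up purely for the example. */
typedef struct
{
    float health;
    float padding[31];   /* stand-in for all the other fields we don't need */
} Entity;                /* 128 bytes: one whole cache line per entity */

float total_health_aos( const Entity* entities, unsigned count )
{
    float total = 0.f;
    for( unsigned i = 0; i < count; ++i )
        total += entities[i].health;      /* one line fetched per element */
    return total;
}

float total_health_soa( const float* healths, unsigned count )
{
    float total = 0.f;
    for( unsigned i = 0; i < count; ++i )
        total += healths[i];              /* 32 useful floats per 128-byte line */
    return total;
}
```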
Out Of Order Execution and In Order Execution
One of the big benefits of out of order execution is hiding cache misses. While waiting for the result of a load, the processor can reorder instructions so that it can do as much work as possible without blocking, effectively hiding most, if not all, of the latency. Reducing dependency chains helps the processor do more reordering; my first #AltDevBlogADay post (/2011/05/25/instruction-level-parallelism/) describes this process (though not in the context of cache misses).
In order cores don't have it so easy here, but they don't block immediately on the load either. An in order core only needs to block on the first instruction that uses the value that was loaded. This can be seen in a sampling profiler where you get an unusually high sample count on an unassuming instruction like, say, an XOR. If you look back to where the instruction's inputs have come from, one of them could be a load.
This in order execution behaviour can be used to our advantage. If the code needs to do several scattered memory reads, then doing them all up front can overlap the costs. The processor will execute each of the loads before it needs to block. But be careful of compiler reordering. Generally compilers will schedule assuming an L1 hit, since they can't really know any better. Mike Acton mentions a handy GCC-specific trick that can be used here to prevent GCC from moving the loads from where you want them.
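As a sketch of the idea (the Node type and fields here are made up), issuing the scattered reads before any of the results are consumed lets an in-order core overlap the misses instead of paying for them one after another:

```c
/* A sketch of issuing scattered reads up front so an in-order core can
 * overlap the misses. The Node type is made up for the example; check
 * the generated code, as the compiler may still reorder the loads. */
typedef struct Node { float value; struct Node* next; } Node;

float sum_three_nodes( const Node* a, const Node* b, const Node* c )
{
    /* All three loads are issued before any result is consumed, so the
     * core only blocks at the first add, after all misses are in flight. */
    float x = a->value;
    float y = b->value;
    float z = c->value;
    return x + y + z;
}
```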
Prefetching
Prefetching data before it is required can most certainly be a big help, but it ain't no silver bullet either. Prefetching in loops often needs a bit of tuning to get the right distance to prefetch ahead. Especially with fixed hardware (e.g. consoles), you should always profile when you're changing prefetching. Without care, prefetching can be close to pointless, and occasionally it can actually make things slower! Profiling is a more difficult problem for PC code, which can be run on vastly differing machines.
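As a sketch (GCC-flavoured, where __builtin_prefetch maps to DCBT on PPC), the prefetch distance below is just a made-up starting point that would need profiling and tuning:

```c
/* A sketch of software prefetching in a loop. PREFETCH_AHEAD is a
 * made-up starting point; the right distance depends on the work per
 * element and the memory latency, so profile and tune it. */
#define LINE_FLOATS    32                   /* 128-byte cache lines */
#define PREFETCH_AHEAD ( 8 * LINE_FLOATS )  /* prefetch 8 lines ahead */

float sum_with_prefetch( const float* data, unsigned count )
{
    float total = 0.f;
    for( unsigned i = 0; i < count; ++i )
    {
        if( i + PREFETCH_AHEAD < count )
            __builtin_prefetch( &data[i + PREFETCH_AHEAD] );
        total += data[i];
    }
    return total;
}
```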
Data streams
Briefly mentioning the PPC data streams here for completeness, though personally I've never had much success with these! Some special encodings of the DCBT instruction can be used to start and stop automatic loading of data into the L2. In several cases where I've previously tried using data streams, they just made things slower. Obviously if they were never useful, they wouldn't be there; I guess I just haven't encountered/noticed those cases yet.
DCBZ
This is an interesting PPC instruction. If you know that you are going to be overwriting an entire cache line (say with a large memcpy()), then DCBZ will put the cache line into the modified state without ever loading it, and zero its contents. Usually a cache line is claimed for exclusive ownership from the shared state with a DCLAIM bus command; DCBZ also uses DCLAIM, but the line can previously be in the invalid state.
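As a sketch (PowerPC only, via inline asm; it assumes 128-byte cache lines and a buffer that is 128-byte aligned and a whole number of lines long):

```c
/* A sketch of zeroing a buffer with DCBZ via GCC inline asm. PowerPC
 * only; assumes 128-byte cache lines and that the buffer is 128-byte
 * aligned and a whole number of lines long. Each DCBZ establishes the
 * line in the modified state and zeroes it without reading memory. */
#define CACHE_LINE_SIZE 128

static void zero_buffer_dcbz( void* buffer, unsigned size )
{
    char* p = (char*)buffer;
    for( unsigned offset = 0; offset < size; offset += CACHE_LINE_SIZE )
        __asm__ volatile( "dcbz 0,%0" : : "r"( p + offset ) : "memory" );
}
```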
Wrap Up
OK, it may be slightly controversial to have had a discussion of caches without a single mention of address translation or Translation Lookaside Buffers (TLBs). I think address translation deserves its own post entirely, so I'll cover that stuff next time.