So we all know that cache misses kill performance, since reading from main memory is slower than tectonic plate movement. Reasonably recently, Data Oriented Design has gained a lot of traction for improving cache friendliness; but how do caches actually work?

This post will uncover some of their inner workings. It's going to be reasonably theoretical, so other than a few small tips near the end, this is not really a “do this and your code will be faster” kind of post. It's aimed more at fitting another piece into the puzzle of grokking the system.

First up we’ll cover some of the basics of caches, then we’ll get into how cache coherency protocols work.

Associativity

Obviously the idea behind a cache is that if a memory location is accessed once, it or nearby locations are likely to be accessed again soon after. The granularity of the blocks of memory a cache works with is the cache line size. One cache design is that when memory is accessed, the CPU rounds the address down to a cache line boundary, and then looks throughout the entire cache to see if that line is currently there. If it is not currently in the cache, then something else is evicted to make room. This design, where the entire cache is checked and memory may be loaded into any cache entry, is called a fully associative cache.

A more common design than a fully associative cache is a set associative cache. Each memory location is associated with a set of cache entries, or “ways”. When a memory address is checked to see if it is in the cache, the address is used to form an index to the correct set, then only each of the ways in that set needs to be checked. The set index of the memory address is formed by masking with the size of a way minus one, then dividing by the size of a cache line.

For example, lets say we have a 32KB, 4-way set associative cache, with a 128-byte cache line size. The total cache size is 32KB, so each of the 4 ways must be 8KB (0x2000). So if we accessed location 0xdeadbeef, we’d use the set index ( 0xdeadbeef & 0x1fff ) >> 7 = 0x3d.
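
As a quick sanity check, here is a minimal sketch of that set index calculation. The cache parameters are just the ones assumed in the example above, not queried from any real hardware:

  #include <stdint.h>
  #include <stdio.h>

  /* Parameters from the example above: 32KB, 4-way, 128-byte lines. */
  enum {
      CACHE_SIZE = 32 * 1024,
      NUM_WAYS   = 4,
      LINE_SIZE  = 128,
      WAY_SIZE   = CACHE_SIZE / NUM_WAYS   /* 8KB, 0x2000 */
  };

  static unsigned set_index( uintptr_t address ){
      /* mask with the way size minus one, then divide by the line size */
      return (unsigned)( ( address & (WAY_SIZE - 1) ) / LINE_SIZE );
  }

  int main( void ){
      printf( "0x%x\n", set_index( 0xdeadbeef ) ); /* prints 0x3d */
      return 0;
  }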

A fully associative cache can be seen as a special case of a set associative cache, where the number of ways is the total number of cache entries. The other extreme, a 1-way set associative cache, is known as a direct mapped cache.

Generally you don’t need to worry too much about how many ways there are in a cache, but it can be important to keep in the back of your mind when working with multiple data streams at the same time. The following pathological case shows how bad it can get when things go wrong,

 
  float sum_floats(
      const float* __restrict__ stream0,
      const float* __restrict__ stream1,
      const float* __restrict__ stream2,
      unsigned count ){
      float total = 0.f;
      for( unsigned i = 0; i < count; ++i ){
          total += stream0[i];
          total += stream1[i];
          total += stream2[i];
      }
      return total;
  }
 
  

Say this code was run on a CPU with a 64KB, 2-way set associative cache, and the input pointers were stream0=0x00008000, stream1=0x00010000, stream2=0x00018000. Here stream0[i], stream1[i] and stream2[i] all map to the same set. If the cache line size is 128 bytes, then each miss fetches 32 floats from memory, but with only 2 ways and 3 competing lines, each line is evicted before we get to read the next float from the same stream, so we get a cache miss on every single access!
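
One way to dodge this particular pitfall is to make sure the streams don't all start at addresses that are an exact multiple of the way size apart, for example by inserting a cache line of padding between them when they are carved out of one big allocation. A minimal sketch, assuming the same 64KB 2-way cache with 128-byte lines; the helper and layout are mine, not from any particular allocator:

  #include <stdlib.h>

  enum { LINE_SIZE = 128 };

  /* Carve three float streams out of one allocation, inserting one extra
     cache line of padding between them so that stream0[i], stream1[i] and
     stream2[i] no longer share a set index. Sketch only; error handling
     and alignment are glossed over. */
  static int alloc_padded_streams( unsigned count, float* streams[3] ){
      size_t stream_bytes = (size_t)count * sizeof(float);
      char* block = malloc( 3 * ( stream_bytes + LINE_SIZE ) );
      if( !block )
          return 0;
      streams[0] = (float*)( block );
      streams[1] = (float*)( block + stream_bytes + LINE_SIZE );
      streams[2] = (float*)( block + 2 * ( stream_bytes + LINE_SIZE ) );
      return 1;
  }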

Of course, even once the cache aliasing problem is fixed, that code example is still nowhere near optimal : ) And if you want accuracy, then a Kahan Summation is much better.

The PPE Cache Hierarchy

To make the rest of this post a bit more concrete, here is the cache hierarchy of the Cell's PPE:

                      L1 icache        L1 dcache  L2
Size                  32KB             32KB       512KB
Associativity (Ways)  2                4          8
Write Miss Allocate   N/A (read only)  No         Yes
Write Through/Back    N/A (read only)  Through    Back

The L2 is inclusive of the L1 dcache, but not the L1 icache. Also, as well as the 8-way set associative mode, the L2 can be configured as direct mapped (L2_ModeSetup1.RMT_Mode=11b).

CPUID

On x86 processors, the CPUID instruction can be used to determine the cache configuration. This is a fairly complex instruction, and there are actually several ways to call it to get cache information. The older method, EAX=2, is quite painful, as it requires software to keep a very large lookup table of values to decode the output. Newer CPUs support EAX=4, which gives much easier to parse output. AMD and some other non-Intel CPUs also return cache info for EAX=0x80000005 and EAX=0x80000006, though even Intel CPUs that support larger EAX=0x800000xx values do not support returning cache information this way (at least not my Nehalem Core i7).
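
As a rough illustration, here is a minimal sketch of walking the EAX=4 (deterministic cache parameters) sub-leaves using GCC's <cpuid.h> helpers. The bit field layout is the one documented by Intel, and the code assumes a CPU that actually supports leaf 4:

  #include <cpuid.h>
  #include <stdio.h>

  int main( void ){
      /* Walk the EAX=4 sub-leaves until the cache type field reads 0 (no more caches). */
      for( unsigned index = 0; ; ++index ){
          unsigned eax, ebx, ecx, edx;
          __cpuid_count( 4, index, eax, ebx, ecx, edx );

          unsigned type = eax & 0x1f;          /* 1=data, 2=instruction, 3=unified */
          if( type == 0 )
              break;

          unsigned level      = ( eax >> 5 ) & 0x7;
          unsigned line_size  = ( ebx & 0xfff ) + 1;
          unsigned partitions = ( ( ebx >> 12 ) & 0x3ff ) + 1;
          unsigned ways       = ( ( ebx >> 22 ) & 0x3ff ) + 1;
          unsigned sets       = ecx + 1;

          printf( "L%u type %u: %u bytes, %u ways, %u byte lines\n",
                  level, type, ways * partitions * line_size * sets,
                  ways, line_size );
      }
      return 0;
  }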


Coherency

When there are multiple processors in a system, then their caches need to communicate with each other in order to keep the whole system consistent. Two different caches should never disagree on the contents of a line. The mechanism by which the caches talk to each other is known as a cache coherency protocol.

There are two main categories of coherency protocols, snoopy (*) and directory based. Snoopy protocols work by every cache being able to see every memory transaction on a bus. Each cache watches (snoops) the other bus transactions to see if any of them affect data it currently holds. Snoopy protocols are fast for small numbers of bus agents, but they do not scale so well. Directory based protocols handle large scale systems better, but tend to be slower for small systems. Snoopy protocols are what we’ll cover here.

There are actually systems where no automatic consistency is guaranteed, but we’ll ignore them here.

(*) I really wanted to insert an image here of Snoopy the cartoon dog; the detective ones with him and a magnifying glass would have been perfect, but the copyright status of all the images floating around the net is very unclear : (

MESI

A common snoopy protocol is MESI. MESI has been used by Intel in its x86 processors since the original Pentium. It’s an acronym for the four states a cache line may be in under this protocol: Modified, Exclusive, Shared or Invalid. The following summary is shamelessly stolen from the “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A”:

M (Modified): the line is valid, the memory copy is out of date, and no copies exist in the caches of other processors. A write to this line does not go to the system bus.
E (Exclusive): the line is valid, the memory copy is valid, and no copies exist in the caches of other processors. A write to this line does not go to the system bus.
S (Shared): the line is valid, the memory copy is valid, and copies may exist in the caches of other processors. A write to this line causes the processor to gain exclusive ownership of the line.
I (Invalid): the line is not valid, and copies may exist in the caches of other processors. A write to this line goes directly to the system bus.

There are two basic ways a cache line can change state; an action by the processor who owns the cache, or an external bus event.

The following diagram shows the basic CPU initiated ways that a cache line can change state.

When the CPU performs a load and the cache line is currently invalid, the line becomes exclusive if no other cache holds a copy, otherwise it becomes shared.

When the CPU performs a store, the cache line will become modified. If the line was previously exclusive, then the transition to modified can be done without sending any commands to the bus. If the line was previously shared, then a “data claim” (DCLAIM) command needs to be sent to gain ownership. And if the line was previously invalid, then a “read with intent to modify” (RWITM) command both reads the data and gains ownership.

Now the previous diagram was just to show the basic ideas, it is not comprehensive. Some CPUs have additional cache control instructions that can cause other state transitions, for example an instruction to flush the cache could change a line from modified to invalid.

The next diagram shows the state transitions triggered by snooped bus commands.

When a cache has a line in either modified or exclusive state, and it snoops a read of this line, it intervenes on the bus, and gives a copy of the data to the reader. If the line was in a modified state, it is also then written out to memory. The line then transitions to the shared state.

When a cache has a line in any of the valid states (modified, exclusive or shared) and it snoops a command to invalidate the line, the state changes to invalid. If the line was modified, then the contents are first written out to memory. Here “invalidate” isn’t a single bus command, rather it is one of the effects of several different commands. The RWITM and DCLAIM commands mentioned earlier are two examples that fall under the “invalidate” banner in the previous diagram.
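
To tie the two diagrams together, here is a small sketch of those MESI transitions written as a state machine. It only models the basic loads, stores and snooped commands described above (no cache control instructions), so treat it as an illustration rather than a complete protocol implementation:

  typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

  /* CPU-initiated transitions for one cache line. */
  mesi_state on_cpu_load( mesi_state s, int other_cache_has_copy ){
      if( s == INVALID )  /* read miss: issue a bus read */
          return other_cache_has_copy ? SHARED : EXCLUSIVE;
      return s;           /* hits don't change the state */
  }

  mesi_state on_cpu_store( mesi_state s ){
      /* EXCLUSIVE -> MODIFIED needs no bus command,
         SHARED    -> MODIFIED issues DCLAIM to gain ownership,
         INVALID   -> MODIFIED issues RWITM (read with intent to modify). */
      return MODIFIED;
  }

  /* Transitions triggered by snooped bus commands from other agents. */
  mesi_state on_snoop_read( mesi_state s ){
      /* In MODIFIED or EXCLUSIVE we intervene and supply the data;
         a MODIFIED line is also written back to memory. */
      if( s == MODIFIED || s == EXCLUSIVE )
          return SHARED;
      return s;
  }

  mesi_state on_snoop_invalidate( mesi_state s ){
      /* e.g. another agent's RWITM or DCLAIM; a MODIFIED line is
         written back to memory before being dropped. */
      return INVALID;
  }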

There are many similar cache coherency protocols like MSI, MOSI, MOESI, MERSI, etc. Again these are acronyms for the possible cache line states.

MERSI

To confuse matters, there are actually at least two different MERSI protocols.

Sharma and Grochowski of Intel filed US patent 6857051 describing one of them. The version we’re interested in here, though, is the one used in IBM’s PowerPC processors, where the R stands for a “recent” state. The recent state acts very similarly to the shared state, and is really just an optimization. When a cache reads a line, if there is another cache that has the line in the recent state, then that other cache will intervene on the bus and provide the data. The cache that just read the line is then put into the recent state (it is the most recent cache to have read it), and the other cache reverts to shared. At most one cache can hold a line in the recent state at any moment. Shared state lines never provide data via intervention. The point of this is to reduce bus traffic, as you won’t get all the caches jumping in at once and requiring arbitration. The downside is that the recent line could get cast out, and then the next read will be serviced by memory, even if there are other caches that hold the line in the shared state. Some other IBM documentation refers to the recent state as “shared last” (SL).

Cache Coherency In The Cell

The Cell extends (the “recent” version of) MERSI further still, adding unsolicited modified (Mu) and tagged (T) states. IBM didn’t seem to want to follow the previous acronym trend, probably because there are so many these days and, as we have seen, they are starting to clash. Though I do like the sound of MERSIMuT : )

If at this point you’re thinking “but in the Cell, only the PPE has a cache, not the SPEs, so why all this?”, the answer is that the SPEs do actually each have a cache. The atomic unit in each SPE is in fact a small cache, and this cache coherency protocol allows for very fast access to shared data on the Element Interconnect Bus (EIB).

Again we need to do some investigative work to find out what these unsolicited modified and tagged states are, as the CellBE Handbook only gives their names. The “IBM (e)server POWER4 System Microarchitecture” paper describes a very similar set of states, which gives us the following:

M (Modified): the line is valid, the memory copy is out of date, and no copies exist in the caches of other processors. A write to this line does not go to the system bus.
E (Exclusive): the line is valid, the memory copy is valid, and no copies exist in the caches of other processors. A write to this line does not go to the system bus.
R (Recent): the line is valid, and copies may exist in the caches of other processors. A write to this line causes the processor to gain exclusive ownership of the line.
S (Shared): the line is valid, and copies may exist in the caches of other processors. A write to this line causes the processor to gain exclusive ownership of the line.
I (Invalid): the line is not valid, and copies may exist in the caches of other processors. A write to this line causes the processor to gain exclusive ownership of the line.
Mu (Unsolicited Modified): the line is valid, the memory copy is out of date, and no copies exist in the caches of other processors. A write to this line does not go to the system bus.
T (Tagged): the line is valid, the memory copy is out of date, and copies may exist in the caches of other processors. A write to this line does not go to the system bus.

BTW, if you’ve ever looked at some of the Cell performance counters and thought “WTF do these mean?”, then understanding the cache states should clear some of them up.

The PPE Cache Hierarchy Again

If we now look back at the PPE cache hierarchy, we can see the reason behind some of the design choices. Here’s the table from earlier:

                      L1 icache        L1 dcache  L2
Size                  32KB             32KB       512KB
Associativity (Ways)  2                4          8
Write Miss Allocate   N/A (read only)  No         Yes
Write Through/Back    N/A (read only)  Through    Back

And remember we said the L2 is inclusive of the L1 dcache, but not the L1 icache.

Since the L2 is inclusive of the L1 dcache, only the L2 needs to snoop the EIB. Fewer agents on the bus means faster combined snoop responses.

The L2 is write back to reduce traffic on the EIB, but the L1 dcache is write through. Because the L1 is write through, the L2 does not need to check whether the L1 holds a modified copy when the L2 line is exclusive, allowing faster transitions to shared.

As the L2 is inclusive of the L1 dcache, if a line that is in both needs to be evicted from the L2, then it must also be evicted from the L1 dcache. Because the L2 is a unified cache, requests from the L1 icache can therefore cause L1 dcache evictions. With only 2 and 4 ways in the L1 icache and dcache respectively, having 8 ways in the L2 greatly reduces how often these evictions occur.

Tips

Organise Your Data

This is by far the best thing you can do performance-wise. If you can get at the same data with fewer memory accesses, that is a Good Thing. With competing constraints, this can’t always be achieved perfectly, but it is worth aiming for.
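
The classic illustration (my own sketch, not from any particular codebase) is splitting hot and cold fields apart so a loop only pulls in the data it actually touches. Here is what that might look like for a hypothetical particle system where only position and velocity are needed every frame:

  /* Array-of-structures: each particle drags its cold fields (colour, name,
     spawn time, ...) into the cache alongside the hot ones. */
  struct particle_aos {
      float pos[3];
      float vel[3];
      float colour[4];
      float spawn_time;
      char  name[32];
  };

  /* Structure-of-arrays: the update loop below only streams through the two
     arrays it needs, so far more useful floats arrive per cache line. */
  struct particle_soa {
      float (*pos)[3];
      float (*vel)[3];
      float (*colour)[4];
      float *spawn_time;
      unsigned count;
  };

  static void integrate( struct particle_soa* p, float dt ){
      for( unsigned i = 0; i < p->count; ++i ){
          p->pos[i][0] += p->vel[i][0] * dt;
          p->pos[i][1] += p->vel[i][1] * dt;
          p->pos[i][2] += p->vel[i][2] * dt;
      }
  }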

Out Of Order Execution and In Order Execution

One of the big benefits of out of order execution is hiding cache misses. While waiting for the results of a load, the processor can reorder instructions so that it does as much work as possible without blocking, effectively hiding most if not all of the latency. Reducing dependency chains helps the processor do more reordering; my first #AltDevBlogADay post describes this process (/2011/05/25/instruction-level-parallelism/), though not in the context of cache misses.

In order cores don’t have it so easy here, but they don’t block immediately on the load either. An in order core only needs to block on the first instruction that uses the value that was loaded. This can be seen in a sampling profiler, where you get an unusually high sample count on an unassuming instruction like, say, an XOR. If you look back to where that instruction’s inputs have come from, one of them could be a load.

This in order execution behaviour can be used to our advantage. If the code needs to do several scattered memory reads, then doing them all up front can overlap their costs; the processor will execute each of the loads before it needs to block. But be careful of compiler reordering. Generally compilers will schedule assuming an L1 hit, since they can’t really know any better. Mike Acton mentions a handy GCC-specific trick that can be used to prevent GCC from moving the loads away from where you want them.
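
As a sketch of the idea (the structure and the compiler-barrier trick below are mine, just to illustrate the shape of it, not the specific trick Mike describes), issue all of the scattered loads first, then do the work:

  struct node { float value; struct node* next; };

  float sum_three( const struct node* a, const struct node* b, const struct node* c ){
      /* Issue the three independent loads up front; an in order core only
         blocks at the first add that actually consumes one of the values, so
         the three misses overlap instead of being paid one after another. */
      float va = a->value;
      float vb = b->value;
      float vc = c->value;

      /* An empty asm with a memory clobber is one GCC-specific way of
         discouraging the compiler from sinking the loads back down to
         their uses. */
      asm volatile( "" ::: "memory" );

      return va + vb + vc;
  }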

Prefetching

Prefetching data before it is required can certainly be a big help, but it ain’t no silver bullet either. Prefetching in loops often needs a bit of tuning to get the right distance to prefetch ahead. Especially on fixed hardware (eg, consoles), you should always profile when you’re changing prefetching. Without care, prefetching can be close to pointless, and occasionally it can actually make things slower! Profiling is a more difficult problem for PC code, which can run on vastly differing machines.
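
A minimal sketch of the usual loop-prefetch shape, using GCC’s __builtin_prefetch; the distance of 8 lines ahead is a made-up starting point, exactly the kind of number that needs profiling per platform:

  enum { LINE_SIZE = 128, PREFETCH_AHEAD = 8 * LINE_SIZE };

  float sum_with_prefetch( const float* data, unsigned count ){
      float total = 0.f;
      for( unsigned i = 0; i < count; ++i ){
          /* Hint that we'll want this address soon; it's only a hint, and
             prefetching past the end of the array is harmless but wasteful. */
          __builtin_prefetch( (const char*)&data[i] + PREFETCH_AHEAD );
          total += data[i];
      }
      return total;
  }

A tuned version would also only issue one prefetch per cache line rather than one per element.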

Data streams

Briefly mentioning the PPC data streams here for completeness, but personally I’ve never had much success with these! Some special encodings of the DCBT instruction can be used to start and stop automatic loading of data into the L2. In several cases where I’ve tried using data streams, they just made things slower. Obviously if they were never useful they wouldn’t exist; I guess I just haven’t encountered/noticed those cases yet.

DCBZ

This is an interesting PPC instruction. If you know that you are going to overwrite an entire cache line (say as part of a large memcpy()), then DCBZ will put the cache line into the modified state without ever loading it, then zero its contents. Usually when a cache line transitions from shared to exclusive, this is done with a DCLAIM bus command to gain ownership. DCBZ also uses DCLAIM, but the line can previously be in the invalid state.
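
As a sketch (GCC inline asm on PPC, with a hypothetical fill_line() helper; alignment and size handling are glossed over), zeroing a whole line before overwriting it might look something like this:

  enum { LINE_SIZE = 128 };

  /* Zero one 128-byte-aligned cache line with dcbz, putting the line into the
     modified state without ever reading it from memory, then overwrite it.
     Assumes dst is cache line aligned and the whole line really is filled. */
  static void fill_line( float* dst, const float* src ){
      __asm__ volatile( "dcbz 0,%0" : : "r"( dst ) : "memory" );
      for( unsigned i = 0; i < LINE_SIZE / sizeof(float); ++i )
          dst[i] = src[i];
  }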

Wrap Up

Ok, it may be slightly controversial to have a discussion of caches without a single mention of address translation or Translation Lookaside Buffers (TLBs). I think address translation deserves its own post entirely, so I’ll cover that stuff next time.