Everyone on the performance coding front knows it: the cache will make you or break you. There is no need to stress it any further; if your data doesn’t fit in the cache, and your access patterns don’t allow for high cache hit rates, you are going to suffer terrible performance.

One of the major reasons for converting to DOD (data-oriented design) -an almost “hyped” term these days- is that with DOD you get to focus on arranging your data in cache-efficient ways. However, we mostly focus on the data cache, and we tend to forget about the instruction cache.

The role of the instruction cache is to accelerate the fetching of instructions from main memory, exactly the way the data cache accelerates the fetching of data. After all, your code is data! The difference is that it is much harder to optimize for the instruction cache than for the data cache. This is logical if you consider that we can freely arrange data in any way we can devise; we don’t have that freedom with code layout. This makes most of us turn away from the problem and ignore it, focusing on the data cache and hoping all will be fine. Others don’t bother at all, or simply stop at minimizing code size. However, it turns out that a cache miss on an instruction fetch is just as costly as a regular data cache miss.

Don’t get me wrong, minimizing code size helps a lot. And yes, inlining every function will eventually hurt you. So let’s suppose we have carefully minimized code size; what else is there that affects instruction cache performance? I always like to start with an example, so let’s take a look at the code below:

void updateEntities(entity_t *entities, unsigned int numOfEntities) {
    for (unsigned int i = 0; i < numOfEntities; i++) {
        entity_t *p = &entities[i];
        /* Dispatch to the update function matching this entity's type. */
        switch (p->type) {
        case 0:
            updateEntityA(p);
            break;
        case 1:
            updateEntityB(p);
            break;
        }
    }
}

Consider now that we have two kinds of entities, each with its own update function. The above code runs over an array of entities and, depending on the entity type, calls the appropriate update function. Assuming that the entities of the two types are randomly distributed in the array, we can easily see that execution will alternate between the functions updateEntityA() and updateEntityB(). Now suppose that only one of the two functions can fit in the instruction cache at a time. This means that for every entity there is a 50% probability we will have to load instructions from main memory. If that doesn’t sound bad enough, try thinking of it with ten different update functions alternating at random and running on a device like the iPhone.

We can easily do better than that. What would happen if we grouped together entities of the same type? We could do that by having one array for every entity type, or by having a single array that we keep sorted on the entity type. This changes the code access pattern from:

updateEntityA(), updateEntityB(), updateEntityA(), updateEntityC(), updateEntityB(), ...

to:

updateEntityA(), ..., updateEntityB(), ..., updateEntityC(), ...

Now the CPU will be able to update all entities of type A with the code resident in the instruction cache, then all entities of type B the same way, and so on. We have effectively reduced the instruction cache misses from the order of the number of entities to the order of the number of entity types.
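To make the grouping concrete, here is a minimal sketch of the single-sorted-array variant. It assumes entity_t carries the numeric type field used in the switch above; compareByType() and updateEntitiesSorted() are names I made up for the sketch:

    #include <stdlib.h>

    /* Order entities so that all entities of the same type are contiguous. */
    static int compareByType(const void *a, const void *b) {
        const entity_t *ea = (const entity_t *)a;
        const entity_t *eb = (const entity_t *)b;
        return (ea->type > eb->type) - (ea->type < eb->type);
    }

    void updateEntitiesSorted(entity_t *entities, unsigned int numOfEntities) {
        /* Group same-type entities together, then run the usual update loop.
           Each update function now executes over a contiguous run of entities,
           staying resident in the instruction cache for the whole run. */
        qsort(entities, numOfEntities, sizeof(entity_t), compareByType);
        updateEntities(entities, numOfEntities);
    }

In practice you would not re-sort every frame; you would keep the array sorted as entities are added and removed, or maintain one array per type, so the grouping cost is paid incrementally.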

The point to be taken from the above is that there is performance to be gained by organizing data so that, when it is processed, the code follows similar paths; whether that means consistently taking the same branches, or repeatedly dispatching the same implementation of a parent class’s virtual method rather than bouncing between the custom implementations of different children. Performance-critical code paths should be measured and optimized for the instruction cache along with the data cache.
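As an illustration of the same idea applied to dynamic dispatch, here is one possible shape for the one-array-per-type layout; the entity_bucket_t struct and updateBuckets() are hypothetical names for this sketch, with a plain function pointer standing in for a virtual method:

    /* One bucket per entity type: every element shares one update function. */
    typedef struct {
        entity_t *items;            /* entities of a single type */
        unsigned int count;
        void (*update)(entity_t *); /* e.g. updateEntityA or updateEntityB */
    } entity_bucket_t;

    void updateBuckets(entity_bucket_t *buckets, unsigned int numBuckets) {
        for (unsigned int b = 0; b < numBuckets; b++) {
            void (*update)(entity_t *) = buckets[b].update;
            for (unsigned int i = 0; i < buckets[b].count; i++) {
                /* Same call target for the whole bucket: the function's code
                   stays hot in the instruction cache across iterations. */
                update(&buckets[b].items[i]);
            }
        }
    }

Within each bucket every iteration calls the same target, so both the branch predictor and the instruction cache see a stable, repetitive pattern.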

Optimizing for the instruction cache can be a difficult balancing act. Since we don’t have the ability to lay out the code exactly, we must try to understand what the compiler is doing under the hood and do lots of trials and benchmarks. Inspecting the machine code emitted by the compiler is also crucial. Unroll loops and inline functions with caution; I usually do it incrementally while monitoring the performance. However, keeping the code small is not going to help you if you run all over your codebase doing “random” stuff. Keep it in order, and you shall receive!