It seems that everyone is talking about data oriented design nowadays, which is fantastic of course. I have seen it applied with great success to the areas of the game engine where it is naturally suited. High throughput rendering systems are the main one, where the data typically already arrives in a stream format. Rendering coders are used to thinking in terms of data flow and throughput, minimising branching, and writing code which operates on streams of data with no side effects, mainly through HLSL and other domain specific languages.

I always get a bit stuck, however, when I try to think how to apply it to the typical game code we see in the wild. For one thing, the data is rarely well defined up front: requirements change frequently and mutate into things the original author never intended. The other major problem is that there are often thousands of man hours invested in the code base, and change can be difficult. Hopefully you will see that there are big performance gains available without major coding style changes.

This post is aimed at the programmer who is already familiar with data oriented design and why it is useful. It is not an introduction; more a validation, with some real world examples and figures to back it up.

The Test

I decided to write a rudimentary flocking system. Flocking behaviours typically involve searching for nearby neighbours and updating each agent based on what those neighbours are up to. In the end I settled on a subset of the flocking behaviour which I know is in no way representative of real world game code, not least because it is fairly branchless. I would like to revisit this topic and expand on it; however, the results I have certainly look promising and exhibit the performance gains I was expecting. Using a few simple design rules I was able to double the performance on the Xbox 360 with fairly minimal code changes!

I broke the test down into two main parts. In the first part, each agent iterates over all other agents doing a distance check; if the other agent is close enough it is considered a neighbour. The agent collects all the neighbours and averages their positions, which is used as the target point. The second step simply steers the agent towards the target point. A real flocking system has many more elements, but they basically boil down to those two steps: collecting data about the world, then reasoning and acting on it.
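The two steps above can be sketched roughly as follows. This is a minimal illustration, not the actual test code; the Vec3 type, the function names and the steering rule are all my own assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative minimal vector type; not from the original source.
struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float lengthSq(Vec3 v)    { return v.x * v.x + v.y * v.y + v.z * v.z; }

struct Agent { Vec3 position; Vec3 velocity; Vec3 target; };

// Step 1: each agent averages the positions of all agents within
// neighbourRadius and stores that average as its target point.
void SearchForNeighbours(std::vector<Agent>& agents, float neighbourRadius)
{
    const float radiusSq = neighbourRadius * neighbourRadius;
    for (std::size_t i = 0; i < agents.size(); ++i)
    {
        Vec3 sum = {0, 0, 0};
        int count = 0;
        for (std::size_t j = 0; j < agents.size(); ++j)
        {
            if (i == j) continue;
            if (lengthSq(sub(agents[j].position, agents[i].position)) < radiusSq)
            {
                sum.x += agents[j].position.x;
                sum.y += agents[j].position.y;
                sum.z += agents[j].position.z;
                ++count;
            }
        }
        if (count > 0)
            agents[i].target = {sum.x / count, sum.y / count, sum.z / count};
    }
}

// Step 2: steer each agent towards its target point.
void UpdateAgents(std::vector<Agent>& agents, float speed, float dt)
{
    for (Agent& a : agents)
    {
        Vec3 toTarget = sub(a.target, a.position);
        float len = std::sqrt(lengthSq(toTarget));
        if (len > 0.0001f)
        {
            a.velocity = {toTarget.x / len * speed,
                          toTarget.y / len * speed,
                          toTarget.z / len * speed};
        }
        a.position.x += a.velocity.x * dt;
        a.position.y += a.velocity.y * dt;
        a.position.z += a.velocity.z * dt;
    }
}
```

Note the O(n²) neighbour search; a real system would use some spatial partitioning, but for a cache behaviour test the brute force loop is exactly what we want to measure.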

My initial implementation used a class with position, velocity, target position and age. To make the test more applicable I also added a 192 byte payload to the class (in reality it might be a couple of matrices, health, weapons, whatever). This pushes the class over the typical cache line size of 128 bytes, one of the killers of monolithic structures. In addition, I wanted to simulate the fact that agents are typically created at various times, which means they end up scattered around memory, again increasing cache misses and reducing performance. This formed my baseline measurement.
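The monolithic class might look something like this. The field list and the 192 byte payload come from the description above; the layout and names are my guess.

```cpp
// Illustrative guess at the baseline monolithic agent class. The fields and
// the 192-byte payload size are from the post; names and layout are assumed.
struct Vec3 { float x, y, z; };

struct FlockingAgent
{
    Vec3  position;       // 12 bytes
    Vec3  velocity;       // 12 bytes
    Vec3  targetPosition; // 12 bytes
    float age;            //  4 bytes
    char  payload[192];   // simulated extra state: matrices, health, weapons...
};

// 232 bytes in total: larger than a typical 128-byte cache line, so even
// touching just the position drags the whole object through the cache.
static_assert(sizeof(FlockingAgent) > 128,
              "agent should exceed a 128-byte cache line for this test");
```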

The next evolution was to lay the classes out in order in a contiguous block of memory. This will improve cache performance somewhat and is commonly applied in games either manually or with a pool allocator.

The step after this was to change the data from the Array of Structures layout to a Structure of Arrays format. I created contiguous arrays of positions, velocities, etc. Then I changed the update code to look the elements up using an index and that was pretty much it. As a final test I created an SSE SIMD version of the Search for Neighbours on top of the stream layout for the PC platform.
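A sketch of the SoA change described above: the per-field arrays replace the array of objects, and the update indexes into just the streams it needs, so the cache lines it pulls in are full of useful data. Container and function names here are illustrative.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Structure of Arrays: one contiguous stream per field instead of an
// array of monolithic objects. The cold payload lives in its own stream
// (omitted here) and never pollutes the cache during this update.
struct AgentStreams
{
    std::vector<Vec3>  position;
    std::vector<Vec3>  velocity;
    std::vector<Vec3>  targetPosition;
    std::vector<float> age;
};

// Neighbour search over the position stream only; element i's results are
// written to the targetPosition stream at the same index.
void UpdateTargets(AgentStreams& s, float radiusSq)
{
    const std::size_t n = s.position.size();
    for (std::size_t i = 0; i < n; ++i)
    {
        Vec3 sum = {0, 0, 0};
        int count = 0;
        for (std::size_t j = 0; j < n; ++j)
        {
            if (i == j) continue;
            float dx = s.position[j].x - s.position[i].x;
            float dy = s.position[j].y - s.position[i].y;
            float dz = s.position[j].z - s.position[i].z;
            if (dx * dx + dy * dy + dz * dz < radiusSq)
            {
                sum.x += s.position[j].x;
                sum.y += s.position[j].y;
                sum.z += s.position[j].z;
                ++count;
            }
        }
        if (count > 0)
            s.targetPosition[i] = {sum.x / count, sum.y / count, sum.z / count};
    }
}
```

The update code barely changes; `agent.position` simply becomes `position[i]`, which is what makes this refactor so cheap relative to the gain.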

The Results

The timings were performed with 1000 agents iterating for 100 ‘frames’. Each test was repeated 5 times and the average taken. This table shows the speedup achieved by each method over the baseline timing.
                         Search                       Update
                         contiguous  stream  simd     contiguous  stream
Atom N280 @ 1.6GHz       x 1.5       x 2.6   x 6.7    x 1.0       x 2.0
Core2 Duo @ 2.33GHz      x 1.4       x 2.5   x 3.9    x 1.0       x 3.8
Core i7 920 @ 2.67GHz    x 1.5       x 2.1   x 3.2    x 1.1       x 1.2
Xbox 360                 x 1.1       x 2.0   n/a      x 1.0       x 1.5

Unfortunately, with work and home commitments I was unable to find the time to write a SIMD version for the Xbox 360. I was also unable to run the tests on PS3, although I would expect the PPU results to be similar to the 360. The really interesting one would be the SPU, though. As the data needs to be streamed to the SPUs, the stream version of the code is ideal: you can DMA just the data that is required (position, velocity) and ignore the other streams, maximising the bandwidth. This makes the code easier, and therefore more likely, to be ported to SPU, freeing up the main CPU.

CPU Trace

I recorded a couple of CPU traces on the Xbox 360 to highlight the differences between the original and the stream version. The instruction counts were virtually identical, as was the cost of the float compares. The only real difference between the two is summed up by these tables.

The actual instructions executed are very similar, with the main expense being the calls to the float compare when checking for neighbours. The difference is the cache usage: the stream version has near perfect cache usage, whereas the original version uses very little of the data the cache retrieves from main memory.

Wrap Up

I hope these figures help inspire you to really think about how you can lay your data out in a more CPU and cache friendly manner. Game code is often written in a very object oriented way and the authors do not consider memory performance, because a lot of the time the cost is insignificant compared to that of other systems.

As these results have shown, I got a small gain simply by keeping the objects in a contiguous block of memory. By storing the data in a SoA format the performance doubled compared to the original. It was easy to do, and keeping the data like that would make other operations faster as well. It does require more bookkeeping code to maintain the contiguous block when entities are added or removed, but the overhead is almost certainly worth it.
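One common way to do that bookkeeping, sketched below under my own naming, is "swap and pop": on removal, move the last element of each stream into the freed slot. Removal stays O(1) and the streams stay contiguous, at the cost of not preserving ordering.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// A sketch of contiguous-stream bookkeeping; names are illustrative.
// Only two streams are shown, but every stream must be kept in lockstep.
struct AgentStreams
{
    std::vector<Vec3> position;
    std::vector<Vec3> velocity;

    std::size_t Add(Vec3 pos, Vec3 vel)
    {
        position.push_back(pos);
        velocity.push_back(vel);
        return position.size() - 1;
    }

    // O(1) "swap and pop" removal: preserves contiguity but not ordering.
    // Any stored index referring to the moved (last) element must be fixed up.
    void Remove(std::size_t index)
    {
        position[index] = position.back(); position.pop_back();
        velocity[index] = velocity.back(); velocity.pop_back();
    }
};
```

If other systems hold indices into the streams, an extra indirection table (handle to index) is usually needed to patch up the moved element.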

The other option would be to split the data into separate hot and cold structures: hot holds the things used frequently (position, velocity) and cold the less frequently accessed data. I find this method more complicated than simply going for a SoA layout across the whole class, although it does have some cache advantages if you need to skip over elements.

Finally, SIMD can be a good win, but it makes the code less flexible and harder to maintain. I would only use it if you know the code is approaching final and you need to squeeze a bit more speed out.
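For the curious, here is a sketch of why the SoA layout is what makes the SSE version possible: with separate x, y and z arrays, four distance tests can be done per instruction. This is not the test's actual SIMD code; it assumes SSE on GCC/Clang (`__builtin_popcount` is a compiler builtin), a count that is a multiple of 4, and illustrative names.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Count how many of `count` points (given as separate x/y/z streams) lie
// within sqrt(radiusSq) of the point (px, py, pz), four at a time.
int CountWithinRadius(const float* xs, const float* ys, const float* zs,
                      int count, float px, float py, float pz, float radiusSq)
{
    const __m128 cx = _mm_set1_ps(px);
    const __m128 cy = _mm_set1_ps(py);
    const __m128 cz = _mm_set1_ps(pz);
    const __m128 r2 = _mm_set1_ps(radiusSq);
    int neighbours = 0;
    for (int i = 0; i < count; i += 4) // assumes count is a multiple of 4
    {
        __m128 dx = _mm_sub_ps(_mm_loadu_ps(xs + i), cx);
        __m128 dy = _mm_sub_ps(_mm_loadu_ps(ys + i), cy);
        __m128 dz = _mm_sub_ps(_mm_loadu_ps(zs + i), cz);
        __m128 d2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(dx, dx),
                                          _mm_mul_ps(dy, dy)),
                               _mm_mul_ps(dz, dz));
        // Four comparisons at once; movemask packs the results into 4 bits.
        int mask = _mm_movemask_ps(_mm_cmplt_ps(d2, r2));
        neighbours += __builtin_popcount(mask); // GCC/Clang builtin
    }
    return neighbours;
}
```

Trying to vectorise the AoS version would mean shuffling x, y and z out of interleaved structs first, which usually eats the gain; with streams the loads fall out naturally.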

Source code is available at