In (very) broad terms, there are two fundamental types of data structure – dense structures and sparse structures. Dense structures are those where all of the useful data is packed together in some form – for example, an array where items are always placed linearly, or a linked-list (note that in this instance “packed” does not necessarily mean “contiguous in memory”, but rather “contiguous in access”). When working with dense structures, you never need to see anything except the data you want. By comparison, a sparse structure has holes in it, like an array where items are written into elements at random – the useful items are spread out and in order to get to them you have to read in the blank bits as well. Common examples of sparse data structures in games are object tables and spatially-partitioned map data.
Different implementations of the two approaches have their strengths and weaknesses, but to make a sweeping generalisation: dense structures are hard to modify but fast to iterate through, whereas sparse structures are quick to modify but slower to iterate. The reason for this is, obviously enough, that walking a sparse structure involves touching all of the “dead” data as well as the live objects, with the associated penalties for memory access and cache misses. This is particularly painful when dealing with classic object-orientated designs where the “dead/alive” flag is part of the object structure, and therefore each flag is sizeof(Object) bytes apart. This means that in any case where your object size is larger than the data cache line width, the code will incur a cache miss for every single object, even though it is only fetching a single bit of information!

The key to speeding this process up, therefore, is to move that critical piece of information – the dead/alive flag – out of the objects and into a separate parallel structure. Since only one bit is needed per object, this can simply be a packed bitfield – 32 objects’ worth of flags in a 32-bit value being the usual pattern. By converting this:
class MyObject
{
    bool mAlive;
    Matrix44 mMatrix;
    // …etc…
};

MyObject ObjectList[128];
…into this…
class MyObject
{
    Matrix44 mMatrix;
    // …etc…
};

MyObject ObjectList[128];
U32 ObjectAliveFlags[128 / 32];
…the access times for the case where you read only the alive flag (i.e. the object is dead and gets skipped) go through the floor. There is a small penalty to be paid in the case where the object is alive, as previously the read of mAlive pulled the first chunk of the object’s data into the cache “for free”, but in practice this is almost completely irrelevant. To be more concrete, on a system with 32-byte cache lines the characteristics look like this:
| | mAlive as part of the class | mAlive as a separate bitfield |
|---|---|---|
| Min cache misses when all objects are dead | 128 | 1 |
| Min cache misses when all objects are alive | 128 | 129 |
| Min cache misses when n objects are alive | 128 | n + 1 |
(assuming that the processing of objects is not sufficiently memory-intensive as to evict the flags from the cache – note that all 128 flags pack into 16 bytes, so the entire bitfield fits within a single 32-byte cache line, which is where the “1” comes from)
As you can see, whilst the worst case gets ever so slightly worse (one extra cache miss, or roughly 0.8%), the best case is massively improved, and in general the cost now scales linearly with the number of objects which are actually alive.
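(As an aside, here’s a minimal sketch of how the packed flags might be tested and set – the helper names are mine rather than anything from a real engine, and U32 is assumed to be a typedef for a 32-bit unsigned integer:)

inline bool IsObjectAlive(const U32* flags, int index)
{
    // index >> 5 selects which U32 holds the flag;
    // index & 31 selects the bit within it (LSB = lowest index)
    return ((flags[index >> 5] >> (index & 31)) & 1) != 0;
}

inline void SetObjectAlive(U32* flags, int index, bool alive)
{
    U32 bit = 1u << (index & 31);
    if (alive)
        flags[index >> 5] |= bit;
    else
        flags[index >> 5] &= ~bit;
}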
At this point, some of you may be thinking “hey, isn’t this you sneaking in that new-fangled Data Oriented Design under a different name?”… to which I would have to put up my hands and say that you’re absolutely right. Essentially this is simply one step in moving from an AoS (Array of Structures) to an SoA (Structure of Arrays) model. However, what’s cool about this particular example is that (a) it’s incredibly common and (b) you can get a massive benefit, as seen above, without doing any refactoring beyond moving that one bool. If you already have an engine or aren’t ready to drink the SoA kool-aid right now, then it’s a great way to get a lot of the benefit for minimal time/effort investment.

Using bitfields in this way opens up a couple more interesting tricks, too. Consider the case of walking through the array looking for items which are alive (equally, you might be allocating a new item, which is simply the same thing but looking for dead objects). Using a packed bitfield minimises the memory overhead, and it can also reduce the CPU’s workload very simply. For example, consider the following iterator:

U32* alive_flags_ptr = ObjectAliveFlags;
U32 current_flags;
int index = 0;

while (index < 128)
{
    if ((index & 31) == 0)
    {
        // Load the next 32 objects' worth of flags
        current_flags = *(alive_flags_ptr++);
    }
    if (current_flags == 0)
    {
        // No live objects left in this word – skip straight
        // to the start of the next 32-object block
        index = (index + 32) & ~31;
    }
    else
    {
        if (current_flags & 1)
        {
            // Object is alive, do something with it...
        }
        current_flags >>= 1;
        index++;
    }
}
…this uses a couple of interesting tricks. Firstly, by shifting current_flags down each iteration, the current object can be tested simply by ANDing with 1 – pretty much guaranteed to be a very fast operation on any CPU (the assumption is made that when the bitfield was written, the LSB represents the lowest index). Secondly, when current_flags is 0, the iterator knows that there are no live objects left in the current “chunk” of bits, and skips straight to the start of the next 32-object U32. If the entire U32 is empty, then current_flags will be 0 from the word go and the skip happens immediately. Some CPUs also provide a “Count Leading Zeros” or “Count Trailing Zeros” instruction, which allows the further optimisation of skipping over blocks of zeros as they appear in the middle of the bitfield, making the traversal faster still.
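As a rough illustration of that last point, here is a minimal sketch of the “Count Trailing Zeros” variant, assuming the GCC/Clang __builtin_ctz intrinsic (MSVC’s _BitScanForward is the rough equivalent) – one way it might look, rather than production code:

for (int word = 0; word < 128 / 32; word++)
{
    U32 bits = ObjectAliveFlags[word];
    while (bits != 0)
    {
        // Jump straight to the lowest set bit in this word
        int bit = __builtin_ctz(bits);
        int index = (word * 32) + bit;
        // Object at 'index' is alive, do something with it...
        bits &= bits - 1;    // clear the lowest set bit and continue
    }
}

Note that this version never touches dead objects at all – each loop iteration lands directly on a live one.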
If you have a particularly large data structure, it may be worth having multiple bitfields – for example, one game I worked on recently stored separate bitfields for each type of game object (as well as a global “is this object alive” bitfield), allowing very rapid iteration over subsets such as “all of the bullets in the level”. The fact that bitfields can have boolean operators applied to them can be very useful, too – for example, iterating over “all bullets AND lasers” or “all enemies NOT tanks” (see the sketch at the end of this post). For very large (and very sparse) structures, it can occasionally even be worth having a cascading hierarchy of bitfields, where each bit of the second-level bitfield represents 32 bits of the original (0 if that word is completely empty, 1 if it contains anything).

One notable gotcha with using bitfields in this manner is that if the number of bits being stored is not a multiple of 32, it will need to be rounded up – and with that comes the ancillary gotcha that care must be taken not to iterate over the extra “padding” bits (depending on circumstances, it may be enough to simply set the padding bits to all zeros or all ones, thereby causing them to be skipped, but this can get messy if the code searches for both alive and dead objects).

In summary – bitfields can be an incredibly useful tool when dealing with sparse data structures, as they are both very memory-efficient and fast to process. And applying even a little bit of DOD lateral thinking to problems can yield surprisingly effective results!

(edit: the original version of the iteration sample had a bug where “skip remaining items” would skip the wrong number of items if the entire word was blank – this version has it fixed)
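As promised above, here’s a rough sketch of combining per-type bitfields with boolean operators. The per-type flag arrays and their names are hypothetical, not from any real codebase – the point is simply that bitwise operations over the packed words select the matching subset 32 objects at a time:

// Hypothetical per-type bitfields, maintained alongside the alive flags
U32 ObjectIsBulletFlags[128 / 32];
U32 ObjectIsLaserFlags[128 / 32];
U32 ObjectIsEnemyFlags[128 / 32];
U32 ObjectIsTankFlags[128 / 32];

for (int word = 0; word < 128 / 32; word++)
{
    // Live objects which are bullets or lasers ("all bullets AND lasers")
    U32 bullets_and_lasers = ObjectAliveFlags[word] &
        (ObjectIsBulletFlags[word] | ObjectIsLaserFlags[word]);

    // Live enemies which are not tanks ("all enemies NOT tanks")
    U32 enemies_not_tanks = ObjectAliveFlags[word] &
        ObjectIsEnemyFlags[word] & ~ObjectIsTankFlags[word];

    // ...iterate over the set bits of either mask exactly as before...
}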