Obligatory Introductory Parable

I really like sushi; it’s tasty and convenient. I like the immediacy of walking into a sushi restaurant, complete with conveyor belt, taking a seat and grabbing something fresh and delicious from the belt without blowing my entire lunch hour. Having said that, what I really wouldn’t like is to be a member of staff in a sushi restaurant, especially if it was my job to show the diners to their seats, and here’s why…

A group of three diners walk in and ask to be seated. Delighted at having some custom, you kindly oblige, showing the diners to their seats. No sooner have they sat down and helped themselves to some tasty-looking ‘Tekka Maki’ than the door opens again and four more diners walk in! Wow, you guys are on a roll (see what I did there?). Your restaurant now looks like this…

Since it’s lunchtime, the restaurant quickly fills up to capacity. Finally, after eating all they could and settling the bill, the first party to arrive (the party of three) leaves, and a group of two walks in; you offer them the newly vacant seats. This happens a few more times until your restaurant looks like this…

Finally, a group of four diners walk in and ask to be seated. Ever the pragmatist, you’ve been carefully keeping track of how many empty seats you have left, and it’s your lucky day: you’ve got four spare seats! There is one snag though; these diners are the social type and would like to be seated together. You look around frantically, but while you have four empty seats, you can’t seat the diners together! It would be very rude to ask your existing customers to move mid-meal, which sadly leaves you no option but to turn your new customers away, probably never to return. This makes everyone very sad.

If you’re sat there wondering how this little tale relates in the slightest to games development, then read on. In programming we have to manage memory. Memory is a precious resource which must be carefully managed, much like the seats in our metaphorical sushi restaurant. Every time we allocate memory dynamically, we’re reserving memory from something called the “heap”. In C and C++ this is typically done through the use of the malloc function and the new operator respectively. To continue the somewhat fishy analogy (last one, I promise!), this is like our intrepid groups of diners asking to be seated in our sushi restaurant. The real shame, though, is that what happened in our hypothetical scenario happens in the context of memory too, and the results are much worse than a couple of empty tummies. It is called fragmentation, and it is a nightmare!

What’s wrong with malloc and new?

Sadly, the rest of the discussion won’t have such a fishy flavour to it, as this post is going to talk about malloc and new and why they have a very limited place in the context of embedded systems (such as games consoles). Fragmentation is just one facet of the problems caused by dynamic memory allocation, and perhaps the most serious, but before we can come up with some alternatives we should take a look at all of the problems:

1. malloc and new try to be all things to all programmers…
They will just as happily allocate you a few bytes as a few megabytes. They have no concept of what the data they’re allocating for you is, or what its lifetime is likely to be. Put another way, they don’t have the bigger picture that we have as programmers.

2. Run-time performance is relatively bad…
Allocations from the standard library functions or operators typically require descending into the kernel to service the allocation requests (this can involve all sorts of nasty side effects on your application’s performance, including flushing of translation lookaside buffers, copying blocks of memory, and so on). For this reason alone, dynamic memory allocation can be very expensive in terms of performance. The cost of the free or delete operations in some allocation schemes can also be expensive, as in many cases a lot of extra work is done to try to improve the state of the heap for subsequent allocations. “Extra work” is a fairly vague term, but it can mean the merging of memory blocks, and in some cases it can mean walking the entire list of allocations your application has made! Certainly not something you’d want to be wasting valuable processor cycles on if you can avoid it!

3. They cause fragmentation of your heap…
If you’ve never been on a project that has suffered from the problems associated with fragmentation, then count yourself very lucky; the rest of us know that heap fragmentation can be a complete and utter nightmare to address.

4. Tracking dynamic allocations can be tricky…
Dynamic allocation comes with the inherent risk of memory leaks. I’m sure we all know what memory leaks are, but if not, have a read of this. Most studios try to build some tracking infrastructure on top of their dynamic allocations to track what memory is in play and where.

5. Poor locality of reference…
There is essentially no way of knowing where the memory you get back from malloc or new will be in relation to any of the other memory in your application. This can lead to us suffering more expensive cache misses than we need to, as we end up dancing through memory like Billy Elliot on Ecstasy.

So what is the alternative? The idea behind this post is to provide you with the details (and some code!) for a few different alternatives that you can use in place of malloc and new to help combat the problems we’ve just discussed.

The Humble Linear Allocator

As the name of this section suggests, this allocator is certainly the simplest of all those I’m going to present, although, truth be told, they’re all simple (and that’s part of the beauty). The linear allocator essentially assumes that there is no fine-grained de-allocation of allocated memory resources, and proceeds to make allocations from a fixed pool of memory one after the other in a linear fashion. Check out the diagram below.

A good example of where a linear allocator is exceedingly useful is found in nearly all SPU programming. The persistence of data in the local store is not important beyond the scope of the currently executing job, and in many cases the amount of data one brings into the local store (at least data that needs some degree of variance in its size) tends to be fairly restricted. However, don’t be fooled; that’s far from its only application. Here’s an example of how one might go about implementing a simple linear allocator in C.

#include <stdint.h>
#include <stddef.h>

/** Header structure for linear buffer. */
typedef struct LinearBuffer {
    uint8_t *mem;       /*!< Pointer to buffer memory. */
    uint32_t totalSize; /*!< Total size in bytes. */
    uint32_t offset;    /*!< Offset of the next free byte. */
} LinearBuffer;

/* non-aligned allocation from linear buffer. */
void* linearBufferAlloc(LinearBuffer* buf, uint32_t size) {
    if(!buf || !size)
        return NULL;

    const uint32_t newOffset = buf->offset + size;
    if(newOffset <= buf->totalSize) {
        void* ptr = buf->mem + buf->offset;
        buf->offset = newOffset;
        return ptr;
    }
    return NULL; /* out of memory */
}

It is of course possible to support aligned allocations by applying the required bitwise operations to the offset during the course of the allocation. This can be incredibly useful, but be aware that, depending on the size of the data you’re allocating from your buffer (and in some cases the order in which those allocations are made), you may find that you get some wasted space in the buffer between allocations. This wasted space is typically okay for alignments of a reasonable size, but can get prohibitively wasteful if you are allocating memory which requires a much larger alignment, for example 1MB. The ordering of your allocations from linear allocators can have a drastic effect on the amount of wasted memory in these types of situation.

To reset the allocator (perhaps at the end of a level), all we need to do is set the value of offset to zero. As with all allocations, clients of the allocator need to ensure they’re not hanging on to any pointers that you’ve effectively de-allocated by doing this, otherwise you risk corrupting the allocator. Any C++ objects you’ve allocated from the buffer will also need their destructors called manually.
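Putting those bitwise operations into code, here is a minimal sketch of an aligned variant (the function names here are my own, and it assumes power-of-two alignments), together with the trivial reset just described:

/* Aligned allocation: round the current offset up to the next multiple
 * of 'alignment' (which must be a power of two) before allocating. */
void* linearBufferAllocAligned(LinearBuffer* buf, uint32_t size, uint32_t alignment) {
    if(!buf || !size || !alignment)
        return NULL;

    /* (offset + alignment - 1) & ~(alignment - 1) rounds up; the gap it
     * skips over is the wasted space discussed above. */
    const uint32_t alignedOffset = (buf->offset + (alignment - 1)) & ~(alignment - 1);
    const uint32_t newOffset = alignedOffset + size;

    if(alignedOffset >= buf->offset && newOffset <= buf->totalSize) {
        void* ptr = buf->mem + alignedOffset;
        buf->offset = newOffset;
        return ptr;
    }
    return NULL; /* out of memory (or the alignment wrapped the offset) */
}

/* Reset the allocator, 'freeing' everything allocated from it. */
void linearBufferReset(LinearBuffer* buf) {
    if(buf)
        buf->offset = 0;
}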

The Stack

Let’s get something out of the way before we start on this allocator. When I talk about a “stack allocator” in this particular case, I’m not talking about the call stack; however, that stack does have an important part to play in avoiding run-time heap allocations, as we shall see later on. So what am I talking about, then? Well, the linear allocator I just described above is an excellent solution to many allocation problems, but what if you want slightly more control over how you free up your resources? The stack allocator will give you this.

Towards the end of my description of the linear allocator, I mentioned that to reset the allocator you can simply set the offset to zero in order to free up all the resources in the allocator. That principle of restoring the offset to a particular value is what guides the implementation of the stack allocator. If you are not familiar with the concept of the stack data structure then now is probably a good time to get acquainted; you can do so here. Back? Okay, each allocation from our stack allocator will optionally be able to get a handle which represents the state of the stack allocator just before that allocation was made. This means that if we restore the allocator to this state (by changing its offset) we are effectively ‘freeing’ up the memory to be reused again. This is shown in the diagram below.

This can be a very useful thing if you want some memory allocated temporarily for data which has a limited scope, for example the lifetime of a specific function or sub-system. This strategy can also be useful for things like level resources, which have a well-defined order in which objects need to be freed up (that is, the reverse of the order in which they were allocated). Here is some example C code to illustrate what I’ve just explained:

typedef uint32_t StackHandle;

/* Header structure for the stack allocator; its layout mirrors the
 * linear buffer above. */
typedef struct StackBuffer {
    uint8_t *mem;       /*!< Pointer to buffer memory. */
    uint32_t totalSize; /*!< Total size in bytes. */
    uint32_t offset;    /*!< Offset of the next free byte. */
} StackBuffer;

void* stackBufferAlloc(StackBuffer* buf, uint32_t size, StackHandle* handle) {
    if(!buf || !size)
        return NULL;

    const uint32_t currOffset = buf->offset;
    if(currOffset + size <= buf->totalSize) {
        uint8_t* ptr = buf->mem + currOffset;
        buf->offset += size;

        if(handle)
            *handle = currOffset; /* the handle is simply the old offset */
        return (void*)ptr;
    }

    return NULL; /* out of memory */
}

/* Restore the allocator to the state captured in 'handle', effectively
 * freeing everything allocated since that handle was taken. */
void stackBufferSet(StackBuffer* buf, StackHandle handle) {
    if(buf)
        buf->offset = handle;
}
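As a quick usage sketch (stackBuf here is a hypothetical, already-initialised StackBuffer), a scoped temporary allocation might look like this:

StackHandle handle;
void* scratch = stackBufferAlloc(&stackBuf, 1024, &handle);
if(scratch) {
    /* ... use the 1KB of scratch memory for some short-lived work ... */
    stackBufferSet(&stackBuf, handle); /* roll back, 'freeing' the scratch */
}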

The Double Stack

If you’re comfortable with the stack concept above, we can now move on to the double-ended stack. This is similar to the stack allocator we just described except that there are two stacks, one which grows from the bottom of the buffer upward and one which grows from the top of the buffer downward. This is shown in the diagram below.

Where would this be useful? A good example would be any situation where you have data of a similar type but with distinctly different lifetimes, or perhaps data that is related and should be allocated from the same static memory buffer (i.e. part of the same subsystem) but has different size properties. It should be noted that where the offsets of the two stacks meet is not mandated; they don’t have to meet exactly halfway through the buffer. In one case the bottom stack can grow very large and the top stack stay small, and vice versa. This added flexibility can sometimes be required to make the best use of your memory buffers. Given its inherent resemblance to the single stack allocator discussed previously, I don’t think I need to insult your intelligence with a full listing, but a minimal sketch of the two allocation paths follows.
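This sketch is my own illustration rather than a definitive implementation; it assumes a header tracking one offset per stack, with topOffset initialised to totalSize:

/* Double-ended stack header: the bottom offset grows upward from the
 * start of the buffer, the top offset grows downward from the end. */
typedef struct DoubleStackBuffer {
    uint8_t *mem;
    uint32_t totalSize;
    uint32_t bottomOffset; /* next free byte, growing upward */
    uint32_t topOffset;    /* one past the last free byte, growing downward */
} DoubleStackBuffer;

/* Allocate from the bottom stack. */
void* doubleStackAllocBottom(DoubleStackBuffer* buf, uint32_t size) {
    if(!buf || !size || size > buf->topOffset - buf->bottomOffset)
        return NULL; /* would collide with the top stack */

    void* ptr = buf->mem + buf->bottomOffset;
    buf->bottomOffset += size;
    return ptr;
}

/* Allocate from the top stack. */
void* doubleStackAllocTop(DoubleStackBuffer* buf, uint32_t size) {
    if(!buf || !size || size > buf->topOffset - buf->bottomOffset)
        return NULL; /* would collide with the bottom stack */

    buf->topOffset -= size;
    return buf->mem + buf->topOffset;
}

The two stacks are free to meet anywhere in the buffer; the only invariant is that they never cross.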

The Pool

We’re going to shift gears a little now, away from the family of allocators described above that are based on linearly advancing pointers or offsets, and move to something a little different. The pool allocator I’m about to describe is designed to work with data types that are of the same kind or size. It splits the memory buffer it is managing into equally sized chunks, and then allows the client to allocate and free these chunks at will (see the diagram below). To do this, it must keep track of which chunks are free, and I’ve seen several ways of implementing this. I personally shy away from implementations such as those using a stack of indices (or pointers) to available chunks, due to the extra storage required, which can often be prohibitive, but I’ve seen them around. The implementation I shall describe here uses no additional storage to manage which chunks in the pool are free. In fact, the header structure for this allocator contains only two member variables, making it the smallest of all the allocators described in this post.

So how does it work internally? To manage which chunks are free we’re going to use a data structure known as a linked list. Again, if you’re not acquainted with this data structure then try reading this. Coming from a PlayStation3 and Xbox360 background, where memory access is expensive, I generally find node-based structures (such as the linked list) leave a rather sour taste, but I think this is perhaps one of the applications I may approve of. Essentially the header structure for the allocator will contain a pointer to a linked list. The linked list itself is spread throughout the pool, occupying the same space as the free chunks in the memory buffer. When we initialise the allocator, we move through the memory buffer’s chunks and write into the first four (or eight) bytes of each chunk a pointer to the next free chunk in the buffer. The header then points to the first element in that linked list. A limitation of storing pointers in the pool’s free chunks in this way is that each chunk must be at least the size of a pointer on your target hardware. See the diagram below.

When allocations are made from the pool, we simply make the linked list pointer in the header structure point at the second element in the linked list and then return the pointer we originally had to the first element. It’s very simple: we always return the first element in the linked list when allocating. Similarly, to free a chunk and return it to the pool, we simply insert it at the front of the linked list. Inserting chunks we want to free at the front of the list, as opposed to the back, has a couple of benefits: firstly, we don’t need to traverse the linked list (or store an extraneous tail pointer in the header structure), and secondly (and more importantly) we stand a better chance of the node we just freed being in the cache for subsequent allocations from the pool. After a few allocations and de-allocations, your pool might look a little like the diagram below.

Some C code for initialising the allocator and making allocations and de-allocations from it is provided below.

/* Pool header: just the two members mentioned above. */
typedef struct Pool {
    uint8_t *mem;  /*!< Pointer to the managed buffer. */
    uint8_t *head; /*!< Head of the free-chunk list (NULL when exhausted). */
} Pool;

/* allocate a chunk from the pool. */
void* poolAlloc(Pool* buf) {
    if(!buf)
        return NULL;

    if(!buf->head)
        return NULL; /* out of memory */

    uint8_t* currPtr = buf->head;
    buf->head = (*((uint8_t**)(buf->head))); /* head moves to the next free chunk */
    return currPtr;
}

/* return a chunk to the pool. */
void poolFree(Pool* buf, void* ptr) {
    if(!buf || !ptr)
        return;

    /* insert the chunk at the front of the free list. */
    *((uint8_t**)ptr) = buf->head;
    buf->head = (uint8_t*)ptr;
}

/* initialise the pool header structure, and set all chunks in the pool as free.
 * Note that chunkSize must be at least sizeof(uint8_t*), since each free chunk
 * stores a pointer to the next. */
void poolInit(Pool* buf, uint8_t* mem, uint32_t size, uint32_t chunkSize) {
    if(!buf || !mem || !size || !chunkSize)
        return;

    /* link each of the first (size / chunkSize) - 1 chunks to its successor... */
    const uint32_t chunkCount = (size / chunkSize) - 1;
    for(uint32_t chunkIndex = 0; chunkIndex < chunkCount; ++chunkIndex) {
        uint8_t* currChunk = mem + (chunkIndex * chunkSize);
        *((uint8_t**)currChunk) = currChunk + chunkSize;
    }

    /* ...and NULL-terminate the list at the final chunk. */
    *((uint8_t**)&mem[chunkCount * chunkSize]) = NULL;
    buf->mem = buf->head = mem;
}
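To tie the three functions together, here is a hypothetical usage sketch (the buffer size and chunk size are arbitrary choices for illustration):

static uint8_t poolMemory[64 * 128]; /* room for 128 chunks of 64 bytes */

Pool pool;
poolInit(&pool, poolMemory, sizeof(poolMemory), 64);

void* chunk = poolAlloc(&pool); /* grab a 64-byte chunk... */
/* ... use the chunk ... */
poolFree(&pool, chunk);         /* ...and hand it back to the pool */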

A Note on Stack-based Allocation (alloca is your friend)

Earlier on, you may recall that I said there’d be a mention of stack-based allocations in the context of the call stack. I’m sure you’ve seen code which conceptually looks something like this:

void myFunction(void) {
    void* myTemporaryMemoryBuffer = malloc(myMemorySize);
    /* do processing limited to this function. */
    free(myTemporaryMemoryBuffer);
}

There is a function which comes with most C compilers that should mean (depending on the size of your allocation) you won’t have to resort to heap allocations for temporary buffers of this ilk. That function is alloca. How alloca works internally is architecture-dependent, but essentially it performs allocations by adjusting the stack frame of your function to give you an area to write data to; this can be as simple as moving the stack pointer (just like the linear allocator we mentioned right at the start). The memory returned to you by alloca is then freed up when the function returns.

There are a few caveats with using alloca that you should be aware of, however. Be sure to check the size of your allocation to make sure you’re not requesting an unreasonable amount from the stack; this can cause nasty crashes later on if your stack gets so large that it overflows. For this reason it is also best to know all the places where your function will be called in the context of the program’s overall flow if you’re contemplating allocating a large chunk with alloca. Use of alloca can affect portability in some limited circumstances (apparently), but I’ve yet to come across a compiler that doesn’t support it.
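As a rough sketch of how the earlier example might change (note that the header declaring alloca varies between toolchains; <alloca.h> is common on Unix-like systems, while some others use <malloc.h>):

#include <alloca.h>
#include <stdint.h>

void myFunction(void) {
    /* A small, bounded scratch buffer taken from this function's stack
     * frame; it is reclaimed automatically on return, so there's no free().
     * The 4KB size here is an arbitrary, deliberately modest example. */
    uint8_t* myTemporaryMemoryBuffer = (uint8_t*)alloca(4096);
    /* do processing limited to this function. */
    (void)myTemporaryMemoryBuffer;
}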

For more details you can consult your favourite search engine.

A Final Thought…

Often the memory one allocates during the course of a processing task is temporary, persisting only for the lifetime of a single frame. Taking advantage of this knowledge and moving allocations of this type into separate memory buffers is essential to combating the adverse effects of fragmentation in a limited-memory system. Any of the above allocation schemes will work, but I would probably suggest going with one of the linear ones, as they are much easier to reset than the pool implementation I’ve described here. Noel Llopis talks about this topic in more detail in this excellent blog post.

The right allocator for you depends on many factors and on what the problem you’re trying to solve demands. The advice I would offer is to think carefully about the patterns of allocations and de-allocations you wish to perform with your system; think about the sizes and lifetimes of these allocations, and try to manage them with allocators that make sense in that context. It can sometimes help to draw memory layouts on paper (yeah, with a pen and everything) or to graph roughly how you expect your memory usage to look over time. Believe it or not, this can really help you understand how a system produces and consumes data. Doing these things can help you decide how to separate your allocations to make memory management as easy, quick and fragmentation-free as possible.

Remember, what I’ve talked about here is just a small selection of some of the simplest strategies that I tend to favour when writing code for console games. It is my hope that if you’ve made it this far and you weren’t already doing this stuff, your brain is now alive with ideas about code you’ve written in the past which could have taken advantage of these techniques to mitigate the substantial drawbacks associated with dynamic memory allocation, or about other strategies you can exploit to solve your memory management problems without dynamic allocation in limited-memory systems.

In closing, I believe that programmers should be mindful of the impact of making dynamic allocations (especially in console games) and think twice about using the malloc function or the new operator. It is easy to have the attitude that you’re not doing many allocations so it doesn’t really matter, but this type of thinking spread across a whole team quickly snowballs and results in death by a thousand paper cuts. If not nipped in the bud, fragmentation and the performance penalties arising from dynamic memory allocation can have catastrophic consequences later in your development lifecycle which are not easy to solve. Projects where memory allocation and management are not thought through properly often suffer from random crashes after prolonged play sessions (due to out-of-memory conditions which, by the way, are near impossible to reproduce) and cost hundreds of programmer hours spent trying to free up memory and reorganise allocations.

Remember: You can never start thinking about memory early enough in your project and you’ll always wish you had started earlier!

More Information

Here are some links to related topics, or to topics that I didn’t have time to cover:

Thanks to Sarah, Jaymin, Richard and Dat for proof-reading this drivel.

This post is reproduced over at my personal blog, which you can find here.