Data Structures: One size does not fit all

Instapaper Text

Data Structures: One size does not fit all

Whether or not you subscribe to Data oriented Design, you really ought to care deeply about the data-structures you use to describe your game’s state. Whether it’s a scenegraph, asset library, server-side data, user names, even string tables, a good data structure will make the difference between usable and unusable; between even performance and lumpy performance; between bounded and unbounded memory use.

Mistakenly, many people believe you should just use generic libraries and containers and never enter the messy world of the red-black-tree delete. Well, for non-critical code (ie 99% of it) that is probably true; STL is fine for utilities, and even non-critical client code (though not everyone here will agree), and many studios have their own preferred handbag of containers that suit their particular needs (EA STL, etc; at MM we have various stl vector-alikes, with more constrained memory patterns, some string classes that dont suck, and so on). But already, you’ll find fragmentation of opinion and preferred implementation, from one studio and coder to the next. One size is already, not fitting all.

And while template meta-programming has brought a fair dollop of generic programming lushness to C++, it remains a rubbish language for really re-usable, configurable container code. Yes, with traits and all sorts of magic it is possible, maybe, with great pain, to make things very configurable – but what a mess! eg, STL allocators. I rest my case. What is the issue I have with this, and the steaming hillock that is, for example, boost? Answer: That programmers end up using their precious brain cycles (which are more scarce & costly than cpu cycles) understanding the idiosyncrasies of a crufty API or its hideous template-C++-error-messages, instead of using the same cycles tailoring their data structures to their needs, directly.

Yes, I am once again advocating re-inventing the wheel – but only once or twice per big project, and only where it REALLY, REALLY matters. A nice side effect of rolling your own, will be that you really get to know your data, and the way it is used. Your code and the runtime will benefit hugely. (After all, data structure choice is as much about how you’ll use & access data, as the data itself).

I’d say there are 3 levels, in the RPG sense, of programmer-data-structure-knowledge:

level 1 – the noob – I dont know what a data structure is, let alone what O(log N) means.
level 2 – Oh I know all about O() notation, I post to reddit regularly, but ask me to write a balanced tree delete function and I’ll tell you to go and read the stl/boost/favourite-library manual
level 3 – Aha! I’ve read altdevblogaday for years and I know my treaps from my tries.

Well, what can I say? level 1, please go and read the internet and come back when you’re done. Level 2 – good on ya mate, that’ll probably cover you for 99% of your coding life. But stop reading, life’s too short to implement a new hashmap, eh? better get back to your AssetClassFactory and refactor that class heirachy one more time. Level 3 – ah, my brother, how I love thee. Read on, for you’ll never tire of the beauty of a Fibonacci heap, even though you know you’ll never actually write one.

still here? jolly good.

The main reasons I find myself reaching outside the STL bag are:

I need some additional statistics on the data; most commonly, so-called order statistics. Canonical example: you have a set of high scores in a container, sorted by score. What rank (ie index, if listed out in order) would a score of 1,000 get? What is the score of the person with rank 53? the underlying data structure of most STL (multi)sets, the red-black tree, can answer this question efficiently with the addition of a single int per node; alas it wasn’t in the grand plan of SGI or whoever it was who wrote the thing. Moving on.
I actually care about my memory usage, or allocation patterns. It’s just a fact of life that your particular memory/performance tradeoff will not always be exactly that chosen by the STL chaps. Not their fault, they just weren’t around when you designed your particular use case. In some rare cases writing an STL allocator is one option, but if you actually want to change the node structure (eg do/dont store parent pointers in a tree; make the nodes intrinsic/extrinsic) – or anything wackier – you’d better get down to the nitty gritty and leave that stl in the dust.
the stl simply doesn’t (portably) implement the data structure you’re after. cough hashmaps cough
you are a masochist and you have time to kill

once you leave the cosy confines of the stl, it can be liberating. do you have ANY IDEA how many awesome data structures there are out there? it’s like a primordial soup of ideas and bizarre tree types. There are many, many, good papers to read but here are some pointers to some of my favourites. If you don’t know them, go check them out. If you do, nod sagely as you read this, my assorted collection of some you might not have heard of . Give youself +1 smug points for each you know, and +10 for each you’ve implemented. Either way, know that you can absolutely KICK THE SHIT out of a generic implementation of a container, or a bad choice of underlying data structure, if you tailor your implementation to your specific use case and your specific data set. This is another example of generalisation having an under-estimated cost (or, over-estimated benefit). at least IMHO.

push to front queue

this is nearly always implemented as a doubly linked list, and the idea is simple: each time you use an element in the list, you move it to the head. This is handy because commonly used things bubble to the front in a nice, self balancing way; it’s great, for example, as a really simple way to implement an LRU cache, perhaps where you know that the caller uses some results a lot, and some a little, and that the distribution changes over time.

It’s also used as a pre-process in compression algorithms like the BWT – I refer you to the internet.

the treap

this cute little wikipedia again) to maintain the tree as a valid binary tree, but ALSO as a heap when viewed in terms of the priority. That means, the root node will always have the highest priority. This, my friends, you might be able to use to your advantage!

This has a few benefits: the code is really, really simple & short; if your priorities aren’t just random numbers, in certain special cases (like, the ones where you want to use a treap), you can actually use a secondary value (time, or, er, a priority, or whatever) as the priority directly – to allow you to pop off the highest priority (Which is always the head of the tree) in O(1) time. This is a good example of what I’m going to call ‘dual porting’ below, so read on for more.

the trie and friends

there are a bunch of variations of this, but the way to think about it is this: if you were storing a dictionary of 1m words, you’d want to store it in a way that efficiently exposed the redundancy between the keys ‘altdevblog’ and ‘altdevchat’. why store the ‘altdev’ part twice? you can share them, surely? indeed. sticking with english, you would have a tree structure that fanned out 26 ways (or however large your alphabet is) at each node; the root represents the empty string, and each edge/child pointer you walk down, indicates each letter of the key in turn. indeed, you wouldn’t even need to store the keys at all – they would be implicit in the ‘position’ in the tree you ended up at. so ‘altdevblog’ would involve following the edges from the root, along ‘a’, then ‘l’, then ‘t’, etc. when you get to ‘altdev’, you’d have a node with children for b and c, corresponding to ‘altdevb…’ and ‘altdevc…’, and so you’d walk down the relevant path. if that made no sense, you guessed it… wikipedia.

When using a trie, you have 3 implementation details to work out, and what you choose for each entirely depends, as usual, on the distribution of keys in your data:

what radix (fan size?) – how many kids do your keys have? you could have a binary trie, and just branch based on successive bits of the key (good way of storing huffman tables, maybe?); or, fan 26 ways for character text; or 256 ways for UTF8. (note, the word radix does indeed relate back to radix sort, see Steven Tovey’s article on altdevblogaday – a trie is, viewed with sufficient squint, like a data-structure form of a radix sort). Depending on what you choose, you’ll be trading speed for memory, and your cache line size, as well as the data, will come into play in deciding.
how to deal with sparse nodes? meaning, if you’re fanning 256 ways, if your keys are lumpy (eg english), you’ll find not many nodes have all their 256 children slots used up. you don’t want 254 null pointers in memory just to store the 2 non-null ones, so most people break out their bit-twiddling skills at this point, and pack together the nodes in yet another memory/speed tradeoff that suits the particular use-case.
Lastly, you may realise that many of your nodes have ‘arity’ of 1 – meaning they only have 1 kid. This is caused by long, common runs in your keys. For example, lots of keys that start ‘altdevb’ are probably going to continue at least to ‘altdevblog’, before branching. the chain of nodes a-l-t-d-e-v look like linear bits of tree – what a waste! Some implementations choose to reduce memory footprint, by storing these runs compactly without lots of nodes. Look up critbit trees for a particularly elegant implementation of this in the binary case.

And that’s tries. There’s a nice twist on tries, while we’re here, where you use the HASH of your keys rather than the keys directly. Assuming a non-lame hash function, you can now guarantee, malicious users aside, that your key space is pretty evenly covered. That means you can choose your radix based on other factors, like your processors cache line size; and, you can make assumptions that your trie will be well balanced (making its performance very even). Look up ‘ideal hash tries’ (PDF warning) for a really nice implementation. You lose the ability to iterate your trie in alphabetical order, of course… but that might be ok, depending on what you want.

the skip list

I think of a skip list, which gained popularity only recently, as a sideways tree, or a linked list with fast lanes. The idea is simple: start with a (singly) linked list, and observe that finding the 500th element in it sucks performance wise. Then, for each node, add a small number of extra forward pointers, stacked above the original linked list ‘next’ pointer. I call that a node of height ‘n’. how do you choose n? each node gets its own height – with probability P^n (where P=0.5 or 0.25, typically), and n is a ‘random’ small integer. (Like the treap, interesting definitions of random can be used here). The memory cost, in the case that P=0.25, is only an average cost of 1/3 of an extra pointer – ie the height on average is 1.333 pointers, ie less than 2 bytes overhead compared to a linked list, on a 32 bit system – (and graphics programmers should see where that 1/3 comes from immediately).

By making these ‘fast lane’ pointers ‘skip’ over to only point at the nodes that are of equal or higher height (see diagram), you can turn linear searches into tree-like log(n) ones. First you walk the really high nodes, skipping forward really fast; then, if you overshoot, you drop down to progressively slower ‘lanes’ until you are walking along the original linked list. logarithmic search performance is yours! Joy!

Like the treap, the skip list is self-balancing on-average thanks to its randomization, so it requires very little care and feeding; and that in turn leads to simple code.

The downside, like all randomized algorithms, is that you only get the performance ON-AVERAGE; if rand() is feeling naughty or your data set is tiny, odd things can happen.

order statistics

While we’re in the mindset of skipping, you can probably see how you might implement the order statistics I mentioned at the top of the post: you store along with each forward pointer, a counter of how far you’re skipping forward when you walk down that link. Finding the rank of a node is then merely a case of book-keeping as you skip down the fast-lane. Ranked-trees work in a similar fashion: for each node, you store the total number of children nodes that you have (defined recursively in terms of your children: my count = 1 + sum of all my kid’s counts). The root can now tell you in O(1) time how many nodes are in the tree, and with elementary arithmetic you can do all the rank-type operations you could ever want. Try shoe-horning that into an STL set…

‘dual-porting’

I could go on for much longer, but as you can tell, I’ll just end up referring you to wikipedia. So, now you have a taste for it, go forth and read! Even better than wikipedia, go and read Knuth volume 3: Sorting & Searching. It’s amazing. And THEN read the internet. While you’re away, I’ll just clean up the last thing I mentioned above: dual-porting.

What I mean by this, is the case that you have some data structure that you want to efficiently look up into, indexed by several (normally 2) keys. I’m not sure if dual-porting is an official term, but it speaks to me of hardware design, and cache layouts, and register files, and I’ll just say that hardware ideas like that are a rich source of inspiration for the intrepid data-structure-naught.

Dual porting is surprisingly common: the file table you want to index by name AND numerical-id or CRC; the high score table you want sorted by user name AND high score; the cache that you want to look up by key, but also be able to evict old entries from (LRU).

How you implement this depends entirely on your data, and how you use it, which is why, once again, one size does not fit all. Often, you’ll end up using two ‘indices’ (in the database sense) (ie two data structures) to handle each ‘port’; but that gives you two lovely axes of algorithm choice! the joy! For example, the cache example can elegantly be solved with a push-to-front list: you store your data in a set sorted by key, in a tree or trie or treap or hashmap or whatever, and each time you find/touch/edit a node, you push it to the front of a (separate) linked list. Now, when memory gets tight, you just ‘free’ elements from the tail of the list, in LRU order. Lovely.

This illustrates a subtle point: you COULD have had two fully searchable / sorted data structures: a set/tree/trie/treap by key, and a set/tree/trie/treap by last-access-time. But that is overkill: you end up paying the maintenance cost of two trees, where you need only pay the cost of one tree and a single O(1) list update. You’ve sacrificed some flexibility, but by knowing your data, and the uses of it, and choosing your data structures accordingly (not being afraid to implement the ones you need that are not vanilla) you can improve performance AND memory consumption.

the end

In summary: One size definitely does not fit all! Reinvent the wheel! Enjoy data! Profit! Disagree in the comments!

#AltDevBlogADay

Alex Evans