
Like a B-tree (as opposed to a binary tree), Ideal Hash Tries typically have a very wide branching factor, say 32 or 64. This massively reduces the memory overhead and pointer chasing, which turns into a noticeable speed boost on modern architectures. How wide you go probably depends on your data, and the size of your cache lines. (One size doesn’t fit all, remember?) But all you really need to think about, in general terms, is that by having a widely fanned tree, the height is drastically reduced (rather than being Log2(N), it’s Log64(N), say – so for 16 million entries stored, instead of a tree of height 24, you have a tree of height 4. A LOT fewer pointers to follow!)
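To put numbers on that height claim, here’s a tiny sketch (my illustrative helper, not anything from the actual trie code) that computes the worst-case height for a given branching factor:

```cpp
#include <cstdint>

// Rough worst-case tree height for n entries with the given branching
// factor: the number of times you must ceil-divide n by the fanout
// before reaching a single node.
static int TreeHeight(uint64_t n, uint64_t fanout) {
    int height = 0;
    while (n > 1) {
        n = (n + fanout - 1) / fanout; // ceiling division
        ++height;
    }
    return height;
}
```

For 16 million entries this gives 24 levels at fanout 2 but only 4 at fanout 64, matching the figures above.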

A detail, before we get to my particular implementation: because hash tries arrange items in the tree by hash, a decent hash function with good mixing and distribution is a must. Once you have that, no special code is needed for balancing the tree – no fancy red/black/treap action needed. You can fairly confidently say that the height of the tree will pretty much track Log64(N), as above. And all is well with the world…

Now, to the implementation details. Ah, here is the devil! In the spirit of ‘one size does not fit all’, I decided to write quite a highly flavoured implementation. In particular, I have a simple C++ template class that implements Get, Set, and Delete (yes, I bothered with Delete! nobody ever does that in data structures class! Brownie points for me!) on a type T. But T is not entirely unconstrained:

- It must be easily copied around; I’m assuming it’s basically a pointer to some complex type. I freely make assignments of T’s in my code.
- It must be hashable, comparable for equality, and have a value that equates to ‘null’ or the empty trie (just like, er, a null pointer)
- It must have 1 bit of storage spare, to be used as a flag, and be at least the size of your system’s pointer. Why? Because I use that flag bit to say ‘this bit of memory is not actually a T, it’s a pointer to an internal trie node’.
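For a pointer-like T, the user-overloadable hooks used in the walkthrough below (HashTrieIsEmpty, HashTrieEqual, HashTrieGetHash) might look roughly like this – the names are from the post, but the bodies here are my guesses at plausible defaults, not the author’s exact code:

```cpp
#include <cstdint>

typedef uint64_t T; // example T: pointer-sized, with the bottom bit spare

// The 'null' T denotes the empty trie, just like a null pointer.
static bool HashTrieIsEmpty(const T &t) { return t == 0; }

// Plain equality works for pointer-like types.
static bool HashTrieEqual(const T &a, const T &b) { return a == b; }

// Any decent mixing function will do; this is a simple 64->32 bit
// multiplicative mix (an assumption - the post only requires 'good mixing').
static uint32_t HashTrieGetHash(const T &t) {
    uint64_t h = t * 0x9E3779B97F4A7C15ull;
    return (uint32_t)(h >> 32);
}
```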

This leads to the first interesting design decision: The root of my tree is now actually a T (in the example, a u64). So a trie with 0 or 1 elements makes no memory allocations at all, and simply stores the null T, or the single stored T, respectively. Lovely.

As soon as you add a second entry, it allocates (using malloc) the first internal ‘mini hash table’ node. As implemented, this mini hash table takes just a single 8 byte header, plus the two ‘Ts’ that are being stored.

What happens when you add a third entry? It has to resize the mini hash table node from 2 entries to 3. If you look at the (trivial) definition and use of the routine Capacity() in the code, you can see I chose a simple resizing policy: I try to keep the 64-entry hash tables as small as possible in memory. To this end, I store a 64 bit bitmask in the head of the node (that’s the 8 byte overhead mentioned above), recording which entries in the theoretically-64-wide hash table are full. And then I only store ‘Ts’ (or internal pointers) for the set bits. Cue lots of bit twiddling, but very little wasted space!
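The bit twiddling presumably boils down to the classic bitmap-popcount trick: to find where (or whether) logical slot i lives in the compressed array, count the set bits below bit i. A sketch of that trick (my code, not lifted from the post):

```cpp
#include <cstdint>

// Portable popcount; real code would likely use a compiler intrinsic
// such as __builtin_popcountll (GCC/Clang) or __popcnt64 (MSVC).
static int PopCount64(uint64_t x) {
    int c = 0;
    for (; x; x &= x - 1) ++c; // clear lowest set bit each iteration
    return c;
}

// Is logical slot i occupied in the theoretically-64-wide table?
static bool SlotUsed(uint64_t bitmap, int i) { return (bitmap >> i) & 1; }

// If so, at which index in the compressed array does it live?
// Answer: the number of set bits strictly below bit i.
static int SlotIndex(uint64_t bitmap, int i) {
    return PopCount64(bitmap & ((1ull << i) - 1));
}
```

So a node with bits 1, 5 and 9 set stores just three entries, and slot 9’s value sits at compressed index 2.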

Currently, the code will re-allocate the node every time it grows, keeping it as small as possible at the expense of lots of reallocations. (low memory overhead favoured over speed or lack of fragmentation). You could change Capacity() to force the nodes to grow in steps or powers of two, giving slightly more memory overhead in return for faster operation due to the reduced reallocations. Before you panic about fragmentation, consider that the code will only make allocations of 8+N*sizeof(T), where N only ranges from 1 to 64 (and fewer values still, if you choose a different Capacity() function). So hopefully, you’ll get lots of re-use – perfect for a pooling allocator, me’thinks.
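Capacity() itself is named in the post; the power-of-two variant below is my sketch of the alternative policy described above, not the shipped code:

```cpp
#include <cstdint>

// Tightest policy (as shipped): capacity tracks population exactly,
// so the node reallocates on every single growth.
static uint32_t CapacityTight(uint32_t population) { return population; }

// Alternative policy: round up to a power of two, so a node only
// reallocates when it crosses a power-of-two boundary - a little
// slack in exchange for far fewer reallocations.
static uint32_t CapacityPow2(uint32_t population) {
    uint32_t c = 1;
    while (c < population) c <<= 1;
    return c;
}
```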

Because the code stores both T’s, and pointers to internal nodes, in the same arrays, it needs to be able to differentiate between the two types at runtime. That’s why ‘T’ needs to have 1 bit ‘spare’ of storage.

One last thing before we get to the code: ‘style’. This one was a bit of an experiment; I had a few goals in mind, that may catch in your throat:

- I was consciously trying to keep things reading from top to bottom, keeping the indentation as low as possible.
- I was also going for a sort of ‘C+’ style, thinking in C-like terms but allowing a little bit of template-ness to creep in to avoid macros
- Probably influenced by jQuery, there’s a bit of ‘chaining’ going on. In other words, a function (often with side effects) will return the key ‘actor’, often just one of its parameters, in order to let the function be chained, for example to a return, within a single expression. To me, this counteracts the low code density you sometimes see in C++ code – and I like the resulting ‘relentlessness’ of it – every line moves things forward as much as I can manage. You may not like it – and this is my first stab at it.
- In a similar spirit, I’ve got into the habit of ‘naming’ things if I use them in an if() or while() or for() header, and then again in the body. So I allow myself to make assignments as a side effect of an if() condition for example; in Duff’s article on the subject, he wrestles with his desire to do assignments like this, worrying about side effects where people least expect them. But I’d argue that this type of assignment is a special case of side effects – ‘naming’ – that is acceptable.
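A toy illustration of both habits – chaining and ‘naming’ – on a made-up linked list (my example, nothing to do with the actual trie code):

```cpp
#include <cstddef>

struct Node { int value; Node *next; };

// 'Chaining': a function with a side effect returns its key actor,
// so it can sit inside a larger expression (or a return statement).
static Node *SetValue(Node *n, int v) { n->value = v; return n; }

// 'Naming': the assignment happens inside the while() header, and
// the name 'next' is then reused - a side effect, but a tame one.
static int LastValue(Node *head) {
    Node *n = head;
    Node *next;
    while ((next = n->next) != 0)
        n = next;
    return n->value;
}

// Tiny demo combining both habits in one expression.
static int Demo() {
    Node c = {3, 0}, b = {2, &c}, a = {1, &b};
    return SetValue(&a, LastValue(&a))->value;
}
```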

static T *Lookup(T &root, const T &k, bool get) {

This is the main tree-walking function; I wrap it for Get() and Set(), but they just call through with ‘get’ set appropriately, as the code is very similar. I hope the code optimiser notices this; if not, you could always put ‘get’ into a template parameter.

if (HashTrieIsEmpty(root)) // special case: empty trie
    return get ? 0 : HashTrieReplace(root, k);

Here’s my first special case: calling out to a user-overloaded predicate HashTrieIsEmpty (by default, comparing against NULL, which works if T is pointer-like), we decide if the trie is empty. In that case, we either fail-to-get-anything, or replace the root with the new value. Done, and done.

T *slot = &root;
N *n;
u32 k_hash = HashTrieGetHash(k);

In preparation for the tree walk, I prepare the ‘slot’ pointer: it’s a pointer to the previous step of the walk – the place in the parent node, where I should make any updates if I need to.

k_hash holds the hash of the value we are looking for (k), and we are going to consume it, 6 bits at a time: (in the code, FANOUT_LOG2 is 6 or 5, configurably at compile time)

for (int depth = 0; ; depth += FANOUT_LOG2, k_hash >>= FANOUT_LOG2) {
    if (!(n = HashTrieGetNode(*slot))) // is this an inner node or a value?
    {   // it's a leaf node
        if (HashTrieEqual(*slot, k))
            return get ? slot : HashTrieReplace(*slot, k);
        if (get)
            return 0;

Here’s the first example of using ‘an extra bit’ in T, to determine if the memory at *slot is either a real T value, or just a pointer to an inner node. This test is done by the user supplied predicate HashTrieGetNode, which, in the default case, casts the T to a size_t and checks the bottom bit – which is what you want for pointers. You could do whatever you want for custom types: just make sure you have that one bit. In my example driver code, I use a hash-trie of ‘randomly’ chosen EVEN 64 bit integers (effectively, random 63 bit integers!), with the bottom bit left as room for the ‘internal node’ flag.
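For concreteness, the pointer-style tagging might be sketched like this – my guess at the shape of the default HashTrieGetNode and its inverse, not the author’s exact code:

```cpp
#include <cstdint>

struct N;           // internal node type, opaque for this sketch
typedef uint64_t T; // example T: an EVEN 64 bit integer, bottom bit spare

// Bottom bit set => this T is really a tagged pointer to an internal node.
static T TagNode(N *node) { return (T)(uintptr_t)node | 1; }

// Returns the node pointer if tagged, or 0 if this is a real leaf T.
static N *HashTrieGetNode(T t) {
    return (t & 1) ? (N *)(uintptr_t)(t & ~(T)1) : 0;
}
```

Note this relies on malloc returning pointers aligned to at least 2 bytes, so the bottom bit of a genuine node pointer is always free.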

Here also is an example of ‘naming’ – if the slot is an internal pointer, we name it ‘n’; if n is null, we have a real, bona-fide ‘T’ – we’re at the leaf of our tree – and can see if it’s the value we’re looking for.

That test is handled by predicate HashTrieEqual – if we have a hit, we either return the found slot (get) or give the user a chance to replace the old value in the trie with the new one – via HashTrieReplace. Either way, we’re done!

If we missed, we carry on: in the get case, we can return saying ‘not found’. In the set case, we are now at the heart of the hash-trie algorithm: we have a clash – between the current leaf value and the new value k. We must allocate a new internal node to differentiate these two, and insert it in place of the old leaf. Let’s see how that is done, next:

T oldval = *slot;
u32 old_hash = HashTrieGetHash(oldval) >> depth;

We have now prepared ourselves by grabbing the old value and its hash, and making sure that we snip off the parts of the hash we’ve already consumed, with that cheeky right shift.

I’m going to skip over the next short for loop, which handles an edge case, and look at the common path first:

return (depth >= 30) ? Alloc2Linear(k, oldval, slot)
                     : Alloc2(k_hash, k, old_hash, oldval, slot);

Here we just allocate a new internal node with 2 children: k (the new value), and oldval (the old value). The Alloc functions also write to the parent slot, replacing the old leaf value with the new internal node pointer. The ?: operator is needed because of a nasty edge case: complete hash collisions.

In other words, what happens if you insert two T’s that have the same hash? In this implementation, I ‘keep going’, adding up to 5 layers of trie: that uses up 30 bits of hash (depth==30). If, after 30 bits and 5 pointer chases, the hashes are still the same, I give up, and store all colliding entries in a little linear block, equivalent to a vector<T>. That’s what Alloc2Linear does. Assuming a reasonable hash function, you need at least about 30,000 entries in your hash trie before you have a 50/50 chance of a single collision. In other words, collisions should be rare enough, and these lists short enough, that they are actually faster to search linearly than with something cleverer. The code can always tell that a node is a linear list, instead of a nice bit-packed 64 entry hash table, because depth will be at least 30.
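That ‘about 30,000 entries’ claim is just the birthday bound for an effectively 30-bit hash. A quick sanity check (my arithmetic, not from the post):

```cpp
#include <cmath>

// Birthday bound: the number of uniformly random values drawn from
// 2^bits possibilities at which the probability of at least one
// collision reaches roughly 50%: sqrt(2 * ln(2) * 2^bits).
static double BirthdayBound(int bits) {
    return std::sqrt(2.0 * std::log(2.0) * std::pow(2.0, bits));
}
```

For bits = 30 this comes out around 38,600 entries, so ‘at least about 30,000’ is the right ballpark.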

That brings me to the last special case: that for loop I skipped over. If we are inserting k, and it collides with oldval, we need a 2 element node to differentiate them, right? Just like we allocated above – look at the first two orange nodes in the diagram: inner nodes, with 2 children, designed to differentiate two values. But what if the hashes *still* match at the next 6 bits? And the next? Each internal node of the trie just consumes 6 more bits of the hash – and so we need one internal node for each 6 bits of hash that are identical for our two values. I.e. we need to find a place where the old and new hashes really do differ, and insert dummy nodes up to that point: these dummy nodes each have 1 element – forming a linked list of inner nodes – one for each matching run of 6 bits. This is rare though – you’d hope that your hash function rapidly differentiates your two values – so this for loop will hopefully run 0 times:

for (; depth < 30 && (old_hash & FANOUT_MASK) == (k_hash & FANOUT_MASK);
       depth += FANOUT_LOG2, k_hash >>= FANOUT_LOG2, old_hash >>= FANOUT_LOG2)
    slot = Alloc1(k_hash, slot);

That’s it for the meaty part of insert: resolving hash collisions, in all its nasty forms. All we’re left to do is the part that traverses down previously created inner nodes. (If you remember, way up top of this walkthrough, the first ‘if’ statement handled leaf nodes; now it’s time to handle inner nodes). That’s what this does:

    if (depth >= 30)
        break; // nodes above bit 30 are a linear list walk - handled outside the main loop
    T *child_slot = n->Lookup(k_hash);
    if (!child_slot)
        return get ? 0 : Insert(n, k_hash, k, slot);
    slot = child_slot;
} // this is the end of our outermost 'for' loop, that walks the trie

The first line gets the edge case of linear list nodes out of the way – we’ll deal with them in a mo. For now, let’s focus on ‘normal’ hash table nodes. It first looks up a child slot in the current slot’s bitpacked hash table (using Lookup – which uses the bottom 6 bits of k_hash to return a child). If it’s not found, we either return null (for get: “NOT FOUND!”) – or in the SET case, we insert the new value into the hash table, knowing that there is a nice empty slot for it to go into.

If we DID find something in that slot, we set it as the new slot to be considered by the main body of code, and let the outer for loop carry us back to the top, to handle whatever we found: either another internal node, or a leaf.

That’s it – save the last wrinkle of handling the linear lists for the case where there was a full 30 bit hash collision. The code is pretty straightforward:

// we've run out of tree! switch to linear list search.
// 30 bit hash collisions are hopefully very, very rare.
T *child_slot = n->LookupLinear(k);
if (child_slot)
    return get ? child_slot : HashTrieReplace(*child_slot, k);
return get ? 0 : Append(n, k, slot);

That’s it for the walkthrough. And I’ve written far too much already for this blog post – timeout! I haven’t had much of a chance to profile this yet, let alone optimise & debug it. However, the included test program makes a half arsed attempt to put it through its paces: it inserts a million ‘random’ integers; then it does 2 million ‘gets’, 50% of which hit and 50% of which miss; and finally it deletes all million entries.

I also wrote the same test for a simple std::set. In many ways, this is a grossly unfair comparison – std::set maintains a nice ordering that you can use to iterate your set in-order; a closer comparison would be std::hash_set. But I couldn’t get it to work! So I gave up. I am an STL hash_set noob. Also, I’ve seen people use std::set (and equivalently std::map) many times where they had no need for a well-ordered set; so in some sense, the comparison has some tiny shred of validity, due to this common abuse of the STL…

I’m really making excuses – to test & benchmark this properly would probably take a whole other blogpost, and you’ve probably had enough by now. I certainly have. So here are the numbers, on my slightly crappy PC with win64 ‘release’ build – compiled with default options on MSVC 2010.

I tested 3 configurations: hash trie with a fanout of 32, hash trie with fanout of 64, and stl set. To my surprise, the fanout 32 was very slightly faster than fanout 64; YMMV.

The first column is the operation; the second the number of iterations (gets were arranged to be 50% misses, 50% hits); the last column is microseconds. I also ran the code a few times in order to warm things up – the numbers are from the fastest run I could get with each algorithm.

TRIE process working set went from 304k -> 31848k

trie fanout 32:
trie insert   1000000    333002 usec
trie get      2000000    435829 usec
trie get      2000000    422247 usec
trie get      2000000    423819 usec
trie delete   1000000    407811 usec

trie fanout 64:
trie insert   1000000    351821 usec
trie get      2000000    456112 usec
trie get      2000000    455337 usec
trie get      2000000    459914 usec
trie delete   1000000    419332 usec

STL SET process working set went from 324k -> 94268k

set insert    1000000    956649 usec
set get       2000000   1864157 usec
set get       2000000   1859870 usec
set get       2000000   1879553 usec
set delete