A detail, before we get to my particular implementation: because hash-tries arrange items in the tree by hash, a decent hash function with good mixing and distribution is a must. Once you have that, no special code is needed for balancing the tree – no fancy red/black/treap action needed. You can fairly confidently say that the height of the tree will pretty much track Log64(N), as above. And all is well with the world…
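(To put a rough number on that: for the million-entry test near the end of this post,

Log64(1,000,000) = ln(1,000,000) / ln(64) ≈ 3.3

so you’d expect only three or four levels of tree between the root and any value.)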
This leads to the first interesting design decision: The root of my tree is now actually a T (in the example, a u64). So a trie with 0 or 1 elements makes no memory allocations at all, and simply stores the null T, or the single stored T, respectively. Lovely.
As soon as you add a second entry, it allocates (using malloc) the first internal ‘mini hash table’ node. As implemented, this mini hash table takes just a single 8 byte header, plus the two ‘Ts’ that are being stored.
What happens when you add a third entry? It has to resize the mini hash table node from 2 entries to 3. If you look at the (trivial) definition and use of the routine Capacity() in the code, you can see I chose a simple resizing policy: I try to keep the 64-entry hash tables as small as possible in memory. To this end, I store a 64 bit bitmask in the head of the node (that’s the 8 byte overhead mentioned above), recording which entries in the theoretically-64-wide hash table are full. And then I only store ‘Ts’ (or internal pointers) for the set bits. Cue lots of bit twiddling, but very little wasted space!
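To make the bit twiddling concrete, here’s a rough sketch – not the actual code; the struct name and PopCount64 are placeholders, and T/u64 are as in the rest of the code – of how a lookup into one of these bitmap-packed nodes works:

struct PackedNode           // illustrative layout: 8 byte bitmask header, then only the occupied entries
{
    u64 bitmap;             // bit i set => logical slot i (of 64) is occupied
    T   entries[1];         // really Capacity() entries, packed contiguously in slot order
};

// Return a pointer to logical slot i (0..63), or 0 if that slot is empty.
// The physical index is just the number of occupied slots below i - one popcount.
static T *PackedLookup(PackedNode *n, unsigned i)
{
    u64 bit = 1ull << i;
    if (!(n->bitmap & bit))
        return 0;                                       // slot i is empty: nothing stored for it
    unsigned idx = PopCount64(n->bitmap & (bit - 1));   // how many occupied slots precede i
    return &n->entries[idx];
}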
Currently, the code will re-allocate the node every time it grows, keeping it as small as possible at the expense of lots of reallocations (low memory overhead favoured over speed or lack of fragmentation). You could change Capacity() to force the nodes to grow in steps or powers of two, giving slightly more memory overhead in return for faster operation due to the reduced reallocations. Before you panic about fragmentation, consider that the code will only make allocations of 8+N*sizeof(T), where N only ranges from 1 to 64 (and fewer values still, if you choose a different Capacity() function). So hopefully, you’ll get lots of re-use – perfect for a pooling allocator, me’thinks.
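For example, that ‘grow in powers of two’ variant of Capacity() mentioned above might look something like this (purely illustrative – this is not the Capacity() that ships with the code):

// Hypothetical alternative Capacity(): a node only gets reallocated when its
// occupancy crosses a power-of-two boundary (1, 2, 4, 8, 16, 32, 64 entries).
static inline int CapacityPow2(int num_entries)
{
    int c = 1;
    while (c < num_entries)
        c <<= 1;
    return c;
}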
Because the code stores both T’s, and pointers to internal nodes, in the same arrays, it needs to be able to differentiate between the two types at runtime. That’s why ‘T’ needs to have 1 ‘spare’ bit of storage.
static T *Lookup(T &root, const T &k, bool get) {
    if (HashTrieIsEmpty(root)) // special case: empty trie
        return get ? 0 : HashTrieReplace(root, k);
    T *slot = &root;
    N *n;
    u32 k_hash = HashTrieGetHash(k);
    for (int depth = 0; ; depth += FANOUT_LOG2, k_hash >>= FANOUT_LOG2)
    {
        if (!(n = HashTrieGetNode(*slot))) // is this an inner node or a value?
        {   // it's a leaf node
            if (HashTrieEqual(*slot, k))
                return get ? slot : HashTrieReplace(*slot, k);
            if (get)
                return 0;
Here’s the first example of using ‘an extra bit’ in T, to determine whether the memory at *slot is a real T value, or just a pointer to an inner node. This test is done by the user-supplied predicate HashTrieGetNode, which in the default case casts the T to a size_t and checks the bottom bit – which is what you want for pointers. You could do whatever you want for custom types: just make sure you have that one bit. In my example driver code, I use a hash-trie of ‘randomly’ chosen EVEN 64 bit integers (effectively, 63 bit integers!), with the bottom bit left as room for the ‘internal node’ flag.
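For that integer-keyed example, the predicate could look something like the sketch below. This is illustrative only (the real HashTrieGetNode lives in the user-supplied traits in the code); the convention assumed here is that internal-node pointers are stored with their bottom bit set, while user values keep it clear:

// Sketch of a tag-bit test for a trie of even u64 values, where N is the
// internal node type. Assumes inner-node pointers are stored tagged: (ptr | 1).
static inline N *ExampleGetNode(u64 v)
{
    return (v & 1) ? (N *)(size_t)(v & ~(u64)1) : 0; // tagged => inner node, otherwise a leaf value
}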
Here also is an example of ‘naming’ – if the slot is an internal pointer, we name it ‘n’; if n is null, we have a real, bona-fide ‘T’ – we’re at the leaf of our tree – and can see if it’s the value we’re looking for.
That test is handled by predicate HashTrieEqual – if we have a hit, we either return the found slot (get) or give the user a chance to replace the old value in the trie with the new one – via HashTrieReplace. Either way, we’re done!
If we missed, we carry on: in the get case, we can return saying ‘not found’. In the set case, we are now at the heart of the hash-trie algorithm: we have a clash – between the current leaf value and the new value k. We must allocate a new internal node to differentiate these two, and insert it in place of the old leaf. Let’s see how that is done, next:
    T oldval = *slot;
    u32 old_hash = HashTrieGetHash(oldval) >> depth;
We have now prepared ourselves by grabbing the old value and its hash, and making sure that we snip off the parts of the hash we’ve already consumed, with that cheeky right shift.
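Put another way: because both hashes are kept pre-shifted to the current depth, each level of the trie only ever looks at the bottom FANOUT_LOG2 bits of whatever hash remains – conceptually just:

    // at any depth, the logical slot for a key is the low 6 bits of the
    // remaining (already right-shifted) hash; descending a level shifts them off
    u32 slot_index = k_hash & FANOUT_MASK;
    // ...and on the way down: k_hash >>= FANOUT_LOG2;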
I’m going to skip over the next short for loop, which handles an edge case, and look at the common path first:
return (depth>=30)?Alloc2Linear(k,oldval,slot):Alloc2(k_hash,k,old_hash,oldval,slot);
Here we just allocate a new internal node with 2 children: k (the new value) and oldval (the old value). The Alloc functions also write to the parent slot, replacing the old leaf value with the new internal node pointer. The ? operator is needed because of a nasty edge case: complete hash collisions.
In other words, what happens if you insert two T’s that have the same hash? In this implementation, I ‘keep going’, adding up to 5 layers of trie: that uses up 30 bits of hash (depth==30). If, after 30 bits and 5 pointer chases, the hashes are still the same, I give up, and store all colliding entries in a little linear block, equivalent to a vector<T>. That’s what Alloc2Linear does. Assuming a reasonable hash function, you need at least about 30,000 entries in your hash trie before you have a 50/50 chance of a single collision. In other words, collisions should be rare enough – and the resulting lists short enough – that they are actually faster to search linearly than with something cleverer. The code can always tell that a node is a linear list, instead of a nice bit-packed 64 entry hash table, because depth will be at least 30.
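(That 30,000 figure is just the usual birthday bound: a 30 bit hash has 2^30 possible values, and the number of keys needed before a collision becomes 50/50 likely is roughly

sqrt(2 * 2^30 * ln 2) ≈ 38,600

so ‘at least about 30,000’ is, if anything, slightly conservative.)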
That brings me to the last special case: that for loop I skipped over. If we are inserting k, and it collides with oldval, we need a 2 element node to differentiate them, right? Just like we allocated above – look at the first two orange nodes in the diagram: inner nodes, with 2 children, designed to differentiate two values. But what if the hashes still match at the next 6 bits? And the next? Each internal node of the trie just consumes 6 more bits of the hash – and so we need to make sure we have one internal node for each 6 bits of hash that are identical for our two values. That is, we need to find the depth where the old and new hashes really do differ, and insert dummy nodes up to that point: these dummy nodes each have 1 element – forming a linked list of inner nodes – one for each matching set of 6 bits. This is rare though – you’d hope that your hash function rapidly differentiates your two values – so this for loop will hopefully run 0 times:
    for (; depth < 30 && (old_hash & FANOUT_MASK) == (k_hash & FANOUT_MASK);
           depth += FANOUT_LOG2, k_hash >>= FANOUT_LOG2, old_hash >>= FANOUT_LOG2)
        slot = Alloc1(k_hash, slot);
        if (depth >= 30)
            break; // nodes above bit 30 are a linear list walk - handled outside the main loop
        T *child_slot = n->Lookup(k_hash);
        if (!child_slot)
            return get ? 0 : Insert(n, k_hash, k, slot);
        slot = child_slot;
    } // this is the end of our outermost 'for' loop, that walks the trie
The first line gets the edge case of linear list nodes out of the way – we’ll deal with them in a mo. For now, let’s focus on ‘normal’ hash table nodes. The code looks up a child slot in the current node’s bitpacked hash table (using n->Lookup – which uses the bottom 6 bits of k_hash to return a child). If it’s not found, we either return null (for get: “NOT FOUND!”) – or, in the SET case, we insert the new value into the hash table, knowing that there is a nice empty slot for it to go into.
If we DID find something in that slot, we set it as the new slot to be considered by the main body of code, and let the outer for loop carry us back to the top, to handle whatever we found: either another internal node, or a leaf.
That’s it – save the last wrinkle of handling the linear lists for the case where there was a full 30 bit hash collision. The code is pretty straightforward:
    // we've run out of tree! switch to linear list search.
    // 30 bit hash collisions are hopefully very, very rare.
    T *child_slot = n->LookupLinear(k);
    if (child_slot)
        return get ? child_slot : HashTrieReplace(*child_slot, k);
    return get ? 0 : Append(n, k, slot);
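(Conceptually, LookupLinear is about as exciting as it sounds – something along these lines, give or take the packed storage details; this sketch is mine, not the shipped code:)

// Illustrative linear scan over a collision node's entries, reusing the same
// user-supplied equality predicate as the rest of the trie.
static T *LookupLinearSketch(T *entries, int count, const T &k)
{
    for (int i = 0; i < count; i++)
        if (HashTrieEqual(entries[i], k))
            return &entries[i];
    return 0;
}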
That’s it for the walkthrough. And I’ve written far too much already for this blog post – timeout! I haven’t had much of a chance to profile this yet, let alone optimise & debug it. However, the included test program makes a half arsed attempt to put it through its paces: it inserts a million ‘random’ integers; then it does 2 million ‘gets’, 50% of which hit and 50% of which miss; and finally it deletes all million entries.
I also wrote the same test for a simple std::set. In many ways, this is a grossly unfair comparison – std::set maintains a nice ordering that you can use to iterate your set in-order; a closer comparison would be std::hash_set. But I couldn’t get it to work! So I gave up. I am an STL hash_set noob. Also, I’ve seen people use std::set (and equivalently std::map) many times where they had no need for a well-ordered set; so in some sense, the comparison has some tiny shred of validity, due to this common abuse of the STL…
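For reference, the std::set side of that test boils down to something like the sketch below – the RNG, key generation and timing details here are my own guesses, not the shipped test program:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <set>
#include <vector>

int main()
{
    const int N = 1000000;
    std::mt19937_64 rng(12345);
    std::vector<uint64_t> keys(N);
    for (auto &k : keys)
        k = rng() << 1;                       // 'random' EVEN 64 bit integers, as in the trie test

    std::set<uint64_t> s;
    auto t0 = std::chrono::steady_clock::now();
    for (auto k : keys)
        s.insert(k);
    auto t1 = std::chrono::steady_clock::now();

    // 2 million gets: half hit existing (even) keys, half miss (odd keys can't be present)
    size_t hits = 0;
    for (int i = 0; i < N; i++)
    {
        hits += s.count(keys[i]);             // should hit
        hits += s.count(keys[i] | 1);         // should miss
    }
    auto t2 = std::chrono::steady_clock::now();

    for (auto k : keys)
        s.erase(k);
    auto t3 = std::chrono::steady_clock::now();

    auto usec = [](std::chrono::steady_clock::time_point a, std::chrono::steady_clock::time_point b)
    { return (long long)std::chrono::duration_cast<std::chrono::microseconds>(b - a).count(); };
    std::printf("set insert %lldusec  set get %lldusec (%llu hits)  set delete %lldusec\n",
                usec(t0, t1), usec(t1, t2), (unsigned long long)hits, usec(t2, t3));
    return 0;
}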
I’m really making excuses – to test & benchmark this properly would probably take a whole other blogpost, and you’ve probably had enough by now. I certainly have. So here are the numbers, on my slightly crappy PC with win64 ‘release’ build – compiled with default options on MSVC 2010.
I tested 3 configurations: hash trie with a fanout of 32, hash trie with fanout of 64, and stl set. To my surprise, the fanout 32 was very slightly faster than fanout 64; YMMV.
The first column is the operation; the second the number of iterations (gets were arranged to be 50% misses, 50% hits); the last column is microseconds. I also ran the code a few times in order to warm things up – the numbers are from the fastest run I could get with each algorithm.
TRIE: process working set went from 304k -> 31848k

trie fanout 32:
  trie insert    1000000     333002 usec
  trie get       2000000     435829 usec
  trie get       2000000     422247 usec
  trie get       2000000     423819 usec
  trie delete    1000000     407811 usec

trie fanout 64:
  trie insert    1000000     351821 usec
  trie get       2000000     456112 usec
  trie get       2000000     455337 usec
  trie get       2000000     459914 usec
  trie delete    1000000     419332 usec

STL SET: process working set went from 324k -> 94268k
  set insert     1000000     956649 usec
  set get        2000000    1864157 usec
  set get        2000000    1859870 usec
  set get        2000000    1879553 usec
  set delete