So, as I mentioned MetalScroll, which includes some basic syntax colorising of C/C++/C#, but not Unreal Script.

Since I’m now working with Unreal Script some of the time, and the plug-in is open source, I decided to see if it would be straightforward to extend that functionality. My assumption was that it would be pretty easy, and in the end it was.

But that isn’t really what I want to write about now. It is what that work led me too that I think is interesting.

See, turning on the syntax highlighting for Unreal Script wasn’t too hard, but as it stood – as it still stands as I write this – the keyword colouring code in MetalScroll still only recognises C++ keywords, because when I dug deeper what I discovered was fascinating and took me off on a total tangent.

The initial work was done on office time – it was a worthwhile effort to boost productivity of multiple teams where I work by improving a popular tool, but then I dug a little deeper: How hard would it be to make it use an Unreal Script specific keyword list when highlighting Unreal Script? All I have to do is find the list of keywords, and swap it out for a different list depending on filetype. Right?

So away I go reading the code: C code produced by gperf version 3.0.3 */

Interesting.

Google: gperf.

Result: It’s a GNU tool for generating hash functions.

GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.

That, right there, is gold. Right away I think back, all my years in the industry, and almost all the problems I’ve ever seen with asset names and hash functions, would have been solved. Like magic. Beautiful. If you can feed it a full list of your asset id names, it can spit out a hash function with guaranteed no clashes.

If you use asset names and hash them, and I’m betting a lot of people do, because I’ve seen it a ton of times, then you could to be using this.

But I didn’t stop there.

That’s a pretty tight way of spotting keywords. A fast hash, and a string compare.

But we still need to look at some of the characters in the input string more than once. There is an extra 1 to 3 reads from the input string to build the hash, before the character-by-character string compare.

If we are generating code for a fixed list of keywords (which is reasonable) then that could be improved on, I’m sure.

So I thought about it, and came up with this: If we treat the keyword list as a possibility tree, where each character narrows down which possible keywords it could be, we can stop as soon as we know it’s not a keyword, and if we are still “in” the tree when we hit the end (of both the keyword and the input string) we have a winner.

I started thinking through ways of building the tree structure with look up tables in static data, and then it dawned on me: since it’s generating code, the tree can be directly encoded in the structure of the source code. We can let the compiler do something C and C++ compilers tend to be very good it: Compiling switch statements.

So that evening I went home and I wrote this: gen-is-keyword-fn, and stuck it up on google-code.

It’s not finished yet – I’ve been absurdly busy recently (this post is being hastily rattled out at the 11th hour) – but it shows some initial promise. I haven’t timed yet how well it performs against a gperf based keyword identifier, since I haven’t yet implemented the generator for functions that accept keyword candidates of a known length, but initial testing vs a textbook binary search of a sorted list, give some very promising results: Specifically, with the switch-case coming in at 10%-20% of the execution time of the binary_search.

So, writing code that writes code can pay off. And is fun.

Which is point of my story, I think. For now.