Comments on: Strings Redux The basis can be something like this.

Note that you shouldn’t hash all the strings, only the ones used as identifiers/names.

]]>
By: Zachary Hoefler/2011/06/10/strings-redux/#comment-5652 Zachary Hoefler Wed, 15 Jun 2011 01:48:00 +0000 We use murmurhash.

]]>
By: moo/2011/06/10/strings-redux/#comment-5611 moo Mon, 13 Jun 2011 18:01:36 +0000 Anecdotally, I've seen collisions in 32-bit CRC values for names of shaders (I can't remember the total number of shader names we were hashing, but it was a bit more than ten thousand). In general, if you are going to have more than a few thousand things to hash, 32-bit hashes start to get a bit risky. Its not a big deal if you can detect collisions very early (before the offending assets are submitted into the version control system!) and can easily rename one of them to resolve collisions. If the names are auto-generated following some kind of pattern, you might have to modify the pattern instead. In the shader collision case mentioned above, I think we renamed a folder which was part of all of the names being hashed... @Mike Nicolella: I haven't used a central server for hash values, but I have worked with engines where assets are assigned unique IDs from a 32- or 64-bit key space and a central server hands out blocks of keys to the users' machines. Even with 32-bit keys this works fine, but we experienced various "leaks" due to the design or implementation of our system -- features that would inadvertently cause large numbers of keys to get wasted over time as assets were changed. So with 32-bit keys you have to either be careful about that, or be capable of globally re-numbering all of the assets that have had a key assigned to them (or preferably both). With 64-bit keys you have so many of them available that you can afford to waste plenty of keys and just not worry about it. Anecdotally, I’ve seen collisions in 32-bit CRC values for names of shaders (I can’t remember the total number of shader names we were hashing, but it was a bit more than ten thousand).

In general, if you are going to have more than a few thousand things to hash, 32-bit hashes start to get a bit risky. Its not a big deal if you can detect collisions very early (before the offending assets are submitted into the version control system!) and can easily rename one of them to resolve collisions. If the names are auto-generated following some kind of pattern, you might have to modify the pattern instead. In the shader collision case mentioned above, I think we renamed a folder which was part of all of the names being hashed…

@Mike Nicolella:
I haven’t used a central server for hash values, but I have worked with engines where assets are assigned unique IDs from a 32- or 64-bit key space and a central server hands out blocks of keys to the users’ machines. Even with 32-bit keys this works fine, but we experienced various “leaks” due to the design or implementation of our system — features that would inadvertently cause large numbers of keys to get wasted over time as assets were changed. So with 32-bit keys you have to either be careful about that, or be capable of globally re-numbering all of the assets that have had a key assigned to them (or preferably both). With 64-bit keys you have so many of them available that you can afford to waste plenty of keys and just not worry about it.

]]>
By: Ben Sizer/2011/06/10/strings-redux/#comment-5523 Ben Sizer Sat, 11 Jun 2011 12:07:35 +0000 What hash function do you use for generating your 32-bit and 64-bit keys? What hash function do you use for generating your 32-bit and 64-bit keys?

]]>
By: Anton Shekhovtsov/2011/06/10/strings-redux/#comment-5482 Anton Shekhovtsov Fri, 10 Jun 2011 21:08:14 +0000 I don’t think the standard requires std::string::append() to have amortized linear growth, though I confess I haven’t read the standard in detail, just some posts about it, such as this one. The Dinkumware implementation has though (growth factor 1.5).

My dislike of string classes is not so much based on what any particular string class does though, but rather their viral nature. They have a tendency of infecting APIs and code everywhere, and being used for all kinds of inappropriate purposes (static strings) while not really DOING anything. I mean there is nothing there. Once you scale away the 10 000 lines of template code a string class is just a char buffer with an allocation policy. I prefer to be explicit about that. To choose what allocator and growth strategy I want to use.

The “small string optimization” can be useful, but again I think it is useful mostly because std::strings are overused. Does your program really have tons of small, but variable-sized, dynamic (not static) strings? Why?

]]>
By: Mike Nicolella/2011/06/10/strings-redux/#comment-5460 Mike Nicolella Fri, 10 Jun 2011 08:15:32 +0000 My experience with 32-bit hashes and collisions was frustrating, though going 64 to 32 did save a bunch of memory. Over the course of a project we had hundreds of collisions. There was only one global 'namespace', so everything could potentially collide with anything else, and hashed strings were used extensively (and rightly so!). Usually the collisions popped up in the nightly build, which meant there wasn't a good build in the morning. You need a process that auto-resolves conflicts, but the build still has to be re-started any time a collision is found, unless your pipeline and data formats are flexible enough to be able to go back and 'fix up' a hash after the fact. An intermediate "remap" table helps to not leave the source string (ie "jump_medium") compromised - to resolve a conflict, enter an explicit hash for the colliding string in the remap table, and that at least mitigates that. My experience with 32-bit hashes and collisions was frustrating, though going 64 to 32 did save a bunch of memory.

Over the course of a project we had hundreds of collisions. There was only one global ‘namespace’, so everything could potentially collide with anything else, and hashed strings were used extensively (and rightly so!).

Usually the collisions popped up in the nightly build, which meant there wasn’t a good build in the morning. You need a process that auto-resolves conflicts, but the build still has to be re-started any time a collision is found, unless your pipeline and data formats are flexible enough to be able to go back and ‘fix up’ a hash after the fact.

An intermediate “remap” table helps to not leave the source string (ie “jump_medium”) compromised – to resolve a conflict, enter an explicit hash for the colliding string in the remap table, and that at least mitigates that.

]]>
By: snake5/2011/06/10/strings-redux/#comment-5458 snake5 Fri, 10 Jun 2011 08:05:23 +0000 @Chris: It is debatable... but note that because of the "birthday paradox" the risk of collision is higher than you might think. When you start to approach tens of thousands of strings you can no longer neglect it. Also, it depends on how serious you think a collision is. I think it is pretty serious because I think names are pretty important. If an animator has made "jump_low", "jump_medium" and "jump_high"... it kind of sucks if she has to rename "jump_medium" to "jump_medium_asdfg". Add to that the fact that collisions will only be detected when assets are merged, which can be a long time after creation and that there can be a LOT of people creating data for the project. Then yes, I'm willing to spend the extra bits. @Chris: It is debatable… but note that because of the “birthday paradox” the risk of collision is higher than you might think. When you start to approach tens of thousands of strings you can no longer neglect it.

Also, it depends on how serious you think a collision is. I think it is pretty serious because I think names are pretty important. If an animator has made “jump_low”, “jump_medium” and “jump_high”… it kind of sucks if she has to rename “jump_medium” to “jump_medium_asdfg”.

Add to that the fact that collisions will only be detected when assets are merged, which can be a long time after creation and that there can be a LOT of people creating data for the project. Then yes, I’m willing to spend the extra bits.

]]>
By: David Neubelt/2011/06/10/strings-redux/#comment-5454 David Neubelt Fri, 10 Jun 2011 07:47:46 +0000 Thanks for the article! I generally strongly agree with everything you state, except one point: "You do not need a string class" I'll assume we are talking about C++ and std::string, since of course a custom class can solve whatever problems you'd like them to solve. I don't see the advantage to using a vector over std::string. Both containers have .size(), .capacity(), and .reserve(). std::string is a 'dynamic string' and grows much in the same way a vector does. The standard does say .reserve() may shrink the capacity (not true of vector), however your claim is that vector is clearer than std::string in stating that it's an amortized linear growth, but I disagree - both types state the same thing. The other advantage to having a string class over a vector is that it allows (and most implementations do implement) the "small string optimization" - that is, the class internally has an N (typically 16) character buffer it drags along with it, such that if the string you're storing is less than N characters, a dynamic allocation is completely avoided. That's not something vector does, and is handy enough to warrant having a separate string class to hide. Thanks for the article! I generally strongly agree with everything you state, except one point:

“You do not need a string class”

I’ll assume we are talking about C++ and std::string, since of course a custom class can solve whatever problems you’d like them to solve. I don’t see the advantage to using a vector over std::string. Both containers have .size(), .capacity(), and .reserve(). std::string is a ‘dynamic string’ and grows much in the same way a vector does. The standard does say .reserve() may shrink the capacity (not true of vector), however your claim is that vector is clearer than std::string in stating that it’s an amortized linear growth, but I disagree – both types state the same thing.

The other advantage to having a string class over a vector is that it allows (and most implementations do implement) the “small string optimization” – that is, the class internally has an N (typically 16) character buffer it drags along with it, such that if the string you’re storing is less than N characters, a dynamic allocation is completely avoided. That’s not something vector does, and is handy enough to warrant having a separate string class to hide.

]]>
By: Chris Waters/2011/06/10/strings-redux/#comment-5451 Chris Waters Fri, 10 Jun 2011 07:06:14 +0000