Comments on: Putting The ‘You’ Back Into Unicode

Not trolling at all. UTF-8 can use up to 6 bytes for the highest characters. Right now it would be rare to see a 5-byte character, though; you are correct in that.
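
For reference, a minimal sketch of the byte counts involved (the function name is illustrative; the 5- and 6-byte forms exist only in the original pre-RFC 3629 UTF-8 design, which RFC 3629 later capped at 4 bytes and U+10FFFF):

```cpp
#include <cstdint>
#include <cstdio>

// Number of bytes needed to encode a code point under the original UTF-8
// design (up to 6 bytes). RFC 3629 later restricted UTF-8 to at most
// 4 bytes, covering code points up to U+10FFFF.
int utf8_encoded_length(std::uint32_t cp) {
    if (cp < 0x80)      return 1;   // ASCII
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp < 0x200000)  return 4;   // RFC 3629 stops here (<= U+10FFFF)
    if (cp < 0x4000000) return 5;
    return 6;
}

int main() {
    std::printf("%d %d %d\n",
                utf8_encoded_length(0x41),      // 'A' -> 1 byte
                utf8_encoded_length(0x20AC),    // euro sign -> 3 bytes
                utf8_encoded_length(0x10FFFF)); // highest Unicode code point -> 4 bytes
}
```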

Consoles are just one facet of development. For them, where space is paramount, yes, it is a big deal, but then again consoles have always had that restriction in comparison to PCs. That’s why I started with “PCs have …”

Templates generate code for each type involved. If you have one set of library routines for UTF-32, it is likely to be the same or less code than using one or two different types in a template.
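
A small sketch of the trade-off being described, assuming a typical templated string routine (the names are illustrative): the template is stamped out once per character type it is used with, while a UTF-32-only routine exists exactly once in the binary.

```cpp
#include <cstddef>

// Templated routine: the compiler emits one copy per character type used.
template <typename Char>
std::size_t str_length(const Char *s) {
    std::size_t n = 0;
    while (s[n]) ++n;
    return n;
}

// UTF-32-only routine: exactly one copy ends up in the binary.
std::size_t str_length32(const char32_t *s) {
    std::size_t n = 0;
    while (s[n]) ++n;
    return n;
}

int main() {
    str_length("hello");    // instantiates str_length<char>     (UTF-8 bytes)
    str_length(U"hello");   // instantiates str_length<char32_t> (UTF-32)
    str_length32(U"hello"); // the single non-template version
}
```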

I’m not saying what you’ve done is wrong; in fact, I applaud that you’ve built, tested, and released these functions into the public domain. I use UTF-8 myself, because authoring UTF-32 is basically annoying as hell. I’m just attacking the assumption everyone has that UTF-32 is such a gigantic waste of space, when virtually everything about modern computers is a waste of space. It’s like crying about leaving the water on while you brush your teeth, but taking two-hour showers. Both are bad, indeed, but hell, if you’re going to be that leisurely, it seems rather pointless to complain about the running tap.

By: Garett Bass on Thu, 31 Mar 2011 19:25:38 +0000 (/2011/03/30/putting-the-you-back-into-unicode/#comment-2184)

I think it is funny that we buy systems with a minimum of 2GB of RAM now, usually 4GB-8GB, and we’re still stressing about how UTF-32 uses 32 bits (!) to encode a single character rather than variable-length UTF-8, which uses 8 to 48 bits. Let’s consider that a shortish novel is about 1.2MB of ASCII. Clearly, if your game has this much text, it must be quite involved. Now, let’s say that instead of UTF-8 you used UTF-32. You’ve jumped to a whopping 4.8MB of text!
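
The arithmetic behind those figures, as a throwaway sketch (1 byte per ASCII character versus a fixed 4 bytes per UTF-32 code point):

```cpp
#include <cstdio>

int main() {
    // A shortish novel: roughly 1.2 million characters of ASCII text.
    const double chars = 1.2e6;

    const double utf8_mb  = chars * 1 / 1e6;  // ASCII stays 1 byte per character in UTF-8
    const double utf32_mb = chars * 4 / 1e6;  // UTF-32 is always 4 bytes per code point

    std::printf("UTF-8: %.1f MB, UTF-32: %.1f MB\n", utf8_mb, utf32_mb);  // 1.2 MB vs 4.8 MB
}
```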

Really guys? Why doesn’t anyone complain that the Win32 API “BOOL” type is 4 bytes (typedef int BOOL) rather than a single byte (“typedef char BOOL”)? Oh right, our CPUs are optimized for handling structures that are a multiple of the native CPU word size, and 3 wasted bytes are much better than unaligned data.
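
A hedged sketch of the alignment point (the struct names are made up; exact padding depends on the ABI, but on common platforms both structs come out the same size):

```cpp
#include <cstdio>

typedef int  Bool4;  // stand-in for the Win32 "typedef int BOOL"
typedef char Bool1;  // the hypothetical 1-byte alternative

struct WithIntBool  { Bool4 flag; int value; };  // 4 + 4 bytes
struct WithCharBool { Bool1 flag; int value; };  // 1 byte + (typically) 3 bytes of padding + 4 bytes

int main() {
    // On typical 32/64-bit ABIs both print 8: the "saved" 3 bytes are
    // given back as padding so the int stays naturally aligned.
    std::printf("%zu %zu\n", sizeof(WithIntBool), sizeof(WithCharBool));
}
```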

I’d say that the real reason that UTF-32 isn’t used as often is more related to network/Internet transmission (UTF-8 then is just another compression technique), authoring tools (how many tools allow you to edit UTF-32 text and/or resources? Spellcheck them?), and finally a lack of standardized, built-in, cross-platform library support (Win32 uses 16-bit chars in a weird pseudo-UTF-16, UNIX [generally] uses UTF-32 — end result: wchar_t is effectively overloaded). Then again, maybe people are really concerned with saving their precious bytes… in which case they won’t use C++ templates at all and compile for code size. “gcc -Os” anyone?
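
The wchar_t overload is easy to demonstrate; a minimal sketch (the sizes in the comments are typical platform values, not guarantees):

```cpp
#include <cstdio>

int main() {
    // wchar_t has no fixed width: typically 2 bytes on Windows (UTF-16-ish)
    // and 4 bytes on most UNIX-like systems (UTF-32).
    std::printf("wchar_t:  %zu bytes\n", sizeof(wchar_t));

    // C++11's fixed-width character types sidestep the ambiguity.
    std::printf("char16_t: %zu bytes\n", sizeof(char16_t));  // always 2
    std::printf("char32_t: %zu bytes\n", sizeof(char32_t));  // always 4
}
```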

By: Garett Bass on Wed, 30 Mar 2011 17:19:46 +0000 (/2011/03/30/putting-the-you-back-into-unicode/#comment-2147)

I prefer to keep the data as UTF-8 both in memory and on disk. I find it conceptually simpler to just know that all strings everywhere are UTF-8. And it saves memory.

It is true that conversion to code points is a bit slower, but I don’t think that matters. The string manipulation I do (JSON parsing, sprintf, concatenation, template filling, etc.) all works directly on UTF-8 strings without needing any special processing to handle multibyte characters. The only time I need to convert to code points is when I render the strings on screen (and then I do many other slow things — code point conversion is hardly the bottleneck). In fact, there are only three calls to utf8_decode() in the entire BitSquid engine (all in the text drawer).
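
A sketch of why that works (illustrative only, not the engine’s code): every byte inside a multibyte UTF-8 sequence has its high bit set, so byte-level concatenation and searching for ASCII delimiters can never split or falsely match a multibyte character.

```cpp
#include <cstdio>
#include <cstring>
#include <string>

// Concatenation is plain byte appending; no decoding needed.
std::string make_key(const std::string &category, const std::string &name) {
    return category + "/" + name;
}

// Searching for an ASCII delimiter is safe: every byte inside a multibyte
// UTF-8 sequence is >= 0x80, so it can never equal '"' or ','.
const char *find_quote(const char *utf8) {
    return std::strchr(utf8, '"');
}

int main() {
    std::string key = make_key("weapons", "r\xC3\xA4ksm\xC3\xB6rg\xC3\xA5s");  // UTF-8 bytes for "räksmörgås"
    std::printf("%s\n%s\n", key.c_str(), find_quote("value = \"hello\""));
}
```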

I find it hard to think of other cases where I would need to access the string codepoint by codepoint. The only one I can think of is truncation — truncating a name to 20 glyphs for display or doing a telex-like effect, where a message appears glyph by glyph. But those are very special cases. I’d rather have them be slow than use UTF-16 everywhere.
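
A hedged sketch of what such a truncation could look like, working code point by code point (the helper below is a generic decoder step, not the utf8_decode() mentioned above; it assumes well-formed UTF-8, and real glyph counting would also have to deal with combining marks):

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Size in bytes of the UTF-8 sequence starting at this lead byte
// (assumes well-formed input).
static std::size_t utf8_char_size(unsigned char lead) {
    if (lead < 0x80) return 1;   // ASCII
    if (lead < 0xE0) return 2;   // 110xxxxx
    if (lead < 0xF0) return 3;   // 1110xxxx
    return 4;                    // 11110xxx
}

// Keep at most max_chars code points, never cutting a sequence in half.
std::string truncate_utf8(const std::string &s, std::size_t max_chars) {
    std::size_t i = 0, n = 0;
    while (i < s.size() && n < max_chars) {
        i += utf8_char_size(static_cast<unsigned char>(s[i]));
        ++n;
    }
    return s.substr(0, i);
}

int main() {
    // "åäö" is 6 bytes of UTF-8 but only 3 code points.
    std::printf("%s\n", truncate_utf8("\xC3\xA5\xC3\xA4\xC3\xB6", 2).c_str());  // prints "åä"
}
```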

By: martinsm on Wed, 30 Mar 2011 09:08:26 +0000 (/2011/03/30/putting-the-you-back-into-unicode/#comment-2128)
It also includes conversion to and from UTF-16 and UTF-32.
