Localization: the old familiar concept that most development teams have to come face to face with sooner or later.

Whether you are a small indie developer or a big studio, if you want to reach a bigger audience, your game needs to be translated and localized. Of course, some games forgo text in favor of a purely audiovisual experience (Flower, for instance) and that's perfectly fine, but here we'll focus on the ones that do have text, even if it's only a small amount.

For the last ten years I've been involved in developing localization tools and talking to different translators and editors. When I think back on the tools involved, a lot of options immediately come to mind: Excel sheets, Word documents, .txt files, several custom applications and even mail clients.

After a while, and since software development is in constant evolution, I found myself thinking: okay, if those are the only options, surely they can be consolidated or at least managed to some extent. Everybody knows how to edit documents in those familiar applications, so it really isn't that big of a problem, right?

Turns out the problem not only didn't get any better, it has worsened. Add to that list more custom applications from localization companies, WordPress plugins, Facebook translations (I'm not sure about Google+, but they probably have something too), .resx files, a giant list of custom XML schemas, you name it.

This madness has to stop.

The problem

First of all, when one talks about localization, the usual subject is text files. But anyone who has been involved in the process knows that it actually covers a whole umbrella of elements. Text with Unicode support is a given (even though there is a plethora of issues there), but then you have images with text on them, videos, audio files, local legal requirements and maybe content tailored for a local market: one culture may think a symbol is cool while another may think it resembles death incarnate.

Sure, you could say: don't complicate things, just drop an Excel file or a document there, put an incremental number scheme in the file name and you are set. While that could work for text, it just doesn't cut it once images or audio are involved. Every format has its quirks and its own way of being tested, and it's not like your QA team speaks five major languages and can verify whether the content is correct or not.

So you have all these digital assets and you send them to translators and localizers, whether that's a big company or individual contractors. You get them back, you approve them, editing takes place. Voila: somebody missed a word here or skipped the proofreading there. You have to send it all back to the translators, and now the deadline has to be pushed back a little. Rinse, repeat.

Of course, there are already several localization and internationalization initiatives (i18n, l10n and so on). However, it's important to have something tailored to our industry.

But let's be a bit more precise and summarize the problems:

  • Inconsistency: Most tools don't talk to each other well. They may live on different platforms that you don't have access to. Conversions have to happen at several levels; "lost in translation" gets a whole new meaning here.
  • Inefficiency: A great overhead in time and resources. In many cases, files need to be synchronized across multiple departments and teams.
  • Follow-up nightmare: Text and localized content need corrections all the time, be it proofreading, typos or even cultural adjustments. Not many tools are friendly enough for translators to be comfortable with them.
  • Error prone: All these moving parts lead to a loss of localization quality. It's usually the part that gets underestimated and ends up with a lower priority, because the impact of a bad translation is not immediately obvious.
  • Expensive: A complex problem needs a complex solution, sounds about right. That sounds expensive too: either you hire a third-party company or you keep a fully grown localization team in-house. And if you are really small, well, you probably do it yourself, but then the amount of content you can pour in is limited since you have to worry about everything else too.

This definitely calls for a major review. But given so many angles to tackle, where shall we begin?

The proposal

A sample file

To solve this complex problem, I would definitely go with a KISS approach and get to the core of the issue. We may not end up with a perfect solution, but at least we'll have a solid foundation to build upon.

So let's picture a common denominator. We can reduce every part to a localizable element, a unit of localization if you will. Every unit can then have a type, which could be Text, Image, Video or anything we consider necessary. For the sake of argument, let's call it a Symloc (a portmanteau of Symbol and localization).


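Since no official schema exists yet, here is a minimal sketch of what a Symloc file might look like. The element and attribute names are purely illustrative (pinning them down is exactly the job of a standard), and Python's standard library is enough to read it:

```python
import xml.etree.ElementTree as ET

# Hypothetical Symloc sample; element and attribute names are made up
# for illustration, not part of any existing specification.
SAMPLE = """
<symlocs>
  <symloc id="menu.start_button" type="text" lang="en-US">
    <data>Start Game</data>
    <meta name="note">Main menu button; keep it short.</meta>
  </symloc>
  <symloc id="intro.cinematic" type="video" lang="en-US">
    <!-- Heavy assets carry a reference instead of inline data. -->
    <ref>assets/video/intro_en-US.webm</ref>
    <meta name="duration">00:01:32</meta>
  </symloc>
</symlocs>
"""

root = ET.fromstring(SAMPLE)
for unit in root.findall("symloc"):
    payload = unit.findtext("data") or unit.findtext("ref")
    print(unit.get("id"), f"({unit.get('type')}, {unit.get('lang')}) ->", payload)
```
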
Let's take that up a notch and formulate a basic list of requirements:

  • Unit of localization: A Symloc is a single atomic element that can be localized, meaning it can be translated or modified and corresponds to a given language. It may contain the data directly or just a reference (in the case of, say, a video that weighs a gigabyte).
  • Metadata: Any Symloc can carry additional information about the asset, either as an aid for translators and content editors or for consumption by other programs. Examples are preview information, notes, the size of an image or text, the length of an audio file, etc.
  • Grouping: Symlocs can be grouped together, either to keep the different languages of one asset in a single place or to keep related assets together (see the sketch after this list).
  • Support for every language out there: This one is a no-brainer. With Unicode support from the ground up and no size restrictions, no one is left behind.
  • Play nice with other standards: A lot of research has been poured into localization and internationalization standards, so we should lean on those for specific things like currency, date and time, telephone numbers and anything else that may differ between cultures.
  • XML as a foundation: XML is flexible enough to describe any of these asset types. Official schemas can then be used to validate that a localized document is well formed.
  • A true standard: Created as an open initiative; no single company should control the standard, to avoid conflicts of interest.
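
To make the grouping and metadata points concrete, here is a sketch of a grouped Symloc and of how a tool might pick the right language out of it. Again, the element names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical grouped Symloc: one asset, several languages, shared metadata.
GROUP = """
<symloc-group id="shop.buy_button" type="text">
  <meta name="note">Button label; about 12 characters of space.</meta>
  <symloc lang="en-US"><data>Buy now</data></symloc>
  <symloc lang="es-MX"><data>Comprar</data></symloc>
  <symloc lang="de-DE"><data>Jetzt kaufen</data></symloc>
</symloc-group>
"""

def pick_language(group_xml, lang):
    """Return the localized payload for `lang`, or None if it's missing."""
    group = ET.fromstring(group_xml)
    for unit in group.findall("symloc"):
        if unit.get("lang") == lang:
            return unit.findtext("data")
    return None

print(pick_language(GROUP, "es-MX"))  # Comprar
```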

Naturally, not everything that glitters is gold, so there are a couple of drawbacks that should be anticipated.

Pitfalls

First of all, there is a little overhead for small texts. Not to worry: Symlocs can be grouped, so if you have a swarm of small texts, you can put them all under the same category (which is simply another metadata element).

Second, every developer has their own final format, and it probably differs from everyone else's. This is where the solution is not perfect: the format is an intermediate one (in that regard, it's similar to COLLADA) that acts as a common gateway for localizable digital assets. Companies can then build their own scripts and tools to store the information however they see fit, generally for performance gains.
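
As a rough idea of what such a conversion script could look like, here is a sketch that flattens the hypothetical Symloc elements from the earlier samples into plain JSON string tables, standing in for whatever final format an engine actually loads:

```python
import json
import xml.etree.ElementTree as ET

def symloc_to_string_tables(symloc_path, out_dir):
    """Flatten a (hypothetical) Symloc document into one id -> payload
    table per language, written as plain JSON files the game can load."""
    tables = {}
    root = ET.parse(symloc_path).getroot()
    for unit in root.iter("symloc"):
        lang, sid = unit.get("lang"), unit.get("id")
        payload = unit.findtext("data") or unit.findtext("ref")
        if lang and sid and payload is not None:
            tables.setdefault(lang, {})[sid] = payload
    for lang, entries in tables.items():
        with open(f"{out_dir}/strings_{lang}.json", "w", encoding="utf-8") as out:
            json.dump(entries, out, ensure_ascii=False, indent=2)
    return tables
```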

Third, as with any new idea for a standard format, there is no way it's going to pick up steam if nobody uses it. Here we simply need a lead-by-example approach: if the standard is good and provides real benefits, others will follow. Otherwise, it will sink into oblivion.

Let’s conclude

Who would benefit from a standardized localization format for games? The answer is simple: everyone.

Game development companies could then worry only about their own conversion tools instead of building yet another translation system. Localization companies could deliver their translated assets with confidence, knowing that if they stick to the standard, the final result is going to work as expected.

Smaller teams can use other people's tools, even open source ones, to have their games translated into as many languages and cultures as possible, effectively extending their reach. They can be sure that volunteers can step in and send back game assets in a format that is easily inserted back into the game.

Translators can rest assured that the tools used for translation are becoming standardized, and stop worrying about learning yet another obscure system, effectively becoming more efficient by delivering more assets in less time.

Now, we just need somebody big out there to believe in this idea and lead by example.

Meanwhile, I'm going back to designing my own tool for this format.