In this article I will write about data schemas and why they are useful. The plan is to start a small series of articles about our toolchain at keen games and how we get data into our games. This first post mostly describes some of the problems and introduces data schemas as part of our solution. I will go into more depth in later blog entries.

Before diving into the specifics of the problem, let’s start at the beginning. (This section might be boring for some of you veteran coders out there, but I felt it makes sense to describe the problem as holistically as possible.)

All games need some kind of data (Hi Mike!) and most of this data follows some internal structure that is defined somewhere. For the sake of this post let’s assume we are working on a game with some kind of hostile entities in it. A typical entry of this data could look like this:

0x0018FEA4  00 00 40 40 00 00 80 3f 00 00 80 40 00 00 00 00 00 00 00 00 0a
0x0018FEB9  00 00 00 0a 00 00 00 02 00 00 00

Without any additional information we see the data only as a stream of bytes, without any structure or inherent semantics. Data is only useful with an associated specification. The following C structure shows a possible data structure that matches the data above:

// note that float32 is a typedef for a 32bit floating point number and uint32 an unsigned integer with 32bits
struct Enemy
{
	float32	positionX;
	float32	positionY;
	float32	collisionRadius;
	float32	velocityX;
	float32	velocityY;
	uint32	hitPoints;
	uint32	maxHitPoints;
	uint32	damagePerHit;
};

No, you shouldn’t use that as the foundation of your next AAA game, it’s just an example ;)

What did we do when we wrote down this C structure? We basically *named* specific areas in memory and gave them a ‘type’, which makes it possible for the compiler to do type-checks and compute the required byte-offsets automatically. We could have defined the same structure in this way:

struct Enemy2
{
	uint8	data[ 32u ];
};

That structure can contain exactly the same data as the previous one, but it’s a lot harder to access the fields. You can write code that uses this second data “structure” just fine, but you lose all the help from the compiler in doing so.
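
To make that concrete, here is a quick sketch (not from the original code, purely for illustration) of what field access could look like with the raw byte array. The byte offsets are counted by hand from the field order of the first structure:

#include <string.h>	// for memcpy

float32 getCollisionRadius( const Enemy2* pEnemy )
{
	// collisionRadius is the third float32, i.e. it starts at byte offset 8
	float32 collisionRadius;
	memcpy( &collisionRadius, &pEnemy->data[ 8u ], sizeof( collisionRadius ) );
	return collisionRadius;
}

void setHitPoints( Enemy2* pEnemy, uint32 hitPoints )
{
	// hitPoints is the first uint32 after five float32 values, i.e. byte offset 20
	memcpy( &pEnemy->data[ 20u ], &hitPoints, sizeof( hitPoints ) );
}

Every one of those hand-counted offsets silently breaks the moment a field is added, removed or reordered.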

So far this is all pretty trivial, but it shows that information about the structure of data makes it a lot easier to manipulate. Unfortunately the very helpful description in the first structure is trapped inside the compiler and we can’t easily use it outside of our compiled program (at least in languages without reflection support).
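
To illustrate what “trapped inside the compiler” means: inside the program we can ask the compiler for the layout it computed, but a separate tool that only ever sees our data files has no way to get at this information. A tiny example:

#include <stddef.h>	// for offsetof
#include <stdio.h>

void printEnemyLayout()
{
	// this knowledge only exists in code that is compiled against the Enemy definition
	printf( "sizeof( Enemy ) = %u\n", (uint32)sizeof( Enemy ) );
	printf( "offsetof( Enemy, hitPoints ) = %u\n", (uint32)offsetof( Enemy, hitPoints ) );
}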

Getting Data into the Game

To go a step further we have to look at some ways to get the data required for the game into the place we need it (into the RAM of our target machine).
The first way would be to write something like this:

void spawnEnemy( Enemy* pNewEnemy, float startPositionX, float startPositionY )
{
	pNewEnemy->positionX = startPositionX;
	pNewEnemy->positionY = startPositionY;
	pNewEnemy->collisionRadius = 2.0f;
	pNewEnemy->velocityX = 0.0f;
	pNewEnemy->velocityY = 0.0f;
	pNewEnemy->hitPoints = pNewEnemy->maxHitPoints = 100u;
	pNewEnemy->damagePerHit = 10u;
}

That works fine but has the major problem that every time you want to change anything in the data you have to change it in the code and build a new version of the game. That might work if you are just one person working on the game, but it will slow you down when cooperating with more people and completely breaks down in large teams.

To solve these issues we could have our data in an external file and just read the information about our enemy from there:

void spawnEnemy( Enemy* pNewEnemy, float startPositionX, float startPositionY )
{
	pNewEnemy->positionX = startPositionX;
	pNewEnemy->positionY = startPositionY;

	// note that you most likely *never* want to open a file synchronously in a function like spawnEnemy
	// (or use the CRT for files in the first place ;) ... but it's just an example
	FILE* pFile = fopen( "enemy.def", "rb" );

	// error checking omitted because it would become distractingly long otherwise..
	fread( &pNewEnemy->collisionRadius, sizeof( pNewEnemy->collisionRadius ), 1u, pFile );
	fread( &pNewEnemy->velocityX, sizeof( pNewEnemy->velocityX ), 1u, pFile );
	fread( &pNewEnemy->velocityY, sizeof( pNewEnemy->velocityY ), 1u, pFile );
	fread( &pNewEnemy->hitPoints, sizeof( pNewEnemy->hitPoints ), 1u, pFile );
	fread( &pNewEnemy->maxHitPoints, sizeof( pNewEnemy->maxHitPoints ), 1u, pFile );
	fread( &pNewEnemy->damagePerHit, sizeof( pNewEnemy->damagePerHit ), 1u, pFile );
	fclose( pFile );
}

OK, everything is fine now: the designer can change all the values without the coders getting involved. BUT that’s only part of the story. We now have at least two new problems:

  • Endianness and alignment: The same value is represented differently depending on the platform (it would be too easy without things like that! ;) ). As some of the current target platforms (Xbox 360, PS3) are big-endian machines and we use PCs as our development computers (which use the correct (little) endianness), we have to convert the data somewhere.
  • Structural changes: We still have to change the code when we add/delete a field or just change the type of an existing field.

In the first example the compiler automatically adjusted the endianness of all literal values during compilation because it knew the types and the target platform. With the second approach we lose that, because the actual data is invisible to the compiler. So we have to do the conversion ourselves.

We could store the data in a well-known endianness and alignment in the file and convert it during loading (if necessary). This results in code like this:

void readFloat32( float32* pTarget, FILE* pFile )
{
	fread( pTarget, sizeof( float32 ), 1u, pFile );

	// we know that we store the values in little-endian:
	#ifdef TARGET_IS_BIG_ENDIAN
	swapEndianness32( pTarget );	//< swaps the endianness of a 32 bit value
	#endif
}

// .. more code for different data types

void spawnEnemy( Enemy* pNewEnemy, float startPositionX, float startPositionY )
{
	pNewEnemy->positionX = startPositionX;
	pNewEnemy->positionY = startPositionY;

	FILE* pFile = fopen( "enemy.def", "rb" );
	readFloat32( &pNewEnemy->collisionRadius, pFile );
	readFloat32( &pNewEnemy->velocityX, pFile );
	readFloat32( &pNewEnemy->velocityY, pFile );
	readUint32( &pNewEnemy->hitPoints, pFile );
	readUint32( &pNewEnemy->maxHitPoints, pFile );
	readUint32( &pNewEnemy->damagePerHit, pFile );
	fclose( pFile );
}
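
swapEndianness32() and readUint32() are only referenced above. A minimal sketch of how the byte swap could be implemented (assuming a plain byte-wise reversal of a 32 bit value is all we need):

// reverses the byte order of any 32 bit value in place
void swapEndianness32( void* pValue )
{
	uint8* pBytes = (uint8*)pValue;

	uint8 temp = pBytes[ 0u ];
	pBytes[ 0u ] = pBytes[ 3u ];
	pBytes[ 3u ] = temp;

	temp = pBytes[ 1u ];
	pBytes[ 1u ] = pBytes[ 2u ];
	pBytes[ 2u ] = temp;
}

readUint32() follows the same pattern as readFloat32(), just with uint32 as the element type.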

Note that you can put lots of C++ syntactic sugar on top of this solution to get something “nicer” (for some definition of “nice”), but that doesn’t change the solution at all. You still have to do something for each field of a structure, which means the structural change problem remains:

Whenever we change the structure of the data we have to change the code, and do it correctly in all the right places, because otherwise we misinterpret the data, which leads to strange results.

Additionally we haven’t said anything yet about the creation of the file that contains the data. In the example above it is a binary file, and these files are directly tied to the data structures in our code. Every time we change the structure of our data, *all* the data files containing instances of that structure become invalid. That is a serious problem because it increases the cost of structural changes by *a lot*.

Instead of converting the data during reading we could do that in the asset pipeline and have a special preprocessed version of the file for each platform. The reading code would look something like the following:

void spawnEnemy( Enemy* pNewEnemy, float startPositionX, float startPositionY )
{
	FILE* pFile = fopen( "enemy.def", "rb" );
	fread( pNewEnemy, sizeof( Enemy ), 1u, pFile );
	fclose( pFile );

	// note that the data file has to contain dummy values for the positionX and positionY members and we have to write to them *after* the fread() obviously.
	// So the data file is a bit bigger. This is another argument for separating data depending on life-time and use instead of an abstract idea of a 'class'
	pNewEnemy->positionX = startPositionX;
	pNewEnemy->positionY = startPositionY;
}
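
As an aside, the “separating data depending on life-time and use” remark from the comment could look like this (the split and the names are made up for illustration): the fields that come out of the data file live in one structure, the purely runtime state lives in another, and no dummy values are needed in the file at all.

// hypothetical split, purely for illustration
struct EnemyDefinition	// loaded 1:1 from the data file
{
	float32	collisionRadius;
	float32	velocityX;
	float32	velocityY;
	uint32	hitPoints;
	uint32	maxHitPoints;
	uint32	damagePerHit;
};

struct EnemyState	// filled in at spawn time, never stored in the file
{
	float32	positionX;
	float32	positionY;
};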

That’s a lot better because we don’t have to change the code anymore when we add a member. We still have to recreate the data files though. So how would we do that? We could have a source data file that contains a higher-level representation of the data, and a program that converts this high-level representation into the platform-specific target format.

We could for example use JSON as our high-level data file:

{ "Enemy": { "collisionRadius" : "2.0", "velocityX" : "0.0", "velocityY" : "0.0", "hitPoints" : "100", "maxHitPoints" : "100", "damagePerHit" : "10" } }

and have the following code that converts the data:

void writeEnemy( FILE* pFile, const JSonData& data )
{
	// the writeXXX functions know the target platform and automatically convert the data to the correct endianness
	writeFloat32( pFile, getFloat32FromJson( data, "collisionRadius" ) );
	writeFloat32( pFile, getFloat32FromJson( data, "velocityX" ) );
	writeFloat32( pFile, getFloat32FromJson( data, "velocityY" ) );
	writeUint32( pFile, getUint32FromJson( data, "hitPoints" ) );
	writeUint32( pFile, getUint32FromJson( data, "maxHitPoints" ) );
	writeUint32( pFile, getUint32FromJson( data, "damagePerHit" ) );
}

Great! We now have very easy loading code that doesn’t need to be changed when we change the Enemy data structure. But we didn’t really solve the problem! We just added another layer between the source data and the game: The conversion program. This program *still* has to be modified when the data structure in the game changes.

To really solve the problem we need information about the structure of the target data during the conversion process! And we need it in a way that is not hard-coded but can automatically adapt to changes in the data structure. We basically need exactly the information the compiler has about our structures. We have several ways to achieve that:

  1. Parse the structure definitions in the source code with our own parser. This is a really neat idea. We already defined the structure of our data, and duplicating code is always bad. So why don’t we use the structure definition at hand? We would have to write a parser for our programming language (mostly C/C++ nowadays) and we are set. Together with the information about the target platform we know the exact memory layout beforehand. Unfortunately there are some downsides to this approach:
    • We have to write a parser for C++ (C is really easy in comparison and if you only use C in your game you could easily do that). One option would be to buy one or use clang or something like that.
    • We are limited to what the original language can express. There is no easy way to provide additional information about data types because the original language just doesn’t have those features.
    • We create a direct dependency from our source code to the asset pipeline. This means that the asset pipeline (the convert function above) has to have access to the current version of the source code. That might not be a problem in your case, but in our company we use different version control systems for assets and code, so this point was a problem for us.
  2. Instead of directly parsing the code we could parse special ‘tags’ in comments, which only solves the first two problems of the first approach and requires duplicating information.
  3. The third way, and the one I can recommend, is to have a separate language to define our data structures and to generate the C/C++ code for them automatically in a pre-build step. This has several nice features:
    • You can keep your language as simple or as powerful as you need it to be, and the parser should be relatively easy to write
    • You only define your data once
    • The language can contain all information that you want

    There are some problems though:

    • You need a build system that supports generating the C++ code before the actual compilation. That should not be a problem with any build system (even make can do that!)
    • It can feel strange at first to not write data structures directly in C anymore. You obviously only need this system for data structures that should be visible to the outside, not for internal stuff, but you still have to get used to it. What helped in our company was to use a syntax that resembles C structs as closely as possible.

Data Schema Language

Now that we have such a system we can define the data structure in our data schema language:

struct Enemy
{
	float32	positionX	<< defaultValue=0.0, hidden=true >>;
	float32	positionY	<< defaultValue=0.0, hidden=true >>;
	float32	collisionRadius	<< defaultValue=10.0 >>;
	float32	velocityX	<< defaultValue=0.0 >>;
	float32	velocityY	<< defaultValue=0.0 >>;
	uint32	hitPoints	<< defaultValue=100 >>;
	uint32	maxHitPoints	<< defaultValue=100, minValue=1, maxValue=1000 >>;
	uint32	damagePerHit	<< defaultValue=10 >>;
};

This looks nearly exactly like the original C structure, except for the additional annotations on the fields. These are mostly useful for tools/editors working with this data.

With this data-schema we can fire up our compiler, which parses this “program” and writes out a C header file containing the original C structure. Additionally we can use the data-schema for the following:

  1. We can write generic code that converts a JSON/XML data file with a given data-schema type into the platform-specific binary file. That is *huge* because it allows us to get data into the game without a single changed line of code (a rough sketch of such a converter follows after this list).
  2. Given the data-schema we are able to write generic editing code that creates widgets from the data-schema and allows editing of types without a single line of editing-code for the specific data structure.
  3. Given a void* to a piece of memory and the data-schema, we can write code that dumps the data in a structured way. This can be a very useful debugging tool, but it can also be used to create serialization code (for save-games or network replication).
  4. We could embed the data-schema into the binary files, which would make them self-describing like XML. That is really useful because you can always look into the files and get a structured view of the data instead of a hex view. Additionally you can check that a binary file is compatible with your current game by comparing a checksum of the data-schema.
  5. - 100. … use your imagination ;)
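
To give an idea of point 1: with a runtime description of the schema, the hand-written writeEnemy() from above collapses into one generic loop. The layout of that description below is hypothetical (it is whatever your schema compiler emits), and the writeXXX()/getXXXFromJson() helpers are the ones assumed in the earlier examples:

// hypothetical runtime representation of a data-schema, e.g. emitted by the schema compiler
enum SchemaFieldType
{
	SchemaFieldType_Float32,
	SchemaFieldType_Uint32
};

struct SchemaField
{
	const char*		pName;
	SchemaFieldType	type;
};

// the table for Enemy, as it could be generated from the schema (hidden fields like positionX/positionY are omitted)
static const SchemaField s_enemyFields[] =
{
	{ "collisionRadius",	SchemaFieldType_Float32 },
	{ "velocityX",		SchemaFieldType_Float32 },
	{ "velocityY",		SchemaFieldType_Float32 },
	{ "hitPoints",		SchemaFieldType_Uint32 },
	{ "maxHitPoints",	SchemaFieldType_Uint32 },
	{ "damagePerHit",	SchemaFieldType_Uint32 }
};

// one generic converter instead of one hand-written writeXXX() per structure
void writeInstance( FILE* pFile, const JSonData& data, const SchemaField* pFields, uint32 fieldCount )
{
	for( uint32 i = 0u; i < fieldCount; ++i )
	{
		switch( pFields[ i ].type )
		{
		case SchemaFieldType_Float32:
			writeFloat32( pFile, getFloat32FromJson( data, pFields[ i ].pName ) );
			break;

		case SchemaFieldType_Uint32:
			writeUint32( pFile, getUint32FromJson( data, pFields[ i ].pName ) );
			break;
		}
	}
}

Because the field table is generated from the data-schema, adding a field to the schema automatically updates both the C struct and the conversion, without touching the converter code.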

I hope I could convince you that this is an interesting solution to a problem that most game companies have. I have seen lots of ad-hoc/manual solutions over the past years and was always quite unhappy with them (because they required lots of manual synchronization of code and data and broke easily).
We are very happy with this solution so far, having used it for roughly a year now, and we have lots of ideas to extend it even further. I would be really pleased to hear your solution to this problem and what you think of ours.

If requested I can go into more depth on some specifics of our implementation, or write about something completely different in the future (please comment if you want to know more about something specific).