Don’t worry, just parallelize everything and have it run over all the cores.

Ouch. This is going to hurt. Invariably it does; when you end up with that really hard-to-find heap corruption issue, or when odd pauses creep into the profiler when calling into that middleware. Parallel execution is tough at the best of times, and I think it is harder still in gamecode, where the notion of a game frame is key. Something I have learned over the years looking at codebases is that many developers do very little groundwork in preparation for parallel operation. Laying clear code foundations and keeping to a small set of rules can mean much less pain in the long term. In this post, I'm going to share the main tips/rules I try to abide by when doing that job; something I find myself doing often to improve the prospects for more parallelism.

These tips make designing systems for parallel execution simpler, but it is still not easy street. You need deep knowledge of the systems you are working with, the data they touch, and above all an understanding of what makes things inherently serial. You also need to understand how you are going to achieve the parallelism: are you using multiple threads running over multiple jobs, running a large subsystem for most of a frame, or are you going to use the SPUs?

Tip 1: Know your critical path

It is essential you know what the critical path is for a given frame. When we are talking parallel execution, the path will not be simple; it will branch off, swap threads, even move over a hardware architecture boundary (PPU->SPU->PPU/CPU->GPU). Stating this may seem obvious, but I have worked with codebases that don't even have the simplest of things; for example, profiler markers around the major systems on CPU. Knowing the critical path, and being able to understand it easily, will save you time when trying to figure out what that massive wait is halfway into the frame, and how it is impacting all the other game subsystems. Without knowing the critical path, developers trying to parallelise will waste time stabbing around in the dark.
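Even something as simple as a scoped marker gets you most of the way there. Below is a minimal sketch; ScopedMarker and PROFILE_SCOPE are hypothetical stand-ins of mine, with printf in place of whatever event stream your actual profiler consumes.

```cpp
#include <cstdio>

// RAII marker: emits a begin event on construction and an end event on
// destruction, so every scope is bracketed even on early returns.
struct ScopedMarker {
    const char* name;
    explicit ScopedMarker(const char* n) : name(n) {
        std::printf("BEGIN %s\n", name); // real code: push to the profiler stream with a timestamp
    }
    ~ScopedMarker() {
        std::printf("END   %s\n", name);
    }
};

// Two-level concat so __LINE__ expands, giving each marker a unique name.
#define PROF_CONCAT2(a, b) a##b
#define PROF_CONCAT(a, b)  PROF_CONCAT2(a, b)
#define PROFILE_SCOPE(str) ScopedMarker PROF_CONCAT(profMarker_, __LINE__)(str)

void UpdatePhysics() { PROFILE_SCOPE("Physics"); /* ... */ }
void UpdateAI()      { PROFILE_SCOPE("AI");      /* ... */ }

int main() {
    PROFILE_SCOPE("Frame");
    UpdatePhysics();
    UpdateAI();
    return 0;
}
```

With markers like these around the major systems, the critical path falls straight out of the capture.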

Tip 2: Move Entity/Component/Thing systems onto Read, Update and Write Phases

Data races, the big bad boss of parallel computation, are caused mainly by side-effects. That is, given a block of code, side-effects are the possible changes that can occur to data outwith that context. So if you have an entity whose sole job is to impose a force on the player spacecraft when near gravity, the side-effect is the force applied to the spacecraft. Internally, this entity will have written some internal state, and writes to this data are okay. Internal state like this is private, as it were. Global state side-effects can cause race conditions, corrupt data, and generally inflict a mountain of pain.
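To make the distinction concrete, here is a tiny sketch of that gravity entity; all the names are mine, for illustration only. The write to its own member is private state, while the AddForce call is the side-effect that escapes the entity and can race with other writers.

```cpp
struct Vec3 { float x, y, z; };

struct Spacecraft {
    Vec3 accumulatedForce;          // shared: visible to every entity
    void AddForce(const Vec3& f) {  // the side-effect we need to control
        accumulatedForce.x += f.x;
        accumulatedForce.y += f.y;
        accumulatedForce.z += f.z;
    }
};

struct GravityWellEntity {
    float m_timeInField;                        // internal state: safe to write freely
    void Update(Spacecraft& craft, float dt) {
        m_timeInField += dt;                    // private write: okay
        craft.AddForce(Vec3{0.f, -9.8f, 0.f});  // global write: a data race if another
                                                // entity does the same concurrently
    }
};
```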

Because side-effects cause the most problems, it makes a lot of sense to separate the writing of that data into its own Phase. The gist is that you can run the update of the entity at any time, but run the Write Phase in a controlled manner. In many cases the Write Phase is small, allowing you to run it serially on the main CPU if the order is essential. Running one section, or just a few entities of a section, in serial may give much larger Update Phases parallel potential, and the developer must take that on board when making the call.

I would also add a Read Phase, because being able to prefetch some data at a defined point during execution gives you the widest possible options. For example, if you collect all the gravity forces in the Read Phase, you have more choices when scheduling the Update Phase, as you are not bound to run it at a time when global state is correct (I will explain what correct means in this context later). It also means that if you need to put this code onto SPU, you could run the Read Phase on PPU, which may be of use if the data is not contiguous and involves lots of pointer chasing to get at the meat of the system. You can then send only what the Read Phase gathered to the Update and Write Phases, which could live happily on SPU.
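Pulling that together, here is a sketch of the same gravity entity split into the three phases. Again, names and layout are illustrative rather than any real engine API: Read caches the global state Update needs, Update works only on that cache, and Write publishes the single side-effect.

```cpp
struct Vec3 { float x, y, z; };

struct Spacecraft {
    Vec3 position;
    Vec3 accumulatedForce;
    void AddForce(const Vec3& f) {
        accumulatedForce.x += f.x;
        accumulatedForce.y += f.y;
        accumulatedForce.z += f.z;
    }
};

struct World {
    Spacecraft craft;
    float      gravityStrength;
};

struct GravityWellEntity {
    Vec3  m_cachedCraftPos;  // filled in by the Read Phase
    float m_cachedStrength;  // filled in by the Read Phase
    Vec3  m_pendingForce;    // produced by Update, applied by Write

    void ReadPhase(const World& world) {
        // Do the pointer chasing here, once (on PPU if need be), and
        // copy by value so Update can run anywhere, at any time.
        m_cachedCraftPos = world.craft.position;
        m_cachedStrength = world.gravityStrength;
    }

    void UpdatePhase(float dt) {
        // A pure function of the cached data: no global reads or
        // writes, so every entity's Update can run in parallel.
        m_pendingForce = Vec3{0.f, -m_cachedStrength * dt, 0.f};
    }

    void WritePhase(World& world) {
        // The one side-effect, run in a controlled (possibly serial) order.
        world.craft.AddForce(m_pendingForce);
    }
};
```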

Obviously, both Read and Write Phases have catches: you need extra memory, which can be pretty scarce at the best of times, to store the cached Read/Write Phase data, and of course sometimes it is impractical to get all the data you need in the Read Phase. In the same way, doing all your writes only in the Write Phase can sometimes be an issue. The main point of this is to allow greater scope for parallelism. You have more options when it comes to scheduling the work, as in practice dependencies can be as simple as a single read or write which then impacts a massive amount of Update work. Being able to control the reads and writes ultimately allows for more parallelism, because you can schedule many different items of the same Phase at once.
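To show what that scheduling freedom looks like, here is a sketch of the frame loop over a batch of those entities. It reuses the World and GravityWellEntity types from the sketch above, with std::thread standing in for a proper job system purely to keep the example self-contained; in practice you would kick one job (or SPU task) per batch of entities, not one thread each.

```cpp
#include <thread>
#include <vector>

void RunFrame(World& world, std::vector<GravityWellEntity>& entities, float dt) {
    // Read Phase: run while global state is known to be correct.
    for (auto& e : entities)
        e.ReadPhase(world);

    // Update Phase: every entity at once; nothing shared is touched,
    // so no locks are needed and the work can move off the main thread.
    std::vector<std::thread> jobs;
    for (auto& e : entities)
        jobs.emplace_back([&e, dt] { e.UpdatePhase(dt); });
    for (auto& j : jobs)
        j.join();

    // Write Phase: serial, so the side-effects land in a known order.
    for (auto& e : entities)
        e.WritePhase(world);
}
```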

Tip 3: Incorrectness

This is something I enjoy planning for, as it opens up various parallelization opportunities, but it can get incredibly complex. It follows on directly from Tip 2. As I mentioned, it can be completely impractical to read everything in the Read Phase, and the same goes for the Write Phase.

What? So basically what you are saying is that having Read Phases and Write Phases to control the movement of side-effects is impractical?

Yes. Unless, of course, you plan your code in such a way that it can deal with reading incorrect data. Invalid data is the really bad stuff: data which could cause the system to exhibit undefined behaviour, such as a bad pointer dereference. Incorrect data, on the other hand, is perfectly valid for use. It may be a frame or two out of sync, but the system will accept it, process it, and not fall over. Reading incorrect data may happily be done in the Update Phase of our system, as it will not cause serious problems regardless of when the read occurs. For example, it may cause a particle instance to remain active a frame longer than it should. Allowing for this, and designing your code to accept the situation, can open up a massive amount of parallelism.
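One common way to arrange for reads to be merely incorrect rather than invalid is to double-buffer the shared state, so a parallel Update always reads last frame's snapshot. A minimal sketch with hypothetical names; the flip happens at a controlled point, such as the end of the Write Phase:

```cpp
#include <cstddef>

struct ParticleState {
    bool  active;
    float lifeRemaining;
};

struct DoubleBuffered {
    ParticleState buffers[2];
    std::size_t   readIndex;  // which buffer parallel readers see

    DoubleBuffered() : buffers(), readIndex(0) {}

    // Always valid to read, though possibly a frame stale.
    const ParticleState& Read() const { return buffers[readIndex]; }
    // Next frame's state, filled in by the owner.
    ParticleState&       Write()      { return buffers[readIndex ^ 1]; }
    // Done once per frame, at a safe point in the schedule.
    void Flip() { readIndex ^= 1; }
};

// An Update elsewhere can read state that is a frame out of date: at
// worst the particle stays "active" one frame longer than it should.
bool ShouldSpawnTrail(const DoubleBuffered& particle) {
    return particle.Read().active;
}
```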

Generally, if you can cope with the fact that things may not be completely up to date with the real version of the world every frame, you can lower the reliance on locks around shared data. In my experience it is quite rare that a data structure changes many times during a frame, and that is exactly the type of operation we would ensure happened in the Write Phase. Ensuring the changes to data structure (which can cause invalid reads later on) are done in the Write Phase means we can keep Update Phase parallelism, but keep the dependency on that write, allowing it to happen later in time.

Again, if there are systems which require some data to be correct in the Update Phase, you would extract that read out into the Read Phase. Then you only have a dependency on the set of writes from previous entities, again keeping Update Phase parallelism. Any writes which modify data structure should be extracted into the Write Phase.
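As a sketch of that last point, the classic structural change is removing entities from a container mid-frame. Deferring the mutation to the Write Phase keeps every Update Phase read valid; the names below are illustrative.

```cpp
#include <algorithm>
#include <vector>

struct Entity {
    bool pendingKill;
    Entity() : pendingKill(false) {}

    void UpdatePhase() {
        // Gameplay logic may decide the entity is done; that is just a
        // flag write to private state, not a container mutation:
        // pendingKill = true;
    }
};

// The only place the vector's structure changes. Any Update that saw a
// soon-to-die entity this frame read incorrect data, never invalid data.
void WritePhaseStructuralChanges(std::vector<Entity>& entities) {
    entities.erase(
        std::remove_if(entities.begin(), entities.end(),
                       [](const Entity& e) { return e.pendingKill; }),
        entities.end());
}
```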

Tips two and three are big ones, so to sum up: if you are trying to have an entity or component system run in parallel, try to organise it into Read-Update-Write Phases, and plan for the instances where data can be incorrect. This gives you control of side-effects: control over the reading of data which must be correct, changing of data structure in the Write Phase, and arguably an easier route onto SPU. There are some big gotchas here, and you need to weigh both the pros and cons, but in my view the added burden of dealing with incorrect data can allow for a sizable parallelism boost.

Tip 4: Middleware

There is nothing more frustrating than getting the components that have been causing you frame-time trouble parallelized, only to find it all running slightly slower. You load up the profiler and those 'Thread Safe' middleware calls you made are actually just entering a global critical section. Sections of your main thread, especially on PS3 where the PPU is the most valuable resource, may be sitting there doing absolutely nothing.

This sort of situation creeps up from time to time, and really you just need to accept what the middleware does and try to think up a way of changing your code to make the middleware's behaviour insignificant in terms of the critical path.

As an example, the last time I had to do this was for parallel adding and removing of objects from the physics system we were using. This operation could occur at any time, and if it happened during the main physics update step it would block until the update had completed. I added a flag to the object to say whether there was a pending operation; when I wanted the object added or removed, I just set that flag appropriately and left everything else untouched. Then in the Write Phase of the object, whose execution I could control, I ran through the flags and applied the middleware operations as necessary. This is also a good example of incorrect data, as another entity may have checked the object, seen it was still active in the physics system, and done something with it. If you end up doing this, you need to think beyond frames, and that can add many complex corner cases. In my view the added complexity was worth it, as it allowed for minimal impact on the critical path, keeping parallelism potential.
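Here is a hedged sketch of that pending-flag scheme; PhysicsWorld, AddBody and RemoveBody are stand-ins for whatever the middleware actually exposes, not its real API. Gameplay code only ever flips the flag, and the potentially blocking middleware calls happen in the Write Phase we control.

```cpp
enum class PendingOp { None, Add, Remove };

struct PhysicsProxy {
    PendingOp pendingOp;
    bool      inPhysicsWorld;  // readers may see this a frame out of date

    PhysicsProxy() : pendingOp(PendingOp::None), inPhysicsWorld(false) {}

    // Callable from anywhere, at any time: no middleware is touched here.
    void RequestAdd()    { pendingOp = PendingOp::Add; }
    void RequestRemove() { pendingOp = PendingOp::Remove; }

    // Called only from the Write Phase, off the critical path.
    template <typename PhysicsWorld>
    void FlushPendingOp(PhysicsWorld& world) {
        switch (pendingOp) {
        case PendingOp::Add:    world.AddBody(this);    inPhysicsWorld = true;  break;
        case PendingOp::Remove: world.RemoveBody(this); inPhysicsWorld = false; break;
        case PendingOp::None:   break;
        }
        pendingOp = PendingOp::None;
    }
};
```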

Takeaway

Many will take the view that these tips have a lot of hidden catches. I have tried to make the point that following them still leaves a lot of gotchas for developers to sort out themselves. Getting lots of parallelism is a massive challenge, involving annoying days at the debugger tracking down that corruption, but I do think these tips can be of use to people. They are the basis for some of my recent work, which involves execution of systems spanning game frames, interacting with vertex/index buffers, renderer properties, physics system attributes, gameplay mechanics and even a simple garbage collector. All without a single lock on data, and it can also run on SPU. For me, these tips are valuable, and I would really appreciate any feedback/comments about them, or your own experiences with preparing for parallelism.


@domipheus.