A few years ago, I started working on a project where, because of various technical decisions, a much larger portion of the team’s programmers would have to write pretty high-performance code. Now, as I’m sure most of you can imagine, this is easier said than done. Most large teams I know work the same way: most programmers write code that works and a few guys write the code that needs to be fast (including fixing other people’s code that needs to be fast and isn’t).
But what happens when suddenly, the portion of the code that needs to be fast grows tenfold? A hundredfold? Do your high-performance guys have the bandwidth to cope with the increase? Do they want to spend 100% of their time optimizing code they didn’t write/design?
What if you throw parallelism into the mix? SPUs? Suddenly, even “writing code that works” just got a lot more complicated, if people don’t know what they’re doing.
Here are some random thoughts I have about the process of injecting a high-performance mindset into the brains of programmers who have typically spent their time thinking about other types of problems.
What most programmers know about performance
They know scale. If they bump the number of raycasts an NPC does in a frame, they know it’s going to be more expensive. They most likely don’t know by how much.
They know about Big-O notation. This is the extent of what most university courses teach about performance. Big-O is good and bad, mostly bad. It’s good because it gives you a reference by which to think about the complexity of what you’re trying to accomplish. It’s bad because no one told them that it wasn’t the whole story.
They know the day-to-day performance do’s and don’ts they will have picked up through osmosis from their colleagues. Stuff like making do with the squared length of a vector if you can.
They know about concurrent access, a.k.a. they know about locks.
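For readers who have not met the squared-length trick mentioned in the list above, here is a minimal sketch. `Vec3`, `lengthSquared` and `isWithinRange` are hypothetical names made up for illustration; they are not taken from any particular engine.

```cpp
#include <cassert>

// Hypothetical minimal vector type, for illustration only.
struct Vec3 {
    float x, y, z;
};

// Squared length: three multiplies and two adds, no square root.
inline float lengthSquared(const Vec3& v) {
    return v.x * v.x + v.y * v.y + v.z * v.z;
}

// Range test that compares squared quantities instead of calling sqrt.
// This is valid because both sides are non-negative, so squaring keeps
// the ordering intact.
inline bool isWithinRange(const Vec3& from, const Vec3& to, float range) {
    assert(range >= 0.0f);  // a negative range would break the comparison
    const Vec3 delta{to.x - from.x, to.y - from.y, to.z - from.z};
    return lengthSquared(delta) <= range * range;
}
```

The point is only that the comparison never needs the actual distance, so the square root can be skipped entirely.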
What most programmers should know about performance and most likely don’t
Memory performance is where it’s at
This is really the heart of the battle when trying to educate someone about high-performance coding. There’s been a lot of discussion recently about Data-Oriented Design, and while it hasn’t exactly been unanimous, most people I see arguing against it are either trying to defend Object-Oriented Programming (which isn’t the point) or to be pragmatic about how not 100% of the code needs it (I believe it’s implicitly understood that any kind of programming requires you to use your brain). You will probably hear the same arguments from your programmers.
However, before DoD can even be discussed, memory performance needs to be understood. It’s quite a broad topic, so we decided to only cover data caches and completely ignored code caching issues. I believe thinking of your code in terms of L2 cache penalties is a damn good first step and you will often get a lot of code cache benefit without realizing it.
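To make the data-cache point a bit more concrete, here is a rough sketch of the same loop over two layouts. The particle types and field names are invented for this example, and how much the second layout actually helps depends on the hardware and the data, so treat it as something to measure rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures: the loop below only wants `y`, but every cache line
// it pulls in is mostly filled with fields it never reads.
struct ParticleAoS {
    float x, y, z;
    float color[4];  // cold data for this loop
    char  name[32];  // cold data for this loop
};

float sumHeightsAoS(const std::vector<ParticleAoS>& particles) {
    float sum = 0.0f;
    for (const ParticleAoS& p : particles) {
        sum += p.y;
    }
    return sum;
}

// Structure-of-arrays: all the `y` values sit next to each other, so each
// cache line fetched is full of data the loop actually uses.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    // Colors, names and other cold data would live in their own arrays.
};

float sumHeightsSoA(const ParticlesSoA& particles) {
    assert(particles.x.size() == particles.y.size() &&
           particles.y.size() == particles.z.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < particles.y.size(); ++i) {
        sum += particles.y[i];
    }
    return sum;
}
```

Both functions compute the same result; only the amount of memory traffic needed to get there differs.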
This is also the point where you’ll be going against Big-O. Sorting speed, insertion speed, search speed. What you will show to be fast will go against Big-O logic quite often.
A good example of this is found here.
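One way to let people see the Big-O mismatch for themselves is a small harness like the sketch below. It compares an O(n) scan over contiguous memory with an O(log n) lookup in a node-based tree. All names are hypothetical, and no result is claimed here: which one wins depends on the collection size and the machine, which is exactly what the exercise is meant to show.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <set>
#include <vector>

// O(n) lookup, but over contiguous memory that the cache handles well.
bool containsLinear(const std::vector<int>& values, int key) {
    return std::find(values.begin(), values.end(), key) != values.end();
}

// O(log n) lookup, but each step follows a pointer to a separately
// allocated node, which can mean a cache miss per step.
bool containsTree(const std::set<int>& values, int key) {
    return values.find(key) != values.end();
}

// Runs `lookup` once per key and returns the elapsed time in microseconds.
template <typename Lookup>
long long timeLookupsMicros(const std::vector<int>& keys, Lookup lookup) {
    assert(!keys.empty());
    const auto start = std::chrono::steady_clock::now();
    int hits = 0;
    for (int key : keys) {
        hits += lookup(key) ? 1 : 0;
    }
    const auto stop = std::chrono::steady_clock::now();
    // Keep `hits` observable so the loop is not optimised away entirely.
    volatile int sink = hits;
    (void)sink;
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Running both lookups over the same keys at a few different collection sizes, in an optimised build, is usually enough to start the conversation.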
Load-hit-stores a.k.a. LHSWTFBBQ
A big source of performance problems, especially for crappy in-order PowerPC CPUs. Most programmers have never heard of LHSs or of how code can cause them. Again, this can be an extremely complex topic, so we chose to ignore LHSs that are caused by general instruction flow and instead focused on LHSs caused by explicit or implicit casts that force your data to move to a different register set through memory.
Explicit ones are easy to catch once you know what not to do. Don’t cast between float, int or vector types unless you absolutely have to. And when you have to, try to not *need* that data right away.
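A sketch of what "not needing the data right away" can look like for a float-to-int conversion: do all the conversions in one pass, then use the results in a second pass. The function names are made up, and the per-call allocations are there only to keep the example short; real code would reuse a scratch buffer. Whether this helps at all depends on the target CPU and compiler, so it is a pattern to try and profile, not a rule.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert-and-use: each iteration needs the integer index the moment the
// float-to-int conversion produces it.
std::vector<float> gatherImmediate(const std::vector<float>& table,
                                   const std::vector<float>& coords) {
    std::vector<float> out(coords.size());
    for (std::size_t i = 0; i < coords.size(); ++i) {
        const int index = static_cast<int>(coords[i]);  // float -> int
        assert(index >= 0 && static_cast<std::size_t>(index) < table.size());
        out[i] = table[static_cast<std::size_t>(index)];  // used immediately
    }
    return out;
}

// Batched: convert everything first, then consume the indices later, so the
// converted values are not needed right after they are produced.
std::vector<float> gatherBatched(const std::vector<float>& table,
                                 const std::vector<float>& coords) {
    std::vector<int> indices(coords.size());
    for (std::size_t i = 0; i < coords.size(); ++i) {
        indices[i] = static_cast<int>(coords[i]);  // float -> int, result stored
    }
    std::vector<float> out(coords.size());
    for (std::size_t i = 0; i < coords.size(); ++i) {
        assert(indices[i] >= 0 &&
               static_cast<std::size_t>(indices[i]) < table.size());
        out[i] = table[static_cast<std::size_t>(indices[i])];
    }
    return out;
}
```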
Implicit ones are harder, e.g.
// insert code that results in myVector being in a VMX register
myVector(2) = 0.f; // flatten vector by poking a float right in the middle of it
// do more VMX stuff with myVector
I used to see this everywhere in our codebase (and others)! Argh. As it turns out, however, there was also no way to do this efficiently with our vector class unless you knew how to do it manually (e.g. using our select intrinsics). So we needed a bit of education on why not to do it, and we also needed to add a few functions to our vector class so people could easily do it the right way.
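The kind of helper meant here might have roughly the shape sketched below. This is not the actual vector class from the project: `Vec4`, `select` and `withZeroedComponent` are hypothetical, and the lane-by-lane loop is only a portable scalar stand-in for what would be a single hardware select instruction operating on vector registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 4-wide vector type. A real one would wrap a hardware vector
// register rather than a plain float array.
struct Vec4 {
    float lane[4];
};

// Lane-wise select: takes the lane from `b` where the mask is set and from
// `a` elsewhere. Scalar stand-in for a hardware select instruction.
inline Vec4 select(const Vec4& a, const Vec4& b, const std::array<bool, 4>& takeB) {
    Vec4 result;
    for (std::size_t i = 0; i < 4; ++i) {
        result.lane[i] = takeB[i] ? b.lane[i] : a.lane[i];
    }
    return result;
}

// Returns a copy of `v` with one component zeroed. Expressed as a select
// against a zero vector, so on vector hardware nothing has to be poked into
// the vector as an individual float.
inline Vec4 withZeroedComponent(const Vec4& v, std::size_t component) {
    assert(component < 4);
    const Vec4 zero{{0.0f, 0.0f, 0.0f, 0.0f}};
    std::array<bool, 4> mask{};  // in real code this would be a constant mask
    mask[component] = true;
    return select(v, zero, mask);
}
```

With something like this available, "flatten the vector" becomes a call such as `withZeroedComponent(myVector, 2)` instead of a write to a single element.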
Good resources to pass around: avoiding LHS using the restrict keyword.
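As a rough illustration of the idea, assuming a compiler that accepts the non-standard `__restrict` extension (GCC, Clang and MSVC do): without the no-alias promise, the compiler has to assume that `values` might overlap `*total`, so it stores and reloads `*total` on every iteration. The function names are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// `__restrict` is a compiler extension, not standard C++. It is the caller's
// promise that `total` and `values` never point at overlapping memory, which
// allows the compiler to keep the running total in a register.
void accumulateRestrict(float* __restrict total,
                        const float* __restrict values,
                        std::size_t count) {
    assert(total != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        *total += values[i];
    }
}

// Portable alternative: accumulate into a local and write back once, so
// there is no store-then-reload of `*total` inside the loop at all.
void accumulateLocal(float* total, const float* values, std::size_t count) {
    assert(total != nullptr);
    float sum = *total;
    for (std::size_t i = 0; i < count; ++i) {
        sum += values[i];
    }
    *total = sum;
}
```

Breaking the promise made with `__restrict` is undefined behaviour, so the local-accumulator version is often the safer thing to teach first.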
How to do concurrent access right
Usually, people will know of locks. They will also have heard the engine guys bitch about them, but they don’t know any better really.
The first thing we did is cover parallelism in general. How it works, how data works, good practices and how awfully it’s going to blow up in their face if they don’t follow them.
Then we showed them two things that can help avoid locks: atomic operations and lockless containers. That’s it. Atomics seem a little like black magic at first, but people get it after using them a few times. Lockless containers are useful in their own right, and showing how they are implemented also makes a good tutorial on concurrent access done right.
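As an example of the kind of container that works well for this sort of walkthrough, here is a minimal single-producer, single-consumer ring buffer built on `std::atomic`. It is a teaching sketch, not the container used on the project, and it is only correct when exactly one thread pushes and exactly one thread pops.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Minimal lock-free queue for ONE producer thread and ONE consumer thread.
// One slot is always left empty to tell "full" apart from "empty", so it
// holds at most Capacity - 1 items.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "one slot is always left empty");

public:
    SpscQueue() {
        // If the atomics fall back to locks, the queue is not lockless.
        assert(head_.is_lock_free() && tail_.is_lock_free());
    }

    // Producer side. Returns false if the queue is full.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        buffer_[tail] = value;
        // Release: the item write above becomes visible before the new tail.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false if the queue is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = buffer_[head];
        // Release: the slot is handed back only after it has been read.
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T buffer_[Capacity];
    std::atomic<std::size_t> head_{0};  // written only by the consumer
    std::atomic<std::size_t> tail_{0};  // written only by the producer
};
```

Each index has exactly one writer, which is why no compare-and-swap is needed; explaining why that stops being true with two producers is a useful follow-up exercise.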
Tools that help
Most programmers will know how to use the in-game profiler. Some will know how to use the in-house, offline version of the tool. Almost none can use third-party applications like PIX.
You must teach them.
Honestly, this part is easy if you just go with PIX (I wouldn’t recommend starting with anything else). People get the basic stuff right away. It gives them examples of what they are doing wrong right now.
Important! Make sure they have an easy way to get all the code they want to profile on the right core (i.e. the one PIX is profiling)!
Ah yes, SPUs. This is the part I feared the most. There’s a lot to cover just to get to “Hello World”. DMA, non-unified memory architecture, different types of CPUs, multiple executables at once. You will get a lot of blank stares. Not to mention the brick wall that is ProDG when it comes to SPU debugging.
This is the hardest step. It’s the one your programmers will dread. It’s also the one they will be the most enthusiastic about once they have their first successes.
It is also the step where they will really start to “get it”. When they use a DMA macro, they feel the memory access penalty. They see how much DMA they have to add for their crappy pointer chasing code. This is where putting effort into DoD starts to make sense to them.
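The effect can be imitated on any machine with a toy model. In the sketch below, `FakeDma` is a made-up stand-in for a DMA "get" (it is just a `memcpy` plus a counter, nothing like a real SPU API), and all data has to be copied into a small local buffer before it can be touched. Counting the transfers shows why contiguous data is cheap and pointer chasing is not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a DMA "get": copies from main memory into a
// local buffer and counts how many transfers were issued.
struct FakeDma {
    std::size_t transfers = 0;

    void get(void* local, const void* mainMemory, std::size_t bytes) {
        assert(local != nullptr && mainMemory != nullptr);
        std::memcpy(local, mainMemory, bytes);
        ++transfers;
    }
};

struct Node {
    float value;
    const Node* next;
};

// Contiguous data: one transfer brings in a whole chunk of useful values.
float sumArray(FakeDma& dma, const float* data, std::size_t count) {
    constexpr std::size_t kChunk = 64;  // arbitrary local buffer size
    float local[kChunk];
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; i += kChunk) {
        const std::size_t n = (count - i < kChunk) ? (count - i) : kChunk;
        dma.get(local, data + i, n * sizeof(float));
        for (std::size_t j = 0; j < n; ++j) {
            sum += local[j];
        }
    }
    return sum;
}

// Pointer chasing: every node needs its own transfer, and the next address
// is only known once the current transfer has finished.
float sumList(FakeDma& dma, const Node* head) {
    float sum = 0.0f;
    Node local{0.0f, nullptr};
    for (const Node* p = head; p != nullptr; p = local.next) {
        dma.get(&local, p, sizeof(Node));
        sum += local.value;
    }
    return sum;
}
```

In this model, summing 130 floats from an array takes 3 transfers, while summing the same values from a linked list takes 130.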
Final words (aka my two cents)
My last point about SPU programming hinted at it, but to be clear, your programmers will need to write code for all of this to sink in. Very early in the process, one of the programmers learning all this asked me how I knew about it. The truth is I had no idea. I guess I know because at some point I *needed* to know. Performance is at the center of some fields more than others and graphics happens to be one of those fields. It wasn’t that I was a better programmer, I’d just had more performance problems than most that I needed to solve and someone (or Google or hardware docs or something) was there to teach me. You have to create that need in your programmers (necessity is the mother of invention, they say).
It will be hard. It will be slower than if someone more experienced did it. Don’t handhold them. Help them if they ask questions and if they fall flat on their faces, but let them do all the work. They will come see you multiple times a day in the first week, and it will be time-consuming. The second week a little less, and less again the week after, until, more quickly than you probably thought possible, they will rarely need your help. Maybe for something new, something pretty tricky or maybe just to talk design. Doesn’t that sound cool?
I’d be really happy to hear what other developers have to say on this topic. I don’t feel we’re anywhere near 100% successful and I would love to hear what other people are trying, what works and what doesn’t. Please feel free to discuss in the comments; that is the whole point of this article. I’m also open to questions regarding anything that wasn’t clear or wasn’t discussed.