So the recent posts on constants by Jonathan Adamczewski inspired me to share some of my experience on the subject of constants. Specifically getting floating point constants into SIMD registers.
Now I know we should strive for as few constants and magic values in our code (and whether they are defines, static const variables, or just naked numbers they are all trouble). But they are not going away and they have their place. Since making the switch from graphics to AI programming the number of constants that appear in my code has climbed by an order of magnitude and so this subject is certainly more important to me that it used to be. And what makes it important is performance, which for many of us really means console performance.
If you’re concerned about performance on the consoles you’re going to be doing as much math as you can in SIMD and passing values between the SIMD unit and the FPU as little as possible. Why, you might ask? The dreaded Load-Hit-Store! Every time you want to transfer a number from a floating point register to the vector unit you need to write it out to memory and read it back in. This will kill your performance if you’re littering your algorithms with a mix of SIMD and floating point calculations that depend on each other. So generally speaking you want to be doing all your calculations in one unit or the other. Assuming you are doing most of your math on the vector unit how do you get those numbers over there? Well if you already have the value in the FPU you obviously just need to bite the bullet and pay the cost to move it over.
But what if it’s in memory? For floating point constants in memory (basically any naked floating point value or static const variable in your code) you need to first load it into the FPU and then pay an LHS to transfer it to the SIMD registers. Fugly!
It would be much better if we could load it directly into the SIMD unit – and you can! You just need to define the constant such that it is a type that the SIMD system understands – basically a 128 bit aligned value that it can load directly into the register in question. It probably doesn’t even have to be aligned if you’re clever about how you load it. Now of course you pay some a price in memory used if you have a lot of static const variables around, and there is a risk of an occasional unfortunate cache miss. But generally speaking it a decent practice to just load the value from memory if you need a very specific full float precision number.
But wouldn’t you like to create them in place? We’re used to only being able to construct “special” floating point values in code – basically one or zero. But there are SIMD instructions on the PowerPC in your 360 and PS3 that can make other numbers too!
Now unfortunately we still can’t construct arbitrary floating point values, but we can compute values using the following function
value = p * 2 ^ -q
Where p and q are immediate integers such that -16<=p<=15 and 0<=q<32. So if you want floating point 2 you use p=2 and q=0, if you want 0.5 you use p=1 and q=1.
If you’re interested in the specifics the assembly instruction in question is vec_ctf
So this gives us a bunch more "special values" (rather than the regular 0 and 1 available in the FPU instruction set). But what if I wanted 0.75? I can’t compute that using the formula above. But I can just add 0.5 and 0.25 (p=1, q=2) together to get my value. I could even just write a #define to do that for me. And adding a few of these "immediate floats" together is still far faster than loading the from memory or even worse loading them to the FPU and then transferring them to the SIMD registers.
But what about arbitrary constants? 0.9 for example.
The nice thing about these kinds of constants is that the acceptable error on these numbers is relatively large. If you’re using a 0.9 constant you probably mean 0.9ish. Now I can’t think of any two numbers of the form described above that add up to 0.9. But my computer certainly can! Basically you can write a simple little computer program that generates all possible combinations of a certain number of these special values and picks the best ones that when summed come closest to the supplied constant. For most constants I’ve tried two terms is enough. For instance 0.9 is (p=7,q=3) + (p=13,q=9). This results in an error of 0.000391. If this is acceptable then you win! But sometimes two terms isn’t enough.
Brute forcing three terms is near instantaneous and is generally sufficient for the vast majority of constants you need. If you need four terms you can run the brute force algorithm over night and check the value in the morning. Or if you don’t want to wait or you need 5 terms because 4 left some error then you’ll need a faster algorithm (it should be noted that I’ve yet to find a number that required 5 terms).
One faster method you can use is to first get as close as you can with one term and then use the difference from your goal as the new goal for your next term. This algorithm has complexity linear in the number of terms and will generally give you a perfect result in 5 terms. But many would start to doubt whether a 5 term number is faster to construct on average than just loading the constant from memory.
One of the few constants I’ve felt the need to have at full float precision is PI (though having a PI_APPROX of two terms is sufficient much of the time). Computing it first with the greedy efficient algorithm required 5 terms for full precision
(13, 2) + (-7, 6) + (8, 13) + (-9, 20) + (-11, 25) = 3.14159265
However the brute force method yielded a 4 term solution
(3, 0) + (5, 5) + (-9, 20) + (-15, 10) = 3.14159265