[vect-i-kit, -ket]
1. the code of ethical behavior regarding professional practice or action among programmers in their dealings with vector hardware: vector etiquette.
2. a prescribed or accepted code of usage in matters of SIMD programming, or a set of formal rules observed by programmers that actually care about performance.
3. How not to be a total douche and disrespect your friend the vector processor, who wants so badly to make your game faster.

A few months back I was talking to a friend who was doing his B.S. in Computer Science at a respectable school.  The conversation happened to drift towards SIMD.  My jaw hit the floor when he told me he had no idea what that was.  I gave him a basic rundown and you could see the excitement in his eyes.  I mean how could you not get excited? Even explained in the most basic oversimplified terms, the concept of doing 2, 4, 8, or 16 things at once instead of doing just one is universally appealing.  He went off all full of excitement and hope, ready to do some vector programming.  He came back one week later, beaten, dejected, and confused.  His vector code was orders of magnitude slower than his scalar code.

The problem isn’t just with college students.  We often get programming tests from professionals that are quite good in many ways, but that show a complete lack of understanding of good vector programming practices.  Mistakes tend to fall into two categories: lack of general vector knowledge, and assuming that what works on one CPU is also best practice on a different CPU.

While there are certain good guidelines to follow, be aware that different things carry different penalties on various CPUs, and the only way to write correct code is to know the details of the hardware you are targeting, and the compiler you are using.  I know they probably told you in school not to code for quirks in a specific compiler, but by not doing so you miss out on tremendous opportunities (see techniques for splitting basic blocks in gcc, and rearranging conditionals to take advantage of gcc’s forward and backward branch prediction assumptions, as simple examples).

OK, let’s get started on the journey to efficiency! One of the biggest offenders in slow vector code is moving data in and out of vectors too much.  Often people calculate something into a bunch of scalar floats, move them into a vector, do a vector add, and then extract back to scalars.  What these people don’t realize is that moving in and out of vectors is rarely free, and is often one of the most expensive things you can do.  The main problem lies in the fact that on most systems float registers and vector registers are two completely separate register sets.  More specifically, the problem is that to get from float registers to vector registers, you often first have to store four floats out to memory, and then read them back into a vector register.  After your vector operation, you have to reverse the process to get the values back into scalars.  You basically took what could have been 4 consecutively issued adds (assuming your CPU/FPU has pipelined float operations, or a non-IEEE-compatible mode) and turned it into 4 scalar stores, 1 vector load, 1 vector add, 1 vector store, 4 scalar loads, and who knows how many stalls/hazards!  As Steven Tovey rightly pointed out, if the alignment of the vector is bad, the number of vector loads could be 2, plus a bunch of permutes and permute-mask generation instructions.  Awesome!  As a general rule, you don’t want to mix scalar and vector calculations, and if you do, make damn sure that you aren’t just doing one or two vector operations.  You have to do enough in vectorland to justify the cost of getting in and out of the registers.
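To make the round trip concrete, here is a rough sketch of the two shapes of code.  It assumes GCC/Clang vector extensions, and the names vec4, add_round_trip, and integrate are made up for illustration.  How expensive the first function really is depends entirely on your CPU and compiler, so treat it as a picture of the pattern, not a benchmark.

```cpp
#include <cassert>

// Assumption: GCC/Clang vector extensions. "vec4" is a name invented for this sketch.
typedef float vec4 __attribute__((vector_size(16)));

// The pattern to avoid: gather four scalars into vectors, do ONE vector add,
// then scatter the result back out to scalars. On hardware with separate float
// and vector register files, the gather and the scatter may each go through memory.
inline void add_round_trip(float a0, float a1, float a2, float a3,
                           float b0, float b1, float b2, float b3,
                           float out[4]) {
    const vec4 a = {a0, a1, a2, a3};   // scalars -> vector
    const vec4 b = {b0, b1, b2, b3};   // scalars -> vector
    const vec4 sum = a + b;            // the only real vector work
    out[0] = sum[0];                   // vector -> scalars
    out[1] = sum[1];
    out[2] = sum[2];
    out[3] = sum[3];
}

// The friendlier shape: data arrives as vectors, stays in vectors for
// several operations, and leaves as a vector.
inline vec4 integrate(vec4 position, vec4 velocity, vec4 acceleration, vec4 dt) {
    const vec4 new_velocity = velocity + acceleration * dt;
    return position + new_velocity * dt;
}
```

The second function never touches a scalar, so there is nothing to move between register sets.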

Even if you are on a platform like NEON where the vector registers and float registers alias each other, you still have to be careful.  On NEON, switching between scalar and vector mode requires a pipeline flush on newer Cortexes, and that can be semi-costly.  The problem here is almost the opposite of what we described before: instead of moving things in and out of vector registers and calling only vector instructions, you are keeping things in the same registers but mixing scalar and vector instructions.  If you are going from general purpose registers to NEON, it’s just as bad.  Unlike the PS3’s PPU, which needs to go through memory, ARM<–>NEON actually has forwarding paths between register sets, but there is still an asymmetrical cost associated with the transfer.  It’s just something to think about when you think you have a free pass to mix scalar and vector code.

Building whole vectors isn’t the only way to screw yourself.  Unfortunately, one of the most common things people do with vectors often results in horrific performance!  Take a look at this:

// this makes baby altivec cry
some_vec.w = some_float;

See what I did there?  We are inserting a non-literal stored in a float register into a vector.  I don’t mean to sound idealistic, but if you are wrapping built-in vector types in structs, I think it’s best not to define functions for inserting/extracting scalars (depending on the CPU).  If they are there, people will use them.  The least you could do is name them something horrific like:

inline void by_using_this_insert_x_function_I_the_undersigned_state_that_i_know_and_understand_the_costs_associated_with_said_action_and_take_full_responsibility_for_the_crappy_code_that_will_undoubtedly_result_from_my_selfishness_and_reckless_disregard_for_good_code( float x );

There, that ought to teach em a lesson!

There is a clever way to get around some of the above hassles, and it’s lovingly referred to as “float in vector.”  The concept is simple enough.  Instead of using floats all over the place, you make a struct that acts like a float, but internally is a vector.  This lets you write code that looks like it’s a mix of vector and scalar, but it actually lives entirely in vector registers.  While some_vec * some_float could be disastrous in some cases, if some_float is secretly a vector, this will compile to a single vector multiply.  Hot tip: duplicate your scalar to all lanes of the float in vec’s internal vector, because it allows code like the previous example to work unaltered.
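A bare-bones float in vector might look something like the sketch below.  It assumes GCC/Clang vector extensions, and the vec4, FloatInVec, and Vector4 definitions here are minimal stand-ins written for this example, not any particular library’s types.

```cpp
#include <cassert>

// Assumption: GCC/Clang vector extensions. All types here are illustrative stand-ins.
typedef float vec4 __attribute__((vector_size(16)));

// Acts like a float, but lives in a vector register with the value in all lanes.
struct FloatInVec {
    vec4 v;

    // Entering vectorland from a real float: do this once, up front, not per operation.
    explicit FloatInVec(float s) {
        const vec4 splat = {s, s, s, s};
        v = splat;
    }
    // Wrap a vector that already has the same value in every lane.
    explicit FloatInVec(vec4 splatted) : v(splatted) {}

    // Leaving vectorland: this is the expensive direction, so use it sparingly.
    float as_float() const { return v[0]; }
};

struct Vector4 {
    vec4 v;
    explicit Vector4(vec4 lanes) : v(lanes) {}
};

// Because the "scalar" is splatted across all lanes, vector * scalar is
// just one lane-wise vector multiply.
inline Vector4 operator*(Vector4 a, FloatInVec s) { return Vector4(a.v * s.v); }

// "Scalar" math stays in vector registers too.
inline FloatInVec operator*(FloatInVec a, FloatInVec b) { return FloatInVec(a.v * b.v); }
inline FloatInVec operator+(FloatInVec a, FloatInVec b) { return FloatInVec(a.v + b.v); }
```

With this in place, an expression like some_vec * (a * b + c) never leaves the vector unit as long as a, b, and c are FloatInVecs.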

One last thing I want to quickly mention before moving on to code writing tricks.  Aside from the PS2 VUs, most vector units don’t have cross vector math operations (very useful for dot products).  Therefore while code like vec.x * vec.x + vec.y * vec.y + vec.z * vec.z can technically be done completely in vector registers, it takes a lot more work to move stuff around.  For a way around this, see point 7 below.

Giving GCC What It Wants

Another important point is to understand the personality of the compiler you are using.  Don’t take the attitude that the compiler should do something for you.  As a programmer, it is your job to help out the compiler as much as possible (best case) and not make the compiler’s job harder (worst case).  So, what does good vector code look like on GCC?  The list below is in no way exhaustive, but it contains a few semi-useful tips that can make a big difference.  I’ll try reeeeeally hard to keep each item brief so that it serves as a good introduction, but if you want more details feel free to ask me (or google).

1) If possible, use lots of const temporaries.  Storing the results of vector operations in lots of const temporaries helps GCC track the lifetime of things in more complex code, and therefore helps the compiler keep stuff in registers.
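As a small sketch of what that style looks like (assuming GCC/Clang vector extensions; vec4 and both function names are invented for the example), compare one reused mutable variable against one const temporary per step.  Whether the generated code actually differs depends on your compiler and version, so check the disassembly.

```cpp
#include <cassert>

// Assumption: GCC/Clang vector extensions. Names are invented for this sketch.
typedef float vec4 __attribute__((vector_size(16)));

// One mutable variable reused for every step.
inline vec4 lerp_mutating(vec4 a, vec4 b, vec4 t) {
    vec4 tmp = b;
    tmp -= a;
    tmp *= t;
    tmp += a;
    return tmp;
}

// One const temporary per step: each value is written once and has an
// obvious lifetime.
inline vec4 lerp_const_temps(vec4 a, vec4 b, vec4 t) {
    const vec4 delta  = b - a;
    const vec4 scaled = delta * t;
    const vec4 result = a + scaled;
    return result;
}
```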

2) If a type fits in a register, pass it by value.  DO NOT PASS VECTOR TYPES BY REFERENCE, ESPECIALLY CONST REFERENCE.  If the function ends up getting inlined, GCC occasionally will go to memory when it hits the reference.  I’ll say it again: if the type you are using fits in registers (float, int, or vector), do not pass it to the function by anything but value.  In the case of non-sane compilers like Visual Studio for x86, it can’t maintain the alignment of objects on the stack, and therefore objects that have align directives must be passed to functions by reference.  This may be fixed on the Xbox 360.  If you are multiplatform, the best thing you can do is make a parameter passing typedef to avoid having to cater to the lowest common denominator.
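A parameter passing typedef can be as simple as the sketch below.  The Vector4 here is a plain aligned struct standing in for whatever wrapper you really use, Vector4Param is a name I made up, and the preprocessor check is an assumption about how you would detect 32-bit Visual Studio; adjust it for your own platforms.

```cpp
#include <cassert>

// Stand-in for a real wrapper around the native vector type.
struct alignas(16) Vector4 {
    float x, y, z, w;
};

#if defined(_MSC_VER) && defined(_M_IX86)
// 32-bit Visual Studio: aligned types cannot be passed by value.
typedef const Vector4& Vector4Param;
#else
// Everywhere else: pass by value.
typedef const Vector4 Vector4Param;
#endif

// Function signatures use the typedef, so each platform gets its preferred
// calling style without changing any call sites.
inline Vector4 Add(Vector4Param a, Vector4Param b) {
    return Vector4{a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}
```

Call sites look the same everywhere: Add(a, b).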

3) On a related note, always prefer returning a value to returning something by reference.  For example:

// bad
void Add(Vector4 a, Vector4 b, Vector4& result);

// good
Vector4 Add(Vector4 a, Vector4 b);

The functions above are standalone (non-member) functions, but this applies to member functions as well.  Remember that this is a very C/C++ thing.  If you are writing in a nutso language like C#, it can be over 40x faster to return by reference because of the compiler’s inability to optimize simple struct constructors and copies.

4) When wrapping vector stuff in a struct, make as many member functions const as possible.  Avoid modifying this as much as you can.  For example:

// bad, it sets a member in this
void X(FloatInVec scalar);

// good, it creates a temporary vec and returns it in registers
Vector4 X(FloatInVec scalar) const;

Not only does this help out the compiler, but it also allows you to chain stuff in longer expressions.  For example, some_vec.w(some_val).normalize().your_mom();
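Here is a hedged sketch of const member functions that return new values and therefore chain.  It assumes GCC/Clang vector extensions; with_w and scaled are hypothetical names, and the multiply-and-add blend in with_w is just one portable way to replace a lane without leaving vector registers.  It assumes finite values, and real code would more likely use the platform’s select or permute instruction.

```cpp
#include <cassert>

// Assumption: GCC/Clang vector extensions. All names are invented for this sketch.
typedef float vec4 __attribute__((vector_size(16)));

struct FloatInVec {
    vec4 v;  // same value in all four lanes
    explicit FloatInVec(float s) {
        const vec4 splat = {s, s, s, s};
        v = splat;
    }
};

struct Vector4 {
    vec4 v;
    explicit Vector4(vec4 lanes) : v(lanes) {}

    // Const: returns a copy with w replaced instead of writing to *this.
    // The blend is done with lane masks, so nothing leaves vector registers.
    // Caveat: a multiply-by-zero blend breaks on inf/NaN lanes.
    Vector4 with_w(FloatInVec s) const {
        const vec4 keep_xyz = {1.0f, 1.0f, 1.0f, 0.0f};
        const vec4 take_w   = {0.0f, 0.0f, 0.0f, 1.0f};
        return Vector4(v * keep_xyz + s.v * take_w);
    }

    // Const: returns a scaled copy.
    Vector4 scaled(FloatInVec s) const { return Vector4(v * s.v); }
};
```

Because every call returns a fresh Vector4 by value, something like vec.with_w(one).scaled(two) reads left to right and never modifies vec.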

5) For math operations on built-in vector types, using intrinsics is not always the same as using operators.  Let’s say you have two vectors.  There are two ways to add them:

vec_float4 a;
vec_float4 b;
vec_float4 c = a + b;
vec_float4 d = spu_add(a, b);  // I like si intrinsics better but…

Which is better greatly depends on the compiler you are using and the version.  For example, in older versions of GCC, using functions instead of operators meant that the compiler wasn’t able to do mathematical expression simplification.  It had semantic information about the operators that it didn’t have for the intrinsics.  However, I have heard from a few compiler guys that I should avoid using operators because most of the optimization work has gone into intrinsics, since that is the most used path.  Not sure if this is still true, but it’s definitely worth knowing that the two paths aren’t necessarily equal and you should look out for what your compiler does in different situations.

6) When not writing directly in assembly, separate loads from calculations.  It’s often a good idea to load all the data you need into vector registers before using the data in actual calculations.  You may even want to include a basic block splitter between the loads and calculations.  This can help scheduling in a few ways.
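In source form, separating loads from calculations looks roughly like this.  The sketch assumes GCC/Clang vector extensions and 16-byte aligned arrays of exactly four vectors; scale_and_offset4 is a made-up function, and the basic block splitter is left out because its exact form is compiler-specific.

```cpp
#include <cassert>

// Assumption: GCC/Clang vector extensions. Names are invented for this sketch.
typedef float vec4 __attribute__((vector_size(16)));

// Assumes "in" and "out" each point at four 16-byte aligned vectors.
inline void scale_and_offset4(const vec4* in, vec4 scale, vec4 offset, vec4* out) {
    // 1. All loads up front.
    const vec4 a0 = in[0];
    const vec4 a1 = in[1];
    const vec4 a2 = in[2];
    const vec4 a3 = in[3];

    // 2. All calculations, working only on values already in registers.
    const vec4 r0 = a0 * scale + offset;
    const vec4 r1 = a1 * scale + offset;
    const vec4 r2 = a2 * scale + offset;
    const vec4 r3 = a3 * scale + offset;

    // 3. All stores at the end.
    out[0] = r0;
    out[1] = r1;
    out[2] = r2;
    out[3] = r3;
}
```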

7) Depending on what you plan to do with your data, consider using SoA (structure of arrays) instead of AoS (array of structures).  I won’t go too far into the details of SoA, but it basically boils down to having 4 vectors containing {x0, x1, x2, x3}, {y0, y1, y2, y3}, {z0, z1, z2, z3}, {w0, w1, w2, w3} instead of the more “traditional” {x, y, z, w}.  There are a few reasons for using this.  First of all, if the code you are writing looks and feels something like this:

FloatInVec dist = vec.x * vec.x + vec.y * vec.y + vec.z * vec.z;

it can be a bit of a pain to do when your vectors are in {x, y, z, w} form.  There is a lot of painful shifting and moving things around, and a lot of stalls because you can’t add the x, y, and z products until you line them up.  Now let’s look at this as SoA:

Vector4 x_vals, y_vals, z_vals;
Vector4 distances = x_vals * x_vals + y_vals * y_vals …
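For a version you can actually compile, here is a sketch of the same idea using GCC/Clang vector extensions.  The vec4 typedef and the length_squared4 name are my own, and it computes the squared lengths of four points at once using the x*x + y*y + z*z formula from the AoS example.

```cpp
#include <cassert>

// Assumption: GCC/Clang vector extensions. Names are invented for this sketch.
typedef float vec4 __attribute__((vector_size(16)));

// SoA inputs: xs = {x0, x1, x2, x3}, ys = {y0, y1, y2, y3}, zs = {z0, z1, z2, z3}.
// Returns {x0*x0 + y0*y0 + z0*z0, ..., x3*x3 + y3*y3 + z3*z3}.
// Every operation is lane-wise, so no shuffles are needed to line things up.
inline vec4 length_squared4(vec4 xs, vec4 ys, vec4 zs) {
    const vec4 xx = xs * xs;
    const vec4 yy = ys * ys;
    const vec4 zz = zs * zs;
    return xx + yy + zz;
}
```

Three multiplies and two adds produce four results, and all four lanes do useful work.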

image from slide 49 of Steven Tovey’s excellent presentation
for all your stupid SPU tricks, especially if you are doing 2D stuff