Jaymin Kessler (Q-Games)

http://6cycles.maisonikkoku.com

@okonomiyonda

When we last left off two articles ago ( vectiquette, or alternatively “What GCC Wants” ) I was just about to get around to talking about dual issue pipelines, and things you can do to up that all important indicator of your worth as a human: the dual-issue rate. Since then I have been thinking a bit about some of the other little things you can do that help out. I figured it would be fun to look at some actual examples and see the effects in the assembly.

This is by no means an exhaustive list. Quite the opposite, this is going to have to be relatively short as I am off to GDC soon. As a result some of it may be disjointed, out of place, misspelled, grammatically weird, jumping all over the place, nonsensical, or just weird. Sorry.

(note to readers: I will try to point out things that can be different among various CPUs where applicable, but I will mainly be talking about one CPU in particular.)

**Now Here’s a Funky Introduction**

As you can probably guess, dual-issue is all about issuing two instructions in the same cycle. It is also referred to as superscalar issue, where scalar comes from the Latin word meaning “to suck” and super probably means “to do something less”? Dual-issue CPUs have been around since the Crays of the 1960s, and the technique is fairly common in most modern CPUs these days.

(very very very simplified SPU diagram that doesn’t show pipeline stage details (borrowed from IBM))

On the SPUs, 1-word instructions are fetched in groups of 32 at a time (128 bytes), eventually making their way to the instruction line buffer *. From the instruction line buffer, the instruction issue unit fetches and decodes a doubleword aligned pair of instructions. This introduces some limitations. You can’t just dual-issue any two consecutive instructions starting at an arbitrary address. Because of the way the instructions are fetched in pairs two at a time from a doubleword boundary, you can only dual issue if the first instruction in the pair is an even pipe instruction and the second is odd (note: loads and stores have to wait for free LS cycles).

Don’t freak if you’ve never heard the terms odd and even instructions. On the SPUs, all instructions have a type, and each type usually has to be routed to one of seven execution units via a specific pipeline, although I think some chips are looser with this restriction. These pipelines are usually called something like Odd and Even, A and B, 1 and 2, Fry and Laurie, or something like that. In the diagram above you see that permutations, loads/stores, channel instructions, and branches are odd pipeline instructions, while everything else is even.

Finally, the instructions can’t have dependencies on each other’s results. This makes sense: if the second instruction needs the first one’s result, that result would have to be written back to a register (or handled by some fancy result-forwarding thingy) before the second instruction fetches its input operands, so dual-issuing the pair would give you garbage.

Note that dual-issue doesn’t guarantee simultaneously issued instructions will finish at the same time.
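The pairing rules above are easy to capture as a toy software model. This is NOT the real issue logic, just a sketch under the stated simplifications (the struct layout and function name are mine): a doubleword-aligned pair dual-issues only if slot 0 holds an even-pipe instruction, slot 1 holds an odd-pipe one, and the second doesn’t read the first one’s destination.

```c
#include <assert.h>

// Toy model of the SPU pairing rule (hypothetical types, not real hardware
// state). pipe is 'E' (even) or 'O' (odd); dst/src0/src1 are register numbers.
typedef struct { char pipe; int dst, src0, src1; } Insn;

// A doubleword-aligned pair can dual-issue only if the first instruction is
// even-pipe, the second is odd-pipe, and there is no RAW dependency.
static int can_dual_issue(Insn first, Insn second) {
    if (first.pipe != 'E' || second.pipe != 'O') return 0;  // wrong pipes/order
    if (second.src0 == first.dst || second.src1 == first.dst) return 0;  // RAW hazard
    return 1;
}
```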

* cool but rarely useful: 3.5 fetched lines are stored in the instruction line buffer. One line is for the software managed branch target buffer, two lines are used for inline prefetching, and that 0.5 line holds instructions while they are sequenced into the issue logic. I didn’t actually know that before. I accidentally found it while looking for articles to verify that what I thought I knew about instruction issue wasn’t total lies. Pretty cool, right?

**Doppelgangers**

Let’s get started by looking at an SPU relocatable in a static pipeline analyzer.

The left side (red) is the even pipeline and the right side (blue) is odd. Instructions that occur on the same line are issued the same cycle, and cycle numbers are estimates that don’t always reflect the actual time your program will take (blah blah blah branch mispredicts, consecutive loads and stores from the same address, etc…). The first thing you should take note of is that giant glaring hole in the even pipeline from cycles 15 to 28. Those are cycles where the odd pipeline is working its butt off while the even pipeline lazily loafs about like some new-age hippie freeloader. That’s what we want to prevent.

Let’s say we wanted to do something that consisted of 4 odd instructions. Since the odd pipeline is pretty much full, adding a bunch more odd instructions would definitely increase the length of our loop. However, if we could find a way to do what we want in even instructions, we might be able to schedule them into that big gap, essentially getting them for free. Even if the even-pipeline version consisted of 12 instructions, it may still be a win. Mind you, all of this is a gross oversimplification. First of all, I mention instruction count but I don’t say if the instructions have dependencies on previous instructions (either in their own pipeline or in the other) or how many cycles each instruction takes. Also, pure odd <-> even swaps are rare. Usually you make a version that reduces, not eliminates, certain pipeline instructions at the cost of adding a bunch of instructions from the other pipe.

So let’s see a concrete example completely unrelated to the odd pipeline bound example above. This little nugget of wisdom comes from Mark Cerny. The straightforward way to calculate the absolute value of a floating-point register is to AND the register with 0x7FFFFFFF (1 even instruction, 2 cycles). If your code is even bound, it is possible to calculate fabs() on a pair of floating-point registers (8 floats total) using 6 odd instructions:

```
// Shuffle most significant bytes into a single qword, separated by zero bytes
shufb temp, input1, input2, s_0A0E0I0M0a0e0i0m

// Left-shift by one bit. The sign bits are now in their own bytes.
shlqbii temp, temp, 1

// Shuffle in-place to mask off the sign bits.
shufb temp, temp, temp, s_0B0D0F0H0J0L0N0P

// Shuffle the MSBs back into their original places
shufb output1, input1, temp, s_bBCDdFGHfJKLhNOP
shufb output2, input2, temp, s_jBCDlFGHnJKLpNOP
```
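For reference, the 1-even-instruction version is easy to model in plain scalar C (the helper name is mine, and this is just a sketch of the bit trick, not the SPU intrinsic):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// fabs() via the even-pipe trick: clear the IEEE 754 sign bit by ANDing
// the float's bit pattern with 0x7FFFFFFF.
static float fabs_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);   // type-pun safely through memcpy
    u &= 0x7FFFFFFFu;           // clear the sign bit
    memcpy(&f, &u, sizeof f);
    return f;
}
```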

If you have *four* registers worth of floats, you can obviously do 4 even or 12 odd or 2 even / 6 odd. Another option is 1 even / 7 odd: 3 shufbs to get the sign bytes into one register, a single xorbi 0x80, and then four shuffles to get the sign bytes back where they belong.

So that was an example of how to take something that was normally 1 even instruction and rewrite it using 6 odd instructions. This next example comes from the Naughty Dog ICE team and it was passed along to me by Cort Stratton. All these examples take three-element vectors and transpose them from mixed to channel format. Some use entirely odd instructions, some use equal amounts of even and odd, and some are even-pipe heavy. The original document contains a huge number of variations and I am pasting three random ones here.

(0 even/7 odd) Latency: 11 cycles

—————

```
shufb temp1, in1, in2, s_AaBb     // temp1 = x1 x2 y1 y2
shufb temp2, in3, in4, s_BbAa     // temp2 = y3 y4 x3 x4
shufb temp3, in1, in2, s_CcCc     // temp3 = z1 z2 z1 z2
shufb temp4, in3, in4, s_CcCc     // temp4 = z3 z4 z3 z4
shufb out1, temp1, temp2, s_ABcd  // out1 = x1 x2 x3 x4
shufb out2, temp2, temp1, s_cdAB  // out2 = y1 y2 y3 y4
shufb out3, temp3, tem4, s_ABcd   // out3 = z1 z2 z3 z4
```

(4 even/4 odd) Latency: 10 cycles

—————

```
selb temp1, in1, in3, m_0FF0       // temp1 = x1 y3 z3 ??
selb temp2, in2, in4, m_0FF0       // temp2 = x2 y4 z4 ??
shufb temp3, in3, in4, s_CcAa      // temp3 = z3 z4 x3 x4
shufb temp4, in1, in2, s_BbCc      // temp4 = y1 y2 z1 z2
shufb temp1, temp1, temp2, s_AaBb  // temp1 = x1 x2 y3 y4
selb out1, temp1, temp3, m_00FF    // out1 = x1 x2 x3 x4
selb out2, temp4, temp1, m_00FF    // out2 = y1 y2 y3 y4
shufb out3, temp4, temp3, s_CDab   // out3 = z1 z2 z3 z4
```

(8 even/3 odd) Latency: 14 cycles

—————-

```
shufb temp1, in3, in4, s_AcBa     // temp1 = x3 z4 y3 x4
selb temp2, in1, in2, m_0FF0      // temp2 = x1 y2 z2 ??
selb temp3, in1, in2, m_FF00      // temp3 = x2 y2 z1 ??
selb temp6, temp1, temp2, m_FF00  // temp6 = x1 y2 y3 x4
shufb temp7, temp1, temp3, s_caAB // temp7 = z1 x2 x3 z4
selb temp4, in2, in1, m_FF00      // temp4 = x1 y1 z2 ??
selb temp5, in3, in4, m_FF00      // temp5 = x4 y4 z3 ??
shufb temp8, temp4, temp5, s_BCcb // temp8 = y1 z2 z3 y4
selb out1, temp6, temp7, m_0FF0   // out1 = x1 x2 x3 x4
selb out2, temp8, temp6, m_0FF0   // out2 = y1 y2 y3 y4
selb out3, temp7, temp8, m_0FF0   // out3 = z1 z2 z3 z4
```
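Every variant above computes the same thing, which in plain C is just a transpose from per-vector layout to per-channel layout (this sketch and its names are mine, not the ICE team’s code):

```c
#include <assert.h>

// Transpose four 3-element vectors from mixed (one xyz? per row) to channel
// format (row 0 = all x, row 1 = all y, row 2 = all z). The w/pad lane of
// each input is ignored, matching the SPU variants above.
static void transpose_3x4(float in[4][4], float out[3][4]) {
    for (int c = 0; c < 3; ++c)      // channel: x, y, z
        for (int v = 0; v < 4; ++v)  // vector index 1..4
            out[c][v] = in[v][c];
}
```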

Which version would you want to use? Hopefully by now you won’t automatically respond with “the one with the fewest instructions” or “the one that takes the least cycles.” The correct answer is: it depends entirely on where you want to schedule it in! So do the work, and check your pipeline analyzer.

**Unrolling**

Since everyone already knows what unrolling is, I thought it would be cool to see its slot-filling effect in action. Below is a function that loads two float vectors, adds them, and stores out the result.

The add is done on cycle 10 and takes 6 cycles to complete, so we can’t do the write until cycle 16. Since there is no other work we can do in-between, we waste a ton of time. However, maybe we can do two vector adds at once. The second vector’s read can go between the first vector’s read and the first vector’s add. The second vector’s add can go between the first vector’s add and the first vector’s store. Let’s see if it works!
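The interleaving described above can be sketched in scalar C as a hypothetical 2x-unrolled loop (the function name and signature are mine; a real SPU version would use qword loads and vector adds):

```c
#include <assert.h>

// 2x-unrolled vector add: two independent load/add/store chains are
// interleaved so the second chain's work can fill the stall slots between
// the first chain's add and its store.
static void add_vec4_x2(const float *a, const float *b, float *out, int nvec)
{
    // nvec = number of 4-float vectors; assumed even for this sketch.
    for (int i = 0; i < nvec * 4; i += 8) {
        float r0[4], r1[4];
        for (int j = 0; j < 4; ++j) r0[j] = a[i + j]     + b[i + j];     // add #1
        for (int j = 0; j < 4; ++j) r1[j] = a[i + 4 + j] + b[i + 4 + j]; // add #2
        for (int j = 0; j < 4; ++j) out[i + j]     = r0[j];              // store #1
        for (int j = 0; j < 4; ++j) out[i + 4 + j] = r1[j];              // store #2
    }
}
```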

See what I did there? The loop is still exactly 17 cycles but now we are processing two vectors instead of one. Should we press our luck and try 4x unrolling? Big speedup, big speedup, no whammies!

That one wasn’t free. Our loop is now 21 cycles but we’re processing 4 vectors at a time. Ok, one last experiment. 8X UNROLL, GO!!1!

30 cycles for 8 vectors. That’s less than 2x the cost of the original but it processes 8x as much data. Is it worth it? Maybe, that’s for you to decide. Mind you, I am trying to limit myself to unrolling only. There are a number of wasteful things the compiler is doing in the above example that, if fixed, could bring the cost way down, closer to the level of the original example.

Obviously there are some things to beware of: First of all, unrolling increases code size and therefore can be hell on your I-cache on certain platforms. Also, another SPU-specific problem is that unrolling can place too much pressure on the local store. There is already so little memory to share between code, data, and job kernels that tripling your memory usage just really isn’t feasible sometimes. You also have to be wary of your register file size. Do you have 16, 32, 64, or 128 registers? Any performance benefit you get from unrolling is pretty much right out the door as soon as you start spilling to stack. Finally, be aware that diminishing returns are in full effect. Unroll too much and not only will things stop getting faster, but they will really start getting much slower.

**The SoA<–>AoS Example I Should Have Included Last Time**

I should have included this in the vectiquette article. But I didn’t. So I will. Now.

A great, semi-problem-independent example of the benefits of SoA is the dot product. For those who may not yet have discovered Khan Academy as a great way of sucking less at life, the dot product of two vectors returns a scalar and works like this:

V1 = {X, Y, Z, W}

V2 = {A, B, C, D}

R = V1 dot V2 = (X * A) + (Y * B) + (Z * C) + (W * D)
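The formula above, verbatim in C (the function name is mine):

```c
#include <assert.h>

// Dot product of two 4-element vectors: multiply component-wise, then sum.
static float dot4(const float v1[4], const float v2[4]) {
    return v1[0]*v2[0] + v1[1]*v2[1] + v1[2]*v2[2] + v1[3]*v2[3];
}
```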

The first part is trivial. All we have to do is multiply V1 and V2 and store the result in some register. What happens next is the truly horrific part: we have to access each component of the result and add them together. Think back to vectiquette and you’ll remember that individual component access is one of the most horribly mean and cruel things you can do to your poor poor vectors. Let’s see what this single dot product looks like on the SPU.

The fm (cycle 7) does the floating point multiply, and that is followed by three rotates (cycle 13). We rotate by 1, 2, and 3 words respectively so that the first element of each vector contains something we want to add. Let’s say you have const vec_float4 jenny = {8.6, 7.5, 3.0, 9}. The whole-vector rotations would look like this:

{8.6, 7.5, 3.0, 9} // rotated 0 words

{7.5, 3.0, 9, 8.6} // rotated 1 word (4 bytes)

{3.0, 9, 8.6, 7.5} // rotated 2 words (8 bytes)

{9, 8.6, 7.5, 3.0} // rotated 3 words (12 bytes)
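That whole-vector word rotate is easy to model in C (a hypothetical helper, not the actual SPU rotate instruction):

```c
#include <assert.h>

// Rotate a 4-float vector left by a whole number of words, like the
// rotations shown above.
static void rotate_words(const float in[4], float out[4], int words) {
    for (int i = 0; i < 4; ++i)
        out[i] = in[(i + words) % 4];
}
```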

That’s how we get the separate vector components to line up so we can add them. We then do three adds (cycles 17, 23, 29) to add the results into the X component of the final result and then shuffle to splat the value across the vector. Note that the shuffle is completely unneeded since the correct value is already in each vector element. Mind you this is also wasteful because you are using a whole vector to store a single floating point dot product. If we were to rewrite this in SoA format, instead of having two vectors each containing X, Y, Z, W values, we would have 8 vectors where each contains only X values, only Y values, only Z values, or only W values. It would look something like this
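In scalar C, the SoA version is roughly this sketch (the function name and signature are mine): with each input holding one channel for four vectors, four dot products fall out of one multiply and three fma-shaped steps, with no rotates at all.

```c
#include <assert.h>

// Four dot products at once, SoA style. Each pointer holds one channel
// (all X, all Y, ...) for four vectors.
static void dot4_soa(const float *x1, const float *y1, const float *z1, const float *w1,
                     const float *x2, const float *y2, const float *z2, const float *w2,
                     float *out)
{
    for (int i = 0; i < 4; ++i) {
        float r = x1[i] * x2[i];  // fm
        r = y1[i] * y2[i] + r;    // fma
        r = z1[i] * z2[i] + r;    // fma
        r = w1[i] * w2[i] + r;    // fma
        out[i] = r;               // out holds 4 independent dot products
    }
}
```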

There are a couple cool things here. First of all, since we didn’t have to rotate the results of the multiply, we were able to use fma, or floating point multiply and add. Second, notice that despite having to load more vectors, this version takes fewer cycles than the previous version. That information is about to become all the more impressive when I tell you that we are now simultaneously doing four dot products. Let that sink in. Not only is the code itself faster to execute, but it does four dot products at once instead of one. FTW! This also means the output now contains 4 dot products instead of one.

Now some of you will look at this and think… “but I have to do four dot products, I can’t do just one.” To this I have two responses. First of all, if your data is in SoA format, you better damn well have more than one thing you are processing at once. Second, unless you’re Doing It Wrong, if you have one thing to process then you have many. Allow me to clarify. If your game loop is set up right and you are properly processing things in batches, this won’t be a problem.

One last thing. The code above kinda bothered me in that it seems to be wasting a lot of time. As a fun exercise I tried unrolling 2x and then using software pipelining to eliminate dependencies on loads. The results below show the 2x unrolled version (processes 8 dot products at a time)

and here is the pseudo software pipelining version. Cycles 17 through 34 represent the loop prologue (initial loads) and 0 through 28 is the actual dot product loop. The compiler is doing some inefficient stuff that could be remedied by going to asm, but you see my point. Maybe even without going to asm, using the X form of loads could help the compiler get rid of all those add immediates. Either way, the result is way faster than the original and processes 8x the data.

**Coming Soon: _Real_ Software Pipelining**

Originally I had planned to do a section on Software Pipelining, a subject very near and dear to my heart. However, quickly including it here just didn’t do justice to such an awesome technique, so I decided to save it for my next post. I’ll cover basic ideas, ways of doing it, and maybe even talk a little about pipelining assemblers like SPA and SPASM. Please look forward to it!

*“Take these words home and think ‘em through, or the next post I write might be about you.”*

*Mobb Deep (paraphrased)*

Jaymin Kessler