Comments on: Software Pipelining (failed video experiment)

I think what you are talking about is called predicated writes. I actually use that quite a bit when trying to make stuff branch-free. Very useful trick!
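For readers who have not met the term, here is a minimal sketch of a predicated write in Python. It assumes unsigned 32-bit values and an all-ones or all-zeros mask, which is the kind of value a SIMD compare produces; the names `compare_gt`, `select` and `predicated_store` are made up for illustration and do not come from the videos.

```python
MASK32 = 0xFFFFFFFF  # assumption: values are unsigned 32-bit lanes


def compare_gt(a, b):
    """Return an all-ones mask if a > b, else all zeros, without an if/else."""
    return -(a > b) & MASK32


def select(mask, if_set, if_clear):
    """Bitwise select: take bits of if_set where mask is 1, if_clear elsewhere."""
    return (if_set & mask) | (if_clear & ~mask & MASK32)


def predicated_store(dest, index, new_value, mask):
    """Always write; the mask decides whether the old or the new value lands."""
    dest[index] = select(mask, new_value, dest[index])


def clamp_to_limit(values, limit):
    """Clamp every value to limit with a select instead of a branch."""
    return [select(compare_gt(v, limit), limit, v) for v in values]
```

The store happens on every iteration, so there is no branch to mispredict; the cost is that both candidate values are always computed.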

]]>
By: Jon Rocatis/2011/03/16/software-pipelining-failed-video-experiment/#comment-2303 Jon Rocatis Sun, 03 Apr 2011 21:54:25 +0000
One “last” point. The thing I did really like was how the first video showed dependencies three different ways. If people were a little unsure of how dependencies worked and how SP can help eliminate them in a loop iteration, I’d hope that would really help them internalize it and get an intuitive feel for it.

I would have liked to come up with intuitive data visualizations like that for the other videos.

]]>
By: Jaymin Kessler/2011/03/16/software-pipelining-failed-video-experiment/#comment-1755 Jaymin Kessler Sat, 19 Mar 2011 11:46:25 +0000
Totally *not* a fail, Jaymin. It was great! You totally make me want to do some video posts too! :)

]]>
By: Pal-Kristian Engstad/2011/03/16/software-pipelining-failed-video-experiment/#comment-1734 Pal-Kristian Engstad Fri, 18 Mar 2011 22:59:55 +0000
When I heard OtherOS support was being cut, I immediately went out and bought some PS3s. I now have a dedicated dev box that I use for programming, and another PS3 for Torne and games.

While I don’t want to encourage piracy, a dedicated PS3 Linux box with full access to the RSX would be pretty sweet. If you have no intention of ever using it for gaming, that might be the way to go. Not using it for gaming helps avoid the cat-and-mouse game of iPhone-style jailbreaking.

Otherwise, if that makes you uncomfortable, you can get fat PS3s off eBay and use those.

]]>
By: Jonathan/2011/03/16/software-pipelining-failed-video-experiment/#comment-1695 Jonathan Fri, 18 Mar 2011 02:12:48 +0000
The OOO window on modern CPUs is big, but not infinite, and giving it less inherent stress means it has more slots free to absorb cache misses and branch mispredictions. It’s still a good idea to intermix A and B as much as you can as long as you don’t run out of registers. And of course you get twice as many of them in 64-bit mode!

]]>
By: Guy Sherman/2011/03/16/software-pipelining-failed-video-experiment/#comment-1687 Guy Sherman Fri, 18 Mar 2011 00:59:46 +0000
Glad you liked it! I think next I will try translating some of my

Really liked the video article. It was like attending a conference talk. Sooo much better than trying to read the slide decks people normally post from things like GDC talks, and imagine all the useful stuff they’re saying that isn’t on the slides.

]]>
By: Pal-Kristian Engstad/2011/03/16/software-pipelining-failed-video-experiment/#comment-1681 Pal-Kristian Engstad Thu, 17 Mar 2011 22:04:51 +0000
Thanks for the interesting video. I was wondering if this was stuff you do by hand (the assembly jiggery-pokery), but from what you say in the above reply your example code was all generated by the compiler? I understand that in real cases you may need to do some stuff by hand to get absolutely optimal code.

Your talks are about the Cell SPUs, but is this kind of thing also available on PC CPUs? I don’t recall ever seeing it mentioned for those, and a quick google doesn’t tell me otherwise.

In any case, I’m looking forward to the next trick.

]]>
By: Jaymin Kessler/2011/03/16/software-pipelining-failed-video-experiment/#comment-1671 Jaymin Kessler Thu, 17 Mar 2011 12:51:10 +0000
Really good presentation – excellent work in explaining this in a very accessible way. I’ve been trying to explain this to someone recently, and I’ve not managed to explain it as well as this! There are a couple of points I’d make though.

In the worked example of scheduling in video 6, you keep saying “wrap round and choose the first available slot”. In your examples, this was always true, but it’s worth mentioning that it should be “the first available slot after (dependency_position+latency)%ii_length” – i.e. in the first example, the slot should be slot 9, so this is wrapped around to become slot 2, and the next available slot is slot 3 in this case.
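The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `first_free_slot` and `occupied` are hypothetical names, "after" is read as "at or after", and the split of 9 into a dependency at slot 3 with a latency of 6 is made up so that the sum matches the example. An ii_length of 7 is inferred from the arithmetic (9 wraps to 2). The sketch tracks only the slot within the loop body, not which iteration the instruction ends up belonging to.

```python
def earliest_slot(dependency_position, latency, ii_length):
    """Earliest slot at which the result is ready, wrapped into the loop body."""
    return (dependency_position + latency) % ii_length


def first_free_slot(dependency_position, latency, occupied, ii_length):
    """Scan forward (wrapping) from the earliest legal slot to the first free one.

    occupied is the set of slots already taken on the relevant pipe.
    Returns None when every slot is taken, i.e. ii_length is too small.
    """
    start = earliest_slot(dependency_position, latency, ii_length)
    for offset in range(ii_length):
        slot = (start + offset) % ii_length
        if slot not in occupied:
            return slot
    return None
```

With a dependency at slot 3, a latency of 6 and slot 2 already taken, `first_free_slot(3, 6, {2}, 7)` returns slot 3, which matches the numbers in the comment.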

Also, you haven’t touched on it, but it’s also critical to understand the instruction set available to you and think about alternatives to the code the compiler generated or even to change the algorithm completely. In this example, the loop can be reduced down to a 4-slot loop. You’re adding 16 to each address using the 3 ai instructions. Instead, you could use a single index register and use the lqx, stqx forms instead, meaning that you can eliminate 2 of the adds. You then would end up with 4 even and 4 odd instructions in the loop:
li $12,0
.L4:
ai $9,$9,1
lqx $11,$8,$12
lqx $3,$7,$12
ceq $4,$6,$9
fm $10,$11,$3
stqx $10,$5,$12
ai $12,$12,16
.L8:
brz $4,.L4

Obviously, in this case, you’d also want to unroll the loop at least once as the latency on the fm is 6 cycles, but at least you’re now at a better starting point as you have balanced even and odd schedules.
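As a quick check of the "4 even and 4 odd" claim, here is a small Python sketch that tallies the loop body by pipeline. The pipe table covers only the mnemonics used in this loop (arithmetic, compare and floating-point multiply on the even pipe; loads, stores and branches on the odd pipe), and the store is written as stqx, the form the comment names. `pipe_counts` and `min_cycles_per_iteration` are hypothetical helper names, and the cycle figure is only an issue-rate lower bound that ignores latencies.

```python
EVEN_PIPE = {"ai", "ceq", "fm"}          # arithmetic, compare, float multiply
ODD_PIPE = {"lqx", "stqx", "stqd", "brz"}  # loads, stores, branches

LOOP_BODY = [
    "ai $9,$9,1",
    "lqx $11,$8,$12",
    "lqx $3,$7,$12",
    "ceq $4,$6,$9",
    "fm $10,$11,$3",
    "stqx $10,$5,$12",
    "ai $12,$12,16",
    "brz $4,.L4",
]


def pipe_counts(instructions):
    """Count how many instructions issue on each pipe."""
    counts = {"even": 0, "odd": 0}
    for line in instructions:
        mnemonic = line.split()[0]
        if mnemonic in EVEN_PIPE:
            counts["even"] += 1
        elif mnemonic in ODD_PIPE:
            counts["odd"] += 1
        else:
            raise ValueError("unknown mnemonic: " + mnemonic)
    return counts


def min_cycles_per_iteration(instructions):
    """With one instruction per pipe per cycle, the busier pipe sets the floor."""
    counts = pipe_counts(instructions)
    return max(counts["even"], counts["odd"])
```

A balanced 4/4 split is what makes a 4-cycle loop body possible in principle; actually reaching it still depends on hiding the fm and load latencies, which is where the unrolling comes in.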

This is a simple optimisation, but there are other algorithmic changes that might make more of an impact. Personally, I tend to focus on how to massage the data before the loop such that it’s as close as possible to how the tight loop needs it. For example, when processing some swizzled image data, rather than calculating the deswizzled address every time you need it, it might be quicker to deswizzle an entire tile, process it and then reswizzle it.
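The tile idea can be sketched as follows. Everything specific here is an assumption for illustration: the swizzle is taken to be Morton (Z-order) within a 4x4 tile, the processing step is a 3-tap horizontal sum with clamped edges, and all function names are made up. Real swizzle formats and tile sizes differ.

```python
TILE = 4  # assumption: 4x4 tile, Morton (Z-order) swizzled


def morton_index(x, y, bits=2):
    """Interleave the bits of x and y to get the swizzled offset."""
    index = 0
    for b in range(bits):
        index |= ((x >> b) & 1) << (2 * b)
        index |= ((y >> b) & 1) << (2 * b + 1)
    return index


def deswizzle_tile(swizzled):
    """Convert a swizzled tile into plain rows, one address calc per pixel."""
    return [[swizzled[morton_index(x, y)] for x in range(TILE)] for y in range(TILE)]


def reswizzle_tile(rows):
    """Convert plain rows back into swizzled order."""
    out = [0] * (TILE * TILE)
    for y in range(TILE):
        for x in range(TILE):
            out[morton_index(x, y)] = rows[y][x]
    return out


def blur_direct(swizzled):
    """Compute a swizzled address for every read and write inside the loop."""
    out = [0] * (TILE * TILE)
    for y in range(TILE):
        for x in range(TILE):
            total = 0
            for dx in (-1, 0, 1):
                nx = min(max(x + dx, 0), TILE - 1)  # clamp at the tile edge
                total += swizzled[morton_index(nx, y)]
            out[morton_index(x, y)] = total
    return out


def blur_tiled(swizzled):
    """Deswizzle once, run an address-free inner loop, then reswizzle once."""
    result = []
    for row in deswizzle_tile(swizzled):
        padded = [row[0]] + row + [row[-1]]  # clamped edges
        result.append([padded[x] + padded[x + 1] + padded[x + 2] for x in range(TILE)])
    return reswizzle_tile(result)
```

Both versions produce the same tile; the second one keeps the address arithmetic out of the inner loop, which is the part that would be software pipelined.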

]]>
By: Tom/2011/03/16/software-pipelining-failed-video-experiment/#comment-1663 Tom Thu, 17 Mar 2011 08:44:41 +0000
Ah yes, I could have made that clearer in the animation. I could add something showing the instruction whose result we need and maybe an arrow or animation showing the earliest possible time we can schedule something.

How is this? The instruction on the right has an arrow pointing to the dependency on the left (just like it does now), but then we have an arrow showing how long the instruction we depend on takes (and therefore the earliest we can schedule the new instruction)? Or I can have some rectangle blocking off the area while we are waiting for the result of the instruction?

If it’s really that unclear to someone of your level, I should probably make a new revision, as other less experienced people will be way more confused.

]]>
By: Jaymin Kessler/2011/03/16/software-pipelining-failed-video-experiment/#comment-1649 Jaymin Kessler Thu, 17 Mar 2011 01:58:12 +0000
The dependency arrows between instructions are not displayed until after you’ve started talking about whether there’s a dependency or not, which is why I wondered if there was audio lag. After watching for long enough it becomes clear(er), but there was not enough information up front for me to make sense of it from the start.

Either an introductory explanation or dependencies that are visible throughout would make it clearer, I think.

]]>
By: Jeremiah Zanin/2011/03/16/software-pipelining-failed-video-experiment/#comment-1645 Jeremiah Zanin Thu, 17 Mar 2011 01:34:33 +0000
This is great! Thank you for posting this. Failure? I don’t think so.

]]>
By: Jaymin Kessler/2011/03/16/software-pipelining-failed-video-experiment/#comment-1642 Jaymin Kessler Thu, 17 Mar 2011 01:03:06 +0000
Great videos, very enjoyable. Thanks, Jaymin!

To start with, I was a bit confused about dependencies in the example in Episode 6 – not sure if the audio is lagging the video or whether I’m just slow :P Volume could perhaps be a bit higher as well.

I look forward to your next videos :)

For further reading, here’s an SPU software pipelining optimisation example from Pal-Kristian Engstad at Naughty Dog: http://www.naughtydog.com/docs/gdc2010/intro-spu-optimizations-part-2.pdf

]]> By: Glenn Watson/2011/03/16/software-pipelining-failed-video-experiment/#comment-1638 Glenn Watson Thu, 17 Mar 2011 00:28:06 +0000