Comments on: SIMD Constants [...] whenever I want to use a constant. There seems to be some info on generating certain constants (here and here – section 13.4), but it's all assembly (which I would rather [...]

]]>
By: Jesse Cluff/2011/06/14/simd-constants/#comment-6183 Jesse Cluff Wed, 22 Jun 2011 02:45:44 +0000

Just to be clear, my argument was against forming 1.0 or similar constants using the bit manipulation (5 insns in the example given) as opposed to forming it with 2 insns but slightly longer latency. Loading it from memory is rarely desirable: it occupies a cache line and risks a cache miss. If you can easily form it with 2 insns, that's usually better.

The SPU in particular is excellent at forming just about any splatted constant with 1 or 2 insns, as well as any byte mask and many other useful things. For example, you can form {0,1,2,3,4,5,...,13,14,15} with two insns (CWD, ANDBI), then add a splatted byte to it to make a permute mask for handling misaligned loads. (Though that has the same latency as a load, 6 cycles, and there are no cache misses to worry about... Anyway, the point was just that lots of useful constants can be formed in clever ways in various instruction sets, and it's fun to try and work them out!)
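The permute-mask idea above can be sketched in portable C. This is a scalar model only, not SPU code: the function names `make_permute_mask`, `permute_bytes` and `load_misaligned` are hypothetical, and the permute is a simplified stand-in that selects bytes 0..15 from the first vector and 16..31 from the second. It makes no attempt to reproduce the CWD/ANDBI sequence itself, only the "identity bytes plus splatted offset" arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build {0,1,...,15} and add the same (splatted) byte offset to every lane. */
static void make_permute_mask(uint8_t mask[16], uint8_t offset)
{
    assert(offset < 16);
    for (int i = 0; i < 16; ++i)
        mask[i] = (uint8_t)(i + offset);
}

/* Simplified byte permute (an assumption, not an exact model of any ISA):
 * selector 0..15 picks a byte from lo, 16..31 picks a byte from hi. */
static void permute_bytes(uint8_t out[16], const uint8_t lo[16],
                          const uint8_t hi[16], const uint8_t mask[16])
{
    for (int i = 0; i < 16; ++i) {
        uint8_t sel = mask[i] & 0x1F;
        out[i] = (sel < 16) ? lo[sel] : hi[sel - 16];
    }
}

/* Emulate a misaligned 16-byte load at aligned_base + offset by reading the
 * two surrounding 16-byte blocks and permuting them with the mask. */
static void load_misaligned(uint8_t out[16], const uint8_t *aligned_base,
                            uint8_t offset)
{
    uint8_t lo[16], hi[16], mask[16];
    memcpy(lo, aligned_base, 16);
    memcpy(hi, aligned_base + 16, 16);
    make_permute_mask(mask, offset);
    permute_bytes(out, lo, hi, mask);
}
```

Calling `load_misaligned(out, buf, 5)` on a 32-byte buffer copies `buf[5]` through `buf[20]` into `out`, which is the effect the permute mask is meant to achieve.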

]]>
By: Luke Hutchinson/2011/06/14/simd-constants/#comment-5695 Luke Hutchinson Wed, 15 Jun 2011 23:18:20 +0000

Agree with moo. This technique in general is worth knowing about, and is sometimes definitely a win over loading constants from memory. But these articles do not give the full picture: they don't mention additional stalls incurred due to structural / pipeline hazards, or the fact that compilers will almost completely reorder both instructions and register use in an optimized build. I too would be surprised if the simple two-instruction version was slower when part of any larger body of code.

]]>
By: moo/2011/06/14/simd-constants/#comment-5681 moo Wed, 15 Jun 2011 17:29:53 +0000

0.75 wasn't the best example, since it's 3 times 2^-2 and can be formed with two instructions as easily as 1.0. I get the idea though.
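The arithmetic behind that remark can be checked with a small scalar sketch. It assumes the two instructions are "splat a small signed immediate" followed by "convert integer to float with a power-of-two scale" (AltiVec's vspltisw and vcfsx are one such pair); the comment does not name the instructions, so that mapping, the 5-bit immediate range, and the helper name `small_int_scaled` are assumptions.

```c
#include <assert.h>

/* Scalar model of: splat a small signed immediate, then convert it to float
 * while scaling by 2^-scale_bits. Hypothetical helper, not a real intrinsic. */
static float small_int_scaled(int imm, int scale_bits)
{
    assert(imm >= -16 && imm <= 15);            /* assumed 5-bit signed immediate */
    assert(scale_bits >= 0 && scale_bits <= 31);
    /* Dividing by a power of two is exact in binary floating point. */
    return (float)imm / (float)(1u << scale_bits);
}
```

Here `small_int_scaled(3, 2)` gives 3 × 2^-2 = 0.75 and `small_int_scaled(1, 0)` gives 1.0, so both constants fit the same two-step pattern. A value such as 0.85 is not a small integer times a power of two, so it does not.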

]]>
By: Jesse Cluff/2011/06/14/simd-constants/#comment-5678 Jesse Cluff Wed, 15 Jun 2011 16:23:45 +0000

I certainly wouldn't want to rely on my constant data always being on the same cache line as my code. Also, even if the data is in the cache, it still takes some time to load from there. And yes, cache misses impact performance; that's why we try to reduce them. Cache misses are especially a problem in gameplay and AI code. By your arguments we shouldn't even bother using the normal one-term immediates, which is simply ridiculous. Frankly, I'm finding this discussion tiresome. I just wanted to provide an alternative; if you don't want to use it, then don't.

]]>
By: Luke Hutchinson/2011/06/14/simd-constants/#comment-5671 Luke Hutchinson Wed, 15 Jun 2011 13:57:16 +0000

The locality of code and the locality of constant data is usually related. If you haven't called this module for a while, you can expect cache misses for both code and constant data. You can save a cache miss for cold code execution by eliminating all constant accesses, but that's unlikely; moreover, if such cache misses impact your execution time by a measurable amount, you're likely doing something wrong (i.e. a lot of unrelated small things after one another) anyway.

]]>
By: Jesse Cluff/2011/06/14/simd-constants/#comment-5663 Jesse Cluff Wed, 15 Jun 2011 06:34:15 +0000

Eliminating load-hit-stores is obviously achievable without such measures: just load the constant from memory. In fact, GCC has generated proper code for immediate vector constants (e.g. (vector float){1, 2, 3, 4}) since forever, and even MSVC can now properly extract immediate vector constant data to rdata.

The benefit of removing the need to touch the memory is questionable – if the memory is in cache, it’s the same as touching code memory. If it’s not in cache… well, loading instructions could’ve caused a cache miss too.

And oh yes, I find the need to replace human-readable constant values with magic expressions cumbersome. When you need to change 0.75 to 0.85, you have to think about how to do that... Besides, if you don't have a lot of constants, why do this at all?

]]>
By: Jesse Cluff/2011/06/14/simd-constants/#comment-5660 Jesse Cluff Wed, 15 Jun 2011 04:53:07 +0000

For PPU, in the case of a 2-term solution, you've traded the 16b of data space and 8b of code space that plain constant construction costs for 20b of code space if the p's are distinct (8b for each constant and 4b for the addition). You've also made constant construction cumbersome (GENERATE_CONSTANT2(1, 1, 1, 2) instead of 0.75f?). Why would you want to do that?

On SPU it’s even more pointless (absolute load in 1 insn, fixed 6c load latency).

]]>