Beyond intrinsics to code idiom

A brief history of SIMD

I first encountered SIMD architectures back in the early 1990′s when I was writing MPEG and JPEG codecs for a variety of CPUs.

I was asked by a couple of chip manufacturers to come up with some extra instructions to help with the DCT (discrete cosine transform) calculation.

I had been using the 8087 FPU to do several calculations in parallel by dividing the 63 bit mantissa into groups of bits and doing three DCT calculations in the same time of one. A practice now called “structures of arrays”.

[zeros] [Y0] [zeros] [Y1] [zeros] [Y2]

The numbers were biased to make the result positive.

We could add these values and multiply by small constants, but had to constantly re-bias the numbers to prevent the YUV values from leaking into adjacent numbers.

For our SIMD design, we added a mode that disabled the carry between parts of the accumulator creating what David May, designer of the transputer, called a “split accumulator”.

SIMD architectures later appeared on other CPUs, eventually making it into X86 architectures in the form of 3D Now! and SSE. Now pretty much every flavour of CPU has a SIMD architecture, VMX/Altivec, NEON, Paired float are examples and SPUs and GPUs are exclusively SIMD.

SIMD now!

Today, most programmers either use inline assembler or intrinsics to implement vector and matrix classes.

Inline assembler is very much frowned upon these days, partly because of maintainability, partly because it prevents the compiler from optimizing around the asm statements and so intrinsics tend to be the norm.

Here is an example from an SSE vector class:

class vec4 {
    __m128 v128;
    vec4() {}
    vec4(__m128 value) { v128 = value; }
    vec4(float x, float y, float z, float w) { v128 = _mm_setr_ps(x, y, z, w); };
    float &operator[](int i) { return v128.m128_f32[i]; }
    const float &operator[](int i) const { return v128.m128_f32[i]; }
    vec4 operator*(float r) const { return vec4(_mm_mul_ps(v128, _mm_set_ss(r))); }
    vec4 operator*(const mat4 &r) const;
    vec4 operator+(const vec4 &r) const { return vec4(_mm_add_ps(v128, r.v128)); }

The _mm_add_ps and other functions generate the required vector operations to add all four lanes of the vector value together. On some compilers, these then participate in optimization by combining multiplies and adds, for example, into fused operations.

The advantages of this are obvious, we can perform four times the work of the scalar FPU in the same number of cycles.

Beware the SIMD of doom!

There are a very large number of caveats, however, and it is fact very rare to get any acceleration at all from SIMD classes, despite this obvious advantage.

First, in order to get scalar values into the right place in vectors, we need to execute “permutes” and copies to move scalar data to SIMD registers. So unless your compiler is able to convert FPU instructions to SIMD instructions behind the scene, you will end up with slower code than just writing scalar code.

Second, FPU and SIMD code does not mix, with a pipeline flush usually required between FPU and SIMD operations. This mode switch will usually cripple performance unless the SIMD code is run in a very clean environment.

Also, SIMD code requires the presence of an expert to debug it and must be re-written for every platform.

Moving on to idiom

To address these and other issues, compiler developers are moving towards the concept of “idiom” to represent corners of the target instruction set that can’t be reached by regular C code.

A code idiom is a patten of C code that will be translated into a specific machine instruction, if it is cost effective to do so. Often the patterns are quite complex reflecting the complex behaviour of the underlying architecture.

Often CPUs have complex instructions that can do many things at once. For example, most X86 instructions have a combination of address calculation, load, arithmetic and store. We are used to these being assembled correctly by the compiler.

But on some architectures, for example, there are instructions to load and byte swap or instructions to compute population counts and reverse bit patterns and most importantly, there are instructions to operate on vector data types.

class vec4 __attribute__( ( aligned( 16 ) ) {
    float v[4];
    vec4 operator+(const vec4 &r) const {
      return vec4(v[0]+r.v[0], v[1]+r.v[1], v[2]+r.v[2], v[3]+r.v[3]);

Here we have some code that will run on any platform that supports GCC-style alignment. Apart from the alignment of 16 bytes, this code is just regular C++ code.

The alignment is crucial for performance. In fact alignment in general is a useful optimization to avoid data structures falling across cache line boundaries, but that is another issue.

In this case having an aligned structure allows the compiler to combine all the loads of the four elements into one vector load.

On the ARM platform, any consecutive loads are candidates for multiple loads, both integer and floating point, up to a very large limit, and as over 50% of instructions executed are loads in a typical game, this is a considerable advantage. We also hope that in future architectures, scatter-gather DMA will be available for little cost in userland as it is on the Cell SPU and many GPUs.

Once loaded, it is a no-brainer to combine the four adds into a SIMD add and likewise the four stores. So by expressing our code in straight C code, we have achieved the same performance as writing platform specific intrinsics.

Here it is in a formal grammar.

{ p[0], p[1], p[2], p[3] } -> sse load
  { a[0]+b[0], a[1]+b[1], a[2]+b[2], a[3]+b[3] } -> sse add

Inside the compiler, the combiner is perpetually searching for patterns that it knows it can reduce without penalty to a simpler set of instructions.

Given a target architecture, we try to provide combines that can generate every instruction available such as byte-swapping loads, combined shift and add, bitfield insertion and so on.

In an ideal compiler, we would not need to have intrinsics for any target instruction as we would simply have to publish the idiom.

Examples of idioms:

rotate = ( x << n) | ( x >> (32-n) );
  byte_swap = ( x << 24 ) | ( x << 8 ) & 0xff0000 | ( x >> 8 ) & 0xff00 | ( x >> 24 );

Infix vector operations

In addition to idioms, many compilers, such as CLang and GCC, support infix vector operations.

float __attribute__ ((vector_size (16))) a, b, c;
  c = a + b;

Here we have used the GCC vector_size notation to define a platform independent vector type.

We can add these types to avoid using intrinsics.

typedef float v4f __attribute__ ((vector_size (16)));
  v4f a, b, c;
  c = (v4f){ a[0], a[1], b[2], b[3] };

This is an example of a platform-independent permute. Although GCC supports this, it may not be efficient, but CLang gets it right.

Clang also allows vectors to be of arbitrary length, such as 64 elements.

Designing a good idiom

The ideal idiom for a CPU instruction should be one that is frequently observed in real code examples and ideally that represents an efficient alternative to that instruction, should the CPU not support it.

For example, although we could convert a loop that counted bits into a population count instruction, we would prefer to convert a set of shift/masks for the log(n) solution. Either should work, however, and the analysis is key to making this work.

Idioms are often disrupted by other optimizations, such as strength reduction, so a robust idiom detector is essential, for example by examining the origins of every bit of a result, searching for patterns.

LLVM provides some simple pattern matching in its instruction selection phase, and will be able to generate a wide variety of instructions. However, phase-based compilers, although easy to construct and maintain, are notorious for missing opportunities and often do not have a good cost model. The next generation of AI-driven cost-sensitive technology will hopefully perform much better than human generated heuristic models.

It is also worth noting that in many cases, using a specific machine instruction is not the most efficient thing to do. Although chip makers constantly extend the instruction set, they often provide little support for older instructions which get relegated to long pipeline delays.

Some we just can’t do

Some machine instructions are so deep that we just can’t represent them in C code.

Examples are access to special machine registers and pipeline syncronization instructions such as data memory barriers.

Atomics are also difficult to represent in the language, although many attempts have been made to add transacted memory blocks and atomic variables to the language like the Java volatile.

One thing is certain, that we don’t want to add layer upon layer of library code to the language to represent corner cases, that just makes the language harder to understand for beginners and presents barriers for optimization.

Example of infix vector notation

The following shows two ways of generating a vector cross product function, one with generic vector operations and one with machine-specific intrinsics.

I hasten to add that I haven’t tried the latter as I couldn’t find any working documentation on GCC’s SSE intrinsics.

The former compiles on CLang and another compiler, the latter shows what is required from intrinsics. I wanted to show how to do this using ARM’s intrinsics, but after twenty minutes or so of looking up functions, I gave up.

  typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;
  float32x4_t cross_product(float32x4_t x, float32x4_t y) {
    // yz - zy, zx - xz, xy - yx, ??
    float32x4_t x1200 = (float32x4_t){ x[1], x[2], x[0], x[0] };
    float32x4_t y1200 = (float32x4_t){ y[1], y[2], y[0], y[0] };
    float32x4_t x2010 = (float32x4_t){ x[2], x[0], x[1], x[0] };
    float32x4_t y2010 = (float32x4_t){ y[2], y[0], y[1], y[0] };
    return x1200 * y2010 - x2010 * y1200;
  typedef __m128 float32x4_t;
  float32x4_t cross_product(float32x4_t x, float32x4_t y) {
    float32x4_t x1200 = _mm_shuffle_ps(x, x, 9);
    float32x4_t x1200 = _mm_shuffle_ps(y, y, 9);
    float32x4_t x2010 = _mm_shuffle_ps(x, x, 18);
    float32x4_t x2010 = _mm_shuffle_ps(y, y, 18);
    float32x4_t a = _mm_mul_ps(x1200, y2010);
    float32x4_t b = _mm_mul_ps(y1200, x2010);
    return _mm_sub_ps(a, b);

Example of rotate idiom on two compilers

unsigned rotate(unsigned value, unsigned amount) {
    return (value << amount) | (value >> (32-amount));
  This code generates the following from these two compilers:
          mov     eax, DWORD PTR _value$[esp-4]
          mov     ecx, DWORD PTR _amount$[esp-4]
          rol     eax, cl
          movb    8(%esp), %cl
          movl    4(%esp), %eax