Abusing C++ is an exploration of how far a compiler can be pushed using metaprogramming techniques. Templates are typically frowned upon in game development due to the additional complexity and the increased build times, but they are an incredibly powerful mechanism that should not be discounted wholesale. In this series we’ll explore situations where metaprogramming provides a new outlook for solving a problem faced in game development.

In this first article we’ll explore expression templates, a technique used in mathematics libraries to produce optimal assembly while presenting a simplified interface. This initial portion will demonstrate the technique via a valarray implementation.

I was first exposed to Java while pursuing my Bachelor’s. Coming from a C++ background one of the first deficiencies I noticed was the lack of operator overloading. The professor explained it away as a response to people abusing this facility in C++. While I have encountered ill-advised use of operator overloading, mathematical constructs cry out for this ability.

As an example, which of these is easier to understand?

result = a * b + c * d;           // Makes perfect sense
 
   
 
  result = add(mul(a,b), mul(c,d)); // If you said this you’re either lying or use
 
                                    // a language without operator overloading

Now lets assume we’d like to apply various mathematical operations to numerical arrays. Our main goal is to provide an interface that’s consistent with the built in mathematics types. This allows the clients of the library to write code that is succinct compared to code utilizing the array directly.

// valarray implementation
 
  result = a * b + c * d;
 
   
 
  // C array implementation
 
  for (std::size_t i = 0; i < size; ++i)
 
      result[i] = a[i] * b[i] + c[i] * d[i];

Before we begin it should be noted that we will be creating functionality already present in the Standard Template Library, specifically the valarray class. Expression templates were developed initially with valarray implementations in mind, though there is nothing in the STL spec that mandates their usage. This classic example provides a great starting point for delving into the technique. For the duration lets forget about its presence in STL and get started on our own implementation.

A Naive Implementation

Given the requirements a straightforward implementation would resemble the following.

template <typename Real>
 
  class valarray
 
  {
 
  public:
 
   
 
      //------------------------------------------------------------
 
      // Additional methods and sanity checking removed for brevity
 
      //------------------------------------------------------------
 
   
 
      valarray(std::size_t size)
 
      : _size(size)
 
      , _values(new Real[size])
 
      { }
 
   
 
      ~valarray()
 
      {
 
          delete[] _values;
 
      }
 
   
 
      valarray& operator= (const valarray& copy)
 
      {
 
          assert(_size == copy._size);
 
   
 
          for (std::size_t i = 0; i < _size; ++i)
 
              _values[i] = copy._values[i];
 
   
 
          return *this;
 
      }
 
   
 
      inline valarray operator+ (const valarray& rhs) const
 
      {
 
          valarray result(_size);
 
   
 
          for (std::size_t i = 0; i < _size; ++i)
 
              result._values[i] = _values[i] + rhs._values[i];
 
   
 
          return result;
 
      }
 
   
 
  private:
 
   
 
      std::size_t _size;
 
      Real* _values;
 
   
 
  } ;

So what does the execution of our original expression look like?

// result = a * b + c * d; becomes
 
   
 
  valarray<Real> temp1 = a * b;        // Loop n times, multiplying elements
 
                                       // n items are allocated and later deallocated
 
  valarray<Real> temp2 = c * d;        // Loop n times, multiplying elements
 
                                       // n items are allocated and later deallocated
 
  valarray<Real> temp3 = temp1 + temp2 // Loop n times, adding elements
 
                                       // allocation/deallocation n elements
 
  results = temp3;                     // Read/Write n times

So our naive implementation makes a couple mistakes that will drastically affect performance, both of which boil down to the temporary instances being created. If we speak in terms of Big O the calculation running time of the calculation would be O(4n), and the storage requirements balloon to O(3n). While technically both behaviors in Big O speak simplify down to O(n) the reality of the situation is bleaker. The additional looping and memory usage results in an implementation that runs orders of magnitude slower than treating it like a C array.

const std::size_t size = a.size();
 
   
 
  for (std::size_t i = 0; i < size; ++i)
 
      result[i] = a[i] * b[i] + c[i] * d[i];

This will generate the optimal assembly for this operation, but doesn’t meet our requirements. We’d still like to treat the valarray how we would treat a built-in mathematical construct. So the question becomes can we coax the compiler to create the optimal assembly while keeping our preferred interface?

The answer is yes, and we can achieve this utilizing a technique called expression templates.

Expression Templates

Using metaprogramming we can create mathematical constructs that provide a terse interface and generate ideal assembly. Expression templates are used extensively in a number of high performance computing libraries such as Eigen. The technique has also been utilized to create lambdas, though this usage is moot now that lambdas have been added to the C++ standard.

To generate assembly that matches the optimal case we need to implement code that builds the for loop. Actualizing this requires us to control two things; the temporary instances being created, and when the entire expression is evaluated.

We’ll begin by changing the type of object the mathematical operators return. We want to create an object that wraps the mathematical operation specified, and references an underlying array or the results of another operation. This requires the internal storage of the valarray to be modified to accomadate this. From there we specify when the expression is evaluated by utilizing the assignment operator. Finally, we’ll walk through the steps required by the compiler to produce the optimal assembly.

Encoding the Expression

As we saw when executing our naive implementation, the temporary instances of our valarray caused our execution time, and storage requirements, to grow substantially. Rather than returning an entirely new instance of the valarray class when encountering a portion of the expression, we can respond with an object that wraps each mathematical operation.

As an example we’ll encode the multiplication into an expression. This is done through a lightweight structure that wraps the operation.

template <typename Real, typename Op1, typename Op2>
 
  class valarray_mul
 
  {
 
  public:
 
   
 
      valarray_mul(const Op1& a, const Op2& b)
 
      : _op1(a)
 
      , _op2(b)
 
      { }
 
   
 
      Real operator[] (std::size_t i) const
 
      {
 
          return _op1[i] * _op2[i];
 
      }
 
   
 
      std::size_t size() const
 
      {
 
          return _op1.size();
 
      }
 
   
 
  private:
 
   
 
      const Op1& _op1;
 
      const Op2& _op2;
 
  } ;

The first point of interest is the template arguments. The first argument, Real, corresponds to the underlying data type contained in the valarray. The second argument, Op1, refers to the left hand side of the expression, while the third argument, Op2, refers to the right hand side of the expression. This allows us to chain together multiple expressions since each side of the expression will be expanded during compilation.

The second point of interest is the constructor. The values passed are constant references which ensures no additional storage is created.

The final point of interest is the array operator. The actual computation is not done until that operator is invoked. This plays into the final portion of the implementation, where the entire expression is evaluated, so just keep it in the back of your mind for the moment.

Next we need to modify the valarray class to interact with our expressions. This requires a layer of indirection within the internal storage of the valarray so it can store not only an array of values, but also the results of an expression. The container must implement a simple array interface, which is utilized by the valarray holding it. Our valarray_mul class meets this prerequisite.

template <typename Real, typename Rep>
 
  class valarray
 
  {
 
  public:
 
   
 
      valarray(const Rep& rep)
 
      : _rep(rep)
 
      { }
 
   
 
      valarray(const valarray& copy)
 
      : _rep(copy._rep)
 
      { }
 
   
 
      std::size_t size() const
 
      {
 
          return _rep.size();
 
      }
 
   
 
      Real operator[] (std::size_t i) const
 
      {
 
          return _rep[i];
 
      }
 
   
 
      const Rep& rep() const
 
      {
 
          return _rep;
 
      }
 
   
 
  private:
 
   
 
      Rep _rep;
 
  } ;

The valarray simply wraps the methods exposed in the storage class, allowing it to hold an instance of the valarray_mul class. The multiplication operator now needs to be modified to return such an object.

template <typename Real, typename Lhs, typename Rhs>
 
  valarray<Real, valarray_mul<Real, Lhs, Rhs> > operator*
 
      (const valarray<Real, Lhs>& lhs, const valarray<Real, Rhs>& rhs)
 
  {
 
      return valarray<Real, valarray_mul<Real, Lhs, Rhs> >
 
          (valarray_mul<Real, Lhs, Rhs>(lhs.rep(), rhs.rep()));
 
  }

This is where the moniker expression templates comes from. We’re using the compiler to encode the expression tree at compile time, creating the following.

The expression tree encoded


So the end result of our original expression would be an object of the following type.

valarray<Real, valarray_add<
 
      valarray<Real, valarray_mul<valarray<Real>, valarray<Real> > >,
 
      valarray<Real, valarray_mul<valarray<Real>, valarray<Real> > > >

The results of C++ operators isn’t the only thing that can be encoded. Other operations, such as the square root, can be implemented within the expression.

// sqrt expression creation
 
  template <typename Real, typename Op>
 
  valarray<Real, valarray_sqrt<Real, Op> > sqrt(const valarray<Real, Op>& value)
 
  {
 
      return valarray<Real, valarray_sqrt<Real, Op> >
 
          (valarray_sqrt<Real, Op>(value.rep()));
 
  }
 
   
 
  // sqrt expression implementation
 
  template <typename Real, typename Op>
 
  class valarray_sqrt
 
  {
 
  public:
 
   
 
      valarray_sqrt(const Op& a)
 
      : _op(a)
 
      { }
 
   
 
      Real operator[] (std::size_t i) const
 
      {
 
          return std::sqrt(_op[i]);
 
      }
 
   
 
      std::size_t size() const
 
      {
 
          return _op.size();
 
      }
 
   
 
  private:
 
   
 
      const Op& _op;
 
   
 
  } ;

With the expression encoded we can move on to its evaluation.

Evaluating the Expression

The final piece of the puzzle involves controlling when the expression is computed. It’s totality needs to be present before proceeding with the evaluation. This functions as a sort of lazy evaluation on the whole formula. The place within the code to invoke the expression is within an assignment operator.

template <typename Rep2>
 
  valarray& operator= (const valarray<Real, Rep2>& copy)
 
  {
 
      std::size_t count = size();
 
   
 
      for (std::size_t i = 0; i < count; ++i)
 
          _rep[i] = copy[i];
 
   
 
      return *this;
 
  }

As you can see this is where the last piece of the puzzle, the for loop, is generated. The assignment operator is templated so it can accept any combination of expressions contained within the valarray being assigned. From there it just invokes the array operator which in turn will invoke each portion of the expression.

Rather than taking my word for it, lets mimic the compiler. We’ll walk through the template expansion done on the code.

Expanding the Expression

With the code in place we can explore how the compiler will expand the expression. Expression templates rely on aggressive inlining by the compiler. To transform our code to the equivalent C array each expression must be inlined in its entirety. Before walking through the inlining lets take a quick detour into how compilers accomplish this.

The inline keyword, and its nonstandard, more explicit cousin __forceinline, act as C4714 with a pragma. Our implementation actually requires the __forceinline specifier to result in optimal code due to the amount of nesting within the templates.

With that said let’s walk through the steps the compiler will need to take in order for it to generate the optimal assembly. We’ll be working in the ideal sense, as the actual output is dependent on the compiler used.

To reiterate, the expression results in the following object being created.

result = valarray<float, valarray_add<
 
      valarray<float, valarray_mul<valarray<float>, valarray<float> > >,
 
      valarray<float, valarray_mul<valarray<float>, valarray<float> > > >

Expansion of the assignment operator gives us this.

template <>
 
  valarray& operator= (const valarray<float, /* see above */ >& copy)
 
  {
 
      assert(size() == copy.size());
 
      std::size_t count = size();
 
   
 
      for (std::size_t i = 0; i < count; ++i)
 
          _rep[i] = copy[i];
 
   
 
      return *this;
 
  }

From here we can begin expanding the copy argument. We’ll examine an expansion of the valarray_mul instance, and then move onto the full expression.

template <>
 
  valarray<float, valarray_mul<valarray<float>, valarray<float> >
 
  {
 
      float operator[] (std::size_t i) const
 
      {
 
          return _rep[i];
 
      }
 
  } ;
 
   
 
  template <>
 
  valarray_mul<valarray<float>, valarray<float> >
 
  {
 
      float operator[] (std::size_t i) const
 
      {
 
          return _op1[i] * _op2[i];
 
      }
 
  } ;
 
   
 
  // Expands into
 
  a[i] * b[i];

To get the full expression we expand the array operator for a valarray_add instance.

template <>
 
  valarray<float, valarray_add<
 
      valarray<float, valarray_mul<valarray<float>, valarray<float> > >,
 
      valarray<float, valarray_mul<valarray<float>, valarray<float> > > >
 
  {
 
      float operator[] (std::size_t i) const
 
      {
 
          return _rep[i];
 
      }
 
  } ;
 
   
 
  template <>
 
  valarray_add<valarray<float>, valarray<float> >
 
  {
 
      float operator[] (std::size_t i) const
 
      {
 
          return _op1[i] + _op2[i];
 
      }
 
  } ;
 
   
 
  // Expands into
 
  a[i] * b[i] + c[i] * d[i];

So inserting this back into the assignment operator we arrive back at our optimal code.

const std::size_t count = size();
 
   
 
  for (std::size_t i = 0; i < count; ++i)
 
      _rep[i] = a[i] * b[i] + c[i] * d[i];

Limitations

A discussion of expression templates wouldn’t be complete without mentioning its limitations. As you can see we’re created quite a bit of code to essentially avoid writing a for loop. This has increased the complexity of the code that has to be maintained, and as a corollary has increased the amount of effort required to debug it.

It also increases the workload of the compiler which has to expand and inline the templates. The types generated by this technique are deeply nested, and will require a significant amount of time to compile. If this technique is heavily used within a codebase it will greatly influence the build time, which will drag down a programmer’s productivity. The quality of output is also highly dependent on how well the compiler handles these transformations, which may or may not have the same performance characteristics of the C array implementations.

Before utilizing this approach you should identify where the pros outweigh the cons. The technique would probably not meet the level of control required for low level engine code, as it doesn’t have mechanism to apply optimization techniques such as prefetching. It may meet the needs of gameplay code as it provides a simplified interface that compiles to ideal code.

Expression templates are by no means a magic bullet, but just another tool in the toolbox that can be wielded when the need arises.

Up Next

By utilizing expression templates we’ve managed to create a simplified client interface that results in optimal code generation. This allows us to do serious number crunching as quickly as possible. There is an additional avenue for increasing the performance of our valarray; adding SIMD support.

In the next article we’ll add SIMD support to our valarray, using not one, not two, but three SIMD architectures. Implementations will be provided using intrinsics for SSE, NEON, and AVX. Finally we’ll run some performance comparisons between our implementations, and examine the output generated by the compiler.

The code for this and further installments is available in a Github repository.

Also posted to my personal blog.