This President’s day I’m happy to ford ahead and write this second article in a three part series. Generally I’m aiming to illustrate some of the times I made code less readable as a result of optimization. Last time I posted a case where we changed a run-time conditional into a templatized (thus compile-time) test to allow the compiler to remove it painlessly. Now I’m going to fillmore details in and talk about a more extreme case. As in my last post, this example comes from code running on the SPU.

PathNodes

A previous game I worked on had a lot of entities which fall under the description of PathFollowers. These are entities which basically follow a fixed path at a predictable (but not constant) speed. Although there are many varieties of path followers, their core functionality is provided by an element updated on the SPU called a PathNode.

While the PathFollower is a complex C++ virtual class, a PathNode is a simple (144 byte) struct. They are updated en masse and their owning C++ object is responsible for applying the transformed data at the appropriate point in the update loop. Once a frame, the PathNodes advance themselves along a piecewise linear path. Their behavior is mostly parametric – the distance moved, the path smoothing applied, and rotation about the roll axis. However, there are a few boolean parameters on each instance which can alter their update in a discontinuous fashion.

1
 
  2
 
  
  kPathNodeLoop = 0x04,     // if not set, movement will stop at end-of-path
 
    kPathNodeSegments = 0x10, // speed, accel, etc are measured in segments/s, not m/s

Looping PathNodes at the end of their path cycle will loop (assuming the path is closed). PathNodes can also advance at a multiple of segments a second (which are of non-uniform length) as opposed to meters a second. Both of these flags are tested in many of the functions used during a PathNode instance update.

Adapting this system to the SPU happened in the hayes of late project development. I wasn’t optimistic about replacing the entire system, I’d have to make due with simply porting it over and optimizing the largest existing code.

Code

Although PathNodes could theoretically transition from looping to non-looping or from travel-by-segment to non-travel-by-segment, the lion’s share of nodes were added when a path was created and remain in the same state until they’re destroyed. I had to grant nodes the ability to change state, but could optimize assuming it would not happen.

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  
  // 0: no meter, no loop
 
    // 1: no meter, loop
 
    // 2: meter, no loop
 
    // 3: meter, loop
 
    #define NOMETER_NOLOOP     0
 
    #define NOMETER_LOOP       1
 
    #define METER_NOLOOP       2
 
    #define METER_LOOP         3

Like the problem I discussed article, I’d be building separate versions of a routine, each taylored for a combination of the potential dynamic settings. The PathNode system partitioned the set of updating PathNodes into one of four groups. More specifically, it used a simple counting sort to populate four sets of indices into the complete set. This time I’d also be changing the instance virtual function update to one aggregated routine per partitioned group.

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  13
 
  
  // body of pathnode_meter_loop.cpp, an SPU program
 
    #define PATH_METER_TOGGLE 1
 
    #define PATH_LOOP_TOGGLE  1
 
   
 
    #include "code_fragment_path.inl"
 
   
 
    extern "C" void async_pathnode_meter_loop(global_funcs_t* gt, update_set_info_t* info, 
 
                                              common_blocks_t* common_blocks,
 
                                              instance_streams_t* instance_streams, 
 
                                              u8* work_buffer, u32 buf_size, u32 tags[4])
 
    #include "pathnode_raw.inc"
 
    #undef PATH_METER_TOGGLE
 
    #undef PATH_LOOP_TOGGLE

For my meta-programming fix this time, I’d relied on C preprocessor defines instead of template parameters. Many functions didn’t need to be changed to account for static branches. Those that did, suffered from a simple if tedious set of transformation. For example, C++ if statements with relevant expressions were changed to always true or false depending on the high-level set of flags expressed during compilation.

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  7
 
  8
 
  9
 
  10
 
  11
 
  12
 
  13
 
  14
 
  15
 
  16
 
  17
 
  18
 
  19
 
  20
 
  21
 
  22
 
  23
 
  
  f32 GetEndT(Path* path)
 
    {
 
      f32 end_t = (f32)(i64)path->m_point_count;
 
      if (m_flags & kPathNodeLoop)
 
          end_t -= 1.0f;
 
   
 
      return end_t;
 
    }
 
   
 
    #if PATH_LOOP_TOGGLE
 
    f32 GetEndT_LOOP(Path* path)
 
    #else
 
    f32 GetEndT_NOLOOP(Path* path)
 
    #endif
 
    {
 
      f32 end_t = (f32)(i64)path->m_point_count;
 
   
 
    #if (PATH_LOOP_TOGGLE == 0)
 
      end_t -= 1.0f;
 
    #endif
 
   
 
      return end_t;
 
    }

Affected function names were changed as well to help ensure the right version was being compiled/called under the appropriate circumstances (some preprocessor redirection was used to help prevent invoking the wrong function through inconsistent C preprocessor #defines and file inclusion). There was also an explicit typed constant to avoid having to completely rewrite compound/complicated expressions. This didn’t totally pierce the readability issue but did help ameliorate some of the problems that arose from introducing C preprocessor metaprogramming.

1
 
  2
 
  3
 
  4
 
  5
 
  6
 
  
  #if PATH_LOOP_TOGGLE
 
      const bool LOOP_CONSTANT = true;
 
   
 
      #define PATH_GetEndT(p) PATH::GetEndT_LOOP(p)
 
      // ...
 
    #endif

Conclusion

The changes I’d made had an unexpected shelf life. In the process of porting it, I’d made the system functionally the same on PPU as on the SPU. The partitioning of instances, the aggregate update, the removed branches were the same on the PPU as SPU. After the project in which it was initially used, the SPU version fell into disuse and eventually the PPU version was used exclusively. This reflects a couple of things. First, the total number of path followers for subsequent projects has decreased from what in retrospect appears to be a high water mark. Second, the steps of prepping a system to run on the SPU also contribute to performance improvements on other platforms. (There are a third and forth which I’m omitting unless polked).

Compile time polymorphism

Next time, I’ll talk about how my metaprogramming mania reaches a fever with a scheme to break out virtual dispatch to the compile time. Also, I’ll be lincoln the points from all the articles and talk about the larger lessons.