If you somehow end up enjoying any part of this article, please consider following me on Twitter. I’m currently locked in mortal combat with Q-Games’s Ariel to be the first one to 1000 followers and my pride couldn’t handle her winning

  • Follow on Twitter

“I find both you and your project worthless” – guy on twitter (paraphrased)


I have always loved acronyms.  Ever since writing an elf relocatable instrumenting profiler called Non Intrusive Profiling Platform for Loadable Elfs, I have strived to find “clever” acronyms for all my projects.  Even my job title is an acronym! Debugging and Optimizing Unified Cache Hierarchies Engineer, or at least thats what I assume it stands for when my coworkers call me the company douche.  Its is in this spirit that I am pleased to present TREMBLE.


Thats me, the company D.O.U.C.H.E

Thats me, the company D.O.U.C.H.E



TREMBLE, or Trivially Reordered Evaluation Multicore Bi-pipeline Loop Engine, is a tool for scheduling, optimizing, and software pipelining loops written in an SPU assembly-like language.  Actually, it /is/ SPU assembly with some extra directives to give various hints to the scheduler.  Its somewhat similar to another tool only available to licensed PS3 developers but is public and developed using 100% publicly available algorithms and information.  Once the code gets cleaned up a bit, and becomes a little more feature complete, I’ll throw it up on git hub (whatever that is) and hopefully it will become something like the Jaymin Kessler center for kids who can’t pipeline good :)




I am not a compiler writer, nor an assembler writer.  Sure I did one back in college but nothing that interesting.  As a result, I’m sure the Codeplay guys will look at my tool and be horrified at how I missed obvious data structures and how my “optimizations” break everything but the two cases I tested.


Next, let it be known that I am doing this for fun and for my own education.  Yes I can optimize my own code, but writing a program to replace myself that works with programs I didn’t write and have never seen seems like an incredibly interesting challenge.  As such, I’m sure some of the things I say in this article may be a little off or will betray my status as an optimizing assembler writing n00b. But hey, you never know what interesting problems come up until you try!  Anyway, the point is that even though initial results are quite promising, I am only interested in having fun / experimenting / learning and don’t care much about beating tool X or compiler Y.


Finally, I wrote this tool in C#, a language I spend much of my time bashing for its atrocious performance.  So why would I do something so out of character? Well, I noticed that I do a lot of my regex processing in PERL already which isn’t exactly a language I’d do a game engine in, so why discriminate against C# (aside form it being a M$ language)?  Besides, I don’t care if the codegen is crap. I only care that the generated code’s generated code is fast, if you follow my meaning.


Before getting started, familiarize yourself with some stuff


Because the reading is for nerds, I made some youtube videos.  The first is an introduction to what software pipelining is and why we may want to use it, the second is a boring discussion of algorithms that few people but me care about, and the last one is a lively animation that demonstrates the scheduling I am currently doing.  They can be found here:




You don’t have to watch them in 1080p, but I suggest at least 720p!


Cheat sheets are a great way to pass tests


I know things about the SPU ISA.  That doesn’t mean I want to manually enter all that stuff in.  In a very rare display of laziness, I decided to begin with a txt file version of the Insomniac SPU cheat sheet that I had laying around.  From there, its trivial to read in the instruction mnemonics, registers, descriptions, pipeline, latencies, and functional units.  I did have to make a few changes, though.  My program parser assumes that rt means destination register and ra is an input operand.  However, instructions like branches and stores use rt despite not actually writing values to that register.  So I went through and changed some instructions to use ra instead


stqd  ra, s14(ra) store quadword (d-form)  odd   6

I am also storing off other info about if an instruction is a branch, or a store.  The only other modification was the ability to add comments.  This allowed me to not only document my changes, but also comment out instructions that aren’t allowed.


Parsing and building program info


I will now try to explain my crapfest of branches and for loops in the simplest way possible.  I have a 128 element array (one per register) of the following struct



struct RegOwner
      int inst_num;
      int times_used;


Every time an instruction writes to a register, it becomes that register’s owner.  Every time another instruction uses the result, times_used is incremented.


We start by looping over all instructions, and then each input operand for the current instruction.  Going through each non-immediate input operand, we look at the register number and look it up in the RegOwner table.  If no one owns the register, either its an input coming into the loop or its written to / updated by a later instruction to be used when we loop around again.  If the register does have an owner, we do the following (assuming the nth input operand):


  1. increment reg_owners[reg_num].times_used
  2. set current instruction register n dependency to the reg_owners[reg_num].inst_num
  3. add the current instruction to original_program[reg_owners[reg_num].inst_num].reg_info[0].dependencies


Its actually a bit more involved than that and I am storing more data in the register info struct, but you get the idea.  We have some kind of weird messed up dependency graph where each instruction’s output operand has a list of things that are dependents and each instruction’s input operands have exactly one instruction as a dependency.  Below is some of the insanely redundant info I am storing off





public class DepPair
      public int m_inst_num = -1;
      public int m_oper_num = -1;
  public class RegisterDependencyNode
      public List<m_dependents> = new List();    // if output, things that depend on our result (can be many)
      public DepPair m_dependency;  // if input, things we depend on (can only be one)
      public Program.OperandInfo.OperandType m_operand_type;
      public int m_allocated_reg = -1;
      public int m_immediate_arg = -1;
      public bool m_is_locked = false;
  public class InstructionNode
      public string m_opcode;
      public Program.Pipe m_pipeline;
      public int m_num_cycles;
      public bool m_include_in_schedule = true;
      public List<m_reg_info> = new List();  // all regs


Once the input operands are done, we do the output operand (if any).  This is as simple as saying the current instruction is the register’s new owner by doing


reg_owners[reg_num].inst_num = current_line;

reg_owners[reg_num].times_used = 0;


Lots of other little things are done in the main loop, but the only one really worth mentioning is tracking labels.  While technically I only allow branching at the end of the loop to the start of the loop, and loop counters that are incremented/decremented by immediates, I decided to go ahead anyway and associate labels with lines of code, of course remembering to adjust for comment lines!


The “parser” itself is done with regexes.  I’m using it to collect operands, opcodes, and comments, and it turned out to be much easier than writing a real parser!


Scheduling pass


For now, I am doing this:


I schedule the loop branch in the last slot of the odd pipe, and the first instruction in the first slot (first iteration) of whichever pipe it belongs to.  Now I loop over the rest of the instructions to schedule.  For each instruction, I look at its dependencies, where they are scheduled, and when they finish.  The last of those to finish is the first slot I am allowed to schedule the next instruction in.  If its not empty, I keep moving down the schedule looking for a free slot, and wrapping around and continuing when I fall off the edge of the schedule (the schedule length is the minimum initiation interval, or max(odd_instruction_count, even_instruction_count)).  Really, there isn’t much I can say that is more clear than the animation in the video.


Assigning registers


For obvious reasons, I can’t use all the same registers used in the original program.  Anti-dependencies (for example) make it dangerous when instructions start getting rescheduled or wrapping around the schedule.  My initial process of assigning registers was so ghetto that its almost embarrassing.  It basically “worked” like this.  In the original loop, input operand registers that have no dependencies and output operands than have no dependents are untouchable.  This is because they are either inputs or outputs to the loop and should be left alone.  All other registers are fair game.  So we start out with a range of usable registers and remove ones on the no-touch list.  Then for each instruction, if the output operand is not in the do-not-touch list, we assign it a register from the pool available to instruction in its iteration.  Then for every register that depends on the output operand, we change its register to the one we just assigned.  Thats it.  No SSA, no fancy anything.  It was something quick and dirty to get me up and running.  It was also horribly broken.


The concept of no-touch registers still seems semi-valid, but it doesn’t fix the antidependency problem.  Because I am doing register allocation after scheduling, I ended up having to try and insert register copies in some cases to make sure the registers read in iteration n+1 are different than the ones written in n.  I only have to do this for some cases, though.  If a result has no dependents in later iterations, or the dependents are used in instruction slots earlier than where we write the value, then we don’t need to copy (I think…).   Imagine this hypothetical situation.  We have the following program


a) add r32, r31, r31

b) add r33, r32, 4

c) add r34, r32, 4

which then gets scheduled as:

0) add r33, r32, 4 (instruction b, iteration n + 1)  ; uses r32 from previous iteration

1) add r32, r31, r31  (instruction a, iteration n)     ; writes a new value in r32

2) add r34, r32, 4 (instruction c, iteration n + 1)  ;  error: _should_ use r32 from previous iteration


In general, to be safe at the end of iteration n we have to copy the value in r32 to another reg, however notice that its only a problem if we have an instruction that reads r32 AFTER schedule slot 1 (aka instruction c).  If c didn’t exist, we wouldn’t need to copy


Contrived example: Lets say you have the following program


      shufb   $0,$0,$0,$0
      shufb   $1,$1,$1,$1
      shufb   $2,$2,$2,$2
      shufb   $3,$3,$3,$3
      a       $5,$0,$0
      lqd	    $4,0($4)
      lqd	    $8,0($8)
      a       $7,$5,$8
      brz $4, .L4


The minimum initiation interval is 7, so the second add won’t be scheduled until the second iteration.  Actually, it will try to schedule it in the schedule slot where the first add already is, and so it gets placed in the next even slot after the first add.  Long story short, we have a second iteration instruction later in the schedule than the first iteration instruction it depends on, so a copy is required to fix the antidependency and stop the register from living too long.  Here is the TREMBLE output


scheduling logic:
  scheduling shufb in slot 0 iteration 0
  scheduling shufb in slot 1 iteration 0
  scheduling shufb in slot 2 iteration 0
  scheduling shufb in slot 3 iteration 0
  scheduling a in slot 4 iteration 0
  scheduling lqd in slot 4 iteration 0
  scheduling lqd in slot 5 iteration 0
  scheduling a in slot 5 iteration 1
  register move required, inserting slot 6 pipe EVEN
  best schedule found: 7 cycles with 2 iterations in flight
  shufb r0 r0 r0 r0
  shufb r1 r1 r1 r1
  shufb r2 r2 r2 r2
  shufb r3 r3 r3 r3
  a r16 r0 r0
  lqd r4 0(r4)
  lqd r8 0(r8)
  ori r18 r16 0
  0) nop (iter 0)
  0) shufb r0 r0 r0 r0 (iter 0 )
  1) nop (iter 0)
  1) shufb r1 r1 r1 r1 (iter 0 )
  2) nop (iter 0)
  2) shufb r2 r2 r2 r2 (iter 0 )
  3) nop (iter 0)
  3) shufb r3 r3 r3 r3 (iter 0 )
  4) a r16 r0 r0 (iter 0)
  4) lqd r4 0(r4) (iter 0 )
  5) a r7 r18 r8 (iter 1)
  5) lqd r8 0(r8) (iter 0 )
  6) ori r18 r16 0 (iter 0) live too long: reg copy
  6) brz r4 0 (iter 0 )




The prologue is basically stepping through each instruction in the new schedule until you execute the last instruction whose result is needed for the instructions scheduled in the last iteration.  Its just that simple.


Removing unused instructions


This is surprisingly easy because I have an array of structs mapping register numbers to the instructions that own them.  Lets look at the following example

0: add r32, r31, r31

1: add r33, r32, r32     ; r33 owned by instruction 1, usage count = 0

2: add r33, r100, r101 ; previous r33 owner has a usage count of 0


Look at instruction 2.  It wants to write to r33 so it checks the previous owner’s usage count.  Instruction 1 wrote to r33 but that value was never used (count = 0).  Therefore we can remove instruction 1.  Going further, we can look at instruction 1’s input operands and remove them from the list of r32’s dependents.  After removing instruction 1, instruction 0 has no dependents anymore and can also be removed.


The reason I am not just removing instructions with zero dependents is because an instruction that seems to have no dependents can also be either a loop output (needed after the loop) or something needed at the beginning of the next loop iteration.


Contrived example:


     a $32, $32, $32
     a $31, $31, $31
     a $34, $31, $32 ; removable because the only instruction depending on r34 is removable
     a $33, $31, $32 ; removable because the only instruction depending on r33 is removable
     a $32, $33, $34 ; unused instruction setting off the removal chain
     a $32, $35, $36
     brz $127, again


TREMBLE output:


unused instruction found on line 4: a r32 r33 r34
  unused instruction found on line 3: a r33 r31 r32
  unused instruction found on line 2: a r34 r31 r32
  unused instruction found on line 1: a r31 r31 r31
  unused instruction found on line 0: a r32 r32 r32
  scheduling logic:
  scheduling a in slot 0 iteration 0
  best schedule found: 1 cycles with 1 iterations in flight
  0) a r32 r35 r36 (iter 0)
  0) brz r127 0 (iter 0 )


Limitations and future work


Here is a list of stuff I plan to do / fix/ make suck less


  1. I am really interested in automatically unrolling loops and am experimenting with this pretty heavily at the moment.  Supposedly some types of module scheduling can benefit greatly from pre-unrolling loops.  However, it seems to be tricky in some cases determining what to unroll and how to unroll with complex logic updating the registers that serve as base addresses for stores.  If there is a real “official” way to do it, please don’t tell me.  I want to see what I can come up with on my own.
  2. When scheduling instructions, I am currently looking at them in order.  Evidence suggests there is some benefit to pre-ordering them by the number of dependents, with registers having many dependents scheduled first.
  3. Register allocation isn’t as efficient as it could be. For example if you write a value to r32 in iteration 1, slot 0 and that value only lives until iteration 1 slot 2, technically we should be able to reuse that register for calculations that don’t carry over to the next iteration.  Also, I am assigning registers in a range with no preference for volatile regs and no spilling non-volatile regs.  That is straight up broken and needs to be fixed.  Speaking of registers…
  4. My fix for registers living too long can be a bit more clever.  Maybe if I take registers into account when scheduling…
  5. I would like to implement some other more interesting scheduling algorithms, and I need to back up and unschedule instructions when instruction scheduling fails.  Actually, at the moment scheduling never fails which is pretty horrific
  6. I’d like to do pipeline balancing in the case where one pipeline is used much more heavily than the other
  7. Lots of other optimizations I’d like to try
  8. I am generating prologues but not epilogues (yet).  Also, my logic to try and collapse epilogues and prologues is a little… incomplete.  s/incomplete/nonexistent
  9. The only branches that are allowed are branches back to the beginning of the loop, and loop counters have to be updated by adding/subtracting an immediate for unrolling to work
  10. DMA: I am not even thinking about this right now
  11. Massive redundancy, inefficiency, and redundancy.  Clearly my mood and direction changed every day I worked on this.  Needs cleaning up…
  12. It might be cool to eventually make some kind of iPad game or educational tool out of this.  The game could be things like “beat this schedule” and the educational tool would be more interactive and visually show you how the code was scheduled.  Maybe even serve as a reference where you could mouse over instructions to get info on them, or search for instructions in drop down menus when entering in programs interactively