A brief history of code temperature.

Since the dawn of the computing age, processor performance has been
dominated by memory speed, a phenomenon known as “the memory wall”.
No matter how many ALUs you throw onto the silicon, applications will
always be limited by the speed of memory access.

To solve this problem, caches were introduced, and to an extent they
do reduce the pressure on memory bandwidth. But they come at a
considerable price, not only in terms of the silicon real estate they
occupy, but primarily in terms of additional latency.

The result is that benchmark performance improves, but real code
performance takes a big hit. I’d like to explain why this is and what
we can do to mitigate the effects.

What is code temperature?

In compiler terms, code is split into “hot” code, which is frequently
executed, and “cold” code, which is executed at most once a frame.

On the whole, caches do not help game loops: by the time the loop is
re-run the caches have been flushed, and all code ends up cold.

Cache-locking and cache affinity can help, but require O/S assistance
that is not generally available.

The anatomy of a game loop.

We all know the kinds of computational tasks our games carry out in
the game loop. A great deal of emphasis is always placed on the hot
loops that perform matrix stack calculations or physics simulation,
but very little is understood about the cold code that makes up the
bulk of the code executed each frame.

In fact, as compiler engineering is largely about statistics, we frequently
analyze the composition of game programs to work out where to put our
optimization work.

Developers always look closest at their hot code, as it is an easy
optimization target; in fact, most of the game code we see has had its
hot loops removed and transferred to coprocessors, a largely
mechanical task.

Hot code is easily identified from profile traces, and as it is
generally very small, it is easy to tweak its loops to improve
performance, for example by manually unrolling them or using GPGPU
coding.
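
As a minimal sketch (the function is hypothetical), manual unrolling
of a hot loop might look like this:

    // Process four elements per iteration to reduce loop overhead and
    // give the compiler more scheduling freedom.
    void scale(float* out, const float* in, float k, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {   // unrolled by four
            out[i + 0] = in[i + 0] * k;
            out[i + 1] = in[i + 1] * k;
            out[i + 2] = in[i + 2] * k;
            out[i + 3] = in[i + 3] * k;
        }
        for (; i < n; ++i)             // remainder loop
            out[i] = in[i] * k;
    }

Note that this is exactly the kind of transform that should be
confined to hot code: in cold code the extra size does more harm than
the saved loop overhead does good.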

This tends to leave the cold code dominating the performance of the
game loop, and to the developer this looks like uniform background
noise, as the “whack-a-mole” style of profile-fix-profile cycles
flattens out the hot functions.

Making cold code go faster

When we analyze the composition of game loops we find that the vast
majority of instructions are loads, address calculations, conditional
branches and calls/returns. Floating point and general integer
operations are actually very rare: vector and floating point
instructions account for less than 1% of a typical game.

Function prologues and epilogues are the single most identifiable
part of the code, and prologues often account for 10% or more of the
execution time, mostly because the branch to the function is
mis-predicted, usually as a result of a virtual function call.

Pass-by-reference is very expensive, as it takes extra instructions
to store values to memory and load them back again. On CPUs that have
no store-forwarding mechanism, this can seriously impact performance.
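
A small illustration (the types and functions are hypothetical): a
struct small enough to live in registers is forced through memory when
passed by reference, which is exactly where a missing store-forwarding
path hurts.

    struct Vec3 { float x, y, z; };

    // By reference: the caller stores the struct to memory and the
    // callee loads it back, a round-trip that stalls on CPUs without
    // store forwarding.
    float dot_ref(const Vec3& a, const Vec3& b) {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }

    // By value: on many ABIs a small struct like this can travel in
    // registers and never touch memory at all.
    float dot_val(Vec3 a, Vec3 b) {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }

Whether the by-value version actually stays in registers depends on
the calling convention, so it is worth checking the generated code on
your target.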

A good linker should provide an option to sort functions by call
order. This has the side effect of minimizing cross-page branches and
reducing i-cache pollution. Block reordering within cold functions can
also help, but the heuristics are different from the hot-code block
reordering common in GCC and other compilers.

Interrupt handlers and O/S activity are also a common cause of cold
code slowdown. If you catch your audio programmers using sleep(), you
should engage in a program of re-education, as any O/S call will be a
cache-killer.

On the whole we find the best way to improve cold code performance
is to reduce code size in all but the hottest parts of the code.

What your compiler can do for you.

The science of cold code optimization is still in its infancy, so the
contributions that current compiler technology can make are still very
limited.

You can write smaller code yourself, for instance, but this is very
difficult even for experienced programmers. Making game logic
data-driven is a good step in the right direction.
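
A sketch of what data-driven logic can look like (all the names and
values here are hypothetical): per-type behaviour moves out of branchy
code and into a table, so one small loop serves every entity type.

    #include <cstddef>

    struct EnemyParams { float speed; float health; int damage; };

    // Data, not code: designers can edit this table without anyone
    // touching the update loop.
    static const EnemyParams kParams[] = {
        { 1.0f, 100.0f, 10 },  // grunt
        { 2.5f,  40.0f,  5 },  // scout
        { 0.5f, 300.0f, 25 },  // tank
    };

    struct Enemy { int type; float x; };

    void update(Enemy* enemies, std::size_t count, float dt) {
        for (std::size_t i = 0; i < count; ++i)
            enemies[i].x += kParams[enemies[i].type].speed * dt;
    }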

On Intel and ARM platforms there is a limited set of code-size
reducing optimizations associated with the -Os (/Os) option groups.

If you are building on Windows you may find that your whole project
runs faster at -Os, even the hot loops.

The Thumb-2 instruction set has been a great boost to lowering code
size on mobile devices. Many instructions are 16-bit, but most of the
16-bit encodings can only address the low registers (r0–r7), so if the
compiler has good register allocation algorithms (and I don’t mean
Chaitin-Briggs!) it can maximize the number of 16-bit instructions in
a function.

Register allocation is the focus of my current commercial work, with
an estimated 10% performance gain to be had in some real-code cases.

Scripting

Ironically, scripted code can often run as fast as C++ code when the
code is cold. This is why Java JITs often do not compile functions
until they have been proven to be hot.

A good bytecode can be more compact than the machine code
of the processor, and scripting engines can pull many tricks
to batch up opcodes.
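
As a toy sketch of both ideas (the opcodes are hypothetical): each
opcode below is a single byte, far denser than the equivalent machine
code, and the fused ADD_MUL opcode batches a common pair of operations
into one dispatch.

    #include <cstdint>

    enum Op : std::uint8_t { PUSH, ADD, MUL, ADD_MUL, HALT };

    float run(const std::uint8_t* code, const float* consts) {
        float stack[64];
        int sp = 0;
        for (;;) {
            switch (*code++) {
            case PUSH: stack[sp++] = consts[*code++]; break;
            case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
            case ADD_MUL:  // fused: (a + b) * c in a single dispatch
                sp -= 2;
                stack[sp - 1] = (stack[sp - 1] + stack[sp]) * stack[sp + 1];
                break;
            case HALT: return stack[sp - 1];
            }
        }
    }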

This suits the general trend of migrating games towards dynamic
languages.

Dynamic languages are also easier to execute in parallel, and their
data layout can be changed to improve performance. A large part of
current compiler optimization effort is going into dynamic languages
such as Lua and Python.

A taxonomy of optimizations

We can represent compiler optimisations on a sliding scale:

[generalizing------------------------------specializing]

Specialization is the more familiar of the two transforms. Once code
is inlined, many opportunities arise to specialize it based on
constants and invariant behaviour. Almost all optimizations found in
textbooks are of this form.
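
For instance (a hypothetical helper): once the call below is inlined
with constant arguments, the multiply and the branch both fold away at
compile time.

    inline int scale(int x, int factor, bool clamp) {
        int r = x * factor;
        if (clamp && r > 255) r = 255;
        return r;
    }

    int double_it(int x) {
        // Inlining specializes this to just "return x * 2;" and the
        // clamp branch disappears entirely.
        return scale(x, 2, false);
    }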

Generalization is less well understood. Its primary purpose is to
reduce code size, perhaps by introducing function calls that replace
repeated sections of code, or loops that replace stores to consecutive
locations.
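
A minimal example of the loops-for-stores case (hypothetical
functions): the straight-line stores in the first version roll back up
into a loop, trading a little speed for much smaller code.

    void clear_flags_unrolled(int* f) {
        f[0] = 0; f[1] = 0; f[2] = 0; f[3] = 0;
        f[4] = 0; f[5] = 0; f[6] = 0; f[7] = 0;
    }

    void clear_flags_rolled(int* f) {
        for (int i = 0; i < 8; ++i)   // same effect, a fraction of the size
            f[i] = 0;
    }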

The vast bulk of game code consists of parametrized function calls.
Often the parameters are the same for many or all call sites, which
gives us a great specialization opportunity that will also reduce code
size.
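
A sketch of that opportunity (all names hypothetical): if nearly every
call site passes the same trailing arguments, a baked-in wrapper
removes the argument set-up from each call site and hands the compiler
constants to specialize on.

    #include <cstdio>

    void draw(int mesh, int lod, bool shadows, bool wireframe) {
        std::printf("draw %d lod=%d shadows=%d wire=%d\n",
                    mesh, lod, shadows, wireframe);
    }

    // Almost every call site uses these values, so each call shrinks
    // to a single-argument call.
    inline void draw_default(int mesh) {
        draw(mesh, /*lod=*/0, /*shadows=*/true, /*wireframe=*/false);
    }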

Profile-guided optimization

Profile-guided optimization (PGO) gives us the opportunity to
discover which regions of code are hot and which are cold.

This usually means running several phases: instrument the build,
execute it on representative data, then recompile using the results.

Good compilers will provide PGO information that has good longevity
and will keep improving code performance through many edits to the
source. However, many of the implementations available are extremely
crude, and it is usually best to manually mark a few of your functions
as hot.
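
As a sketch, assuming a GCC- or Clang-compatible toolchain (the
function names are hypothetical), the hot and cold attributes let you
mark temperature by hand:

    // Optimized aggressively and kept with the other hot code.
    __attribute__((hot)) void update_particles(float dt) {
        (void)dt;  // per-frame simulation work would live here
    }

    // Optimized for size and moved out of the way of the hot path.
    __attribute__((cold)) void handle_rare_error(int code) {
        (void)code;  // almost never executes
    }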

For example, if the compiler unrolls all the loops, then
the hot code will run faster, but the cold code will run
much slower, negating any benefit.

It is a common error to use the -O3 settings on compilers, which are
targeted at benchmarks, rather than -O2 or -Os, which offer a better
compromise for real code.

LLVM

We hope that LLVM will provide a good framework for much of our
future research work. It will give us the global view of programs we
need to perfect whole-program generalization optimizations through
link-time code generation.

Because it is open source, it is easy to share results, unlike with
the proprietary compilers used by many console manufacturers.

Most of the work to date has focused on benchmark results rather than
real-code performance, so we look forward to applying the principles
we have learned from console compiler development to the open source
world.

LLVM is not the last word, however: its many-IRs approach to code
generation already looks dated, and its complexity makes it
unsuitable for dynamic languages.

Next time

Next time I’ll show some more specific examples of cold code
performance and how to measure it.

I am planning to give a talk on LLVM in games at the Game Connection
Europe conference in Paris in December, wearing my “Goldsmiths” hat,
and may see some of you there.

http://www.game-connection.com/gameconn/content/gameconnection-europe