This is the third (and possibly final) #AltDevBlogADay post to be inspired by a single tweet from two months ago. I figured it was about time I wrapped this up and let you all have some closure (I know the tension was killing me). First, a quick recap of the series:
- How is constant formed? looks at the basics of how simple constant values are loaded into registers, and methods for learning a little about the instruction set architecture of a particular CPU by examining the output of a C compiler. Go and read it, I’ll wait.
- Put The Arbitrary Constant In The Register takes a look at how much constant can be loaded with a single instruction — it is possible to load an entire 32 bit register with an arbitrary constant using a single instruction on x86, but not with ARM, MIPS, PowerPC and SPU — their fixed, 32 bit instruction length makes it impossible. Don’t believe me? Check out the article for the proof. You’ll be amazed. Probably not, but read it anyway.
- This article, right here, where I try to complete the picture, looking at two main approaches used to set all the bits in a register. Read it…
Here’s one I prepared earlier
The first method is simple. Really simple.
- Store the constant somewhere in memory
- Load the constant into the register with some kind of Load Data From Memory Into A Register instruction
because even the most reduced RISC instruction set still has instructions for accessing memory, right? Isn’t that what defines a load-store architecture? It’s something like that, anyway.
(Oh, also: let’s assume that, between the programmer, compiler, assembler, linker and loader, a number we assign to a variable will actually be able to be found in memory when the program runs. Because it can. Trust me. Or listen to this talk where I don’t actually address that issue, but do cover related topics.)
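Those two steps can be sketched in C. This is only an illustration: the names `stored_constant` and `load_constant` are made up, and `volatile` is there to ask the compiler to really read the value from memory rather than fold it into the instruction stream. What actually gets emitted still depends on the target and the compiler.

```c
#include <stdint.h>

/* Step 1: the constant is stored somewhere in memory.
   The name is hypothetical; volatile keeps the compiler from
   replacing the read below with an immediate value. */
static const volatile uint32_t stored_constant = 0x499602D2u; /* 1234567890 */

/* Step 2: reading the object compiles to some kind of
   load-from-memory instruction on the target. */
uint32_t load_constant(void)
{
    return stored_constant;
}
```

Compiling something like this for each architecture and reading the assembly is one way to see which load instruction the compiler picks.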
Here’s something that isn’t surprising: All of these architectures have an instruction to load data from memory into a register.
And not just one instruction. There’s plenty of instructions to choose from. Some of them are even useful! To keep things simple, let’s look at the instructions that GCC generates when a program needs to load 32 bits of data from a known address in memory, because there’s a good chance that they will do the right thing:
- ARM has a couple of Load Register (ldr) instructions,
- MIPS has Load Word (lw),
- SPU has Load Quadword Instruction Relative (lqr) and
- PowerPC has Load Word and Zero (lwz).
Not one of these is precisely identical to any other [0]. Two of them are sufficient to solve the problem by themselves — the other two need extra help.
Look behind you
The SPU Load Quadword Instruction Relative instruction is the simplest of these. An address is calculated from the program counter (the address of the lqr instruction) and an offset. The data from this address is loaded into a specified register. The programmer doesn’t need to know what these addresses actually are — the linker performs the necessary magic to make it all just work (Hooray!).
ARM’s Load Register instruction works very similarly, allowing data to be loaded from a location relative to the program counter (or to some other register).
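As a toy model of PC-relative loading (not a description of any real instruction set), treat memory as an array of 32 bit words and the program counter as an index into it. The function name `load_pc_relative` and its parameters are invented for this sketch; real instructions add their own rules about alignment, offset range and exactly which value the program counter holds.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model: memory is an array of 32-bit words and pc is the index
   of the load instruction itself. The offset is baked into the
   instruction, so no other register has to hold an address. */
uint32_t load_pc_relative(const uint32_t *memory, size_t pc, ptrdiff_t offset)
{
    const uint32_t *target = memory + pc + offset; /* effective address */
    return *target;
}
```

With the constant stored two words before the instruction, `load_pc_relative(memory, 2, -2)` fetches it without the program ever needing to know an absolute address.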
You have to load the constant
MIPS and PowerPC don’t support the same kind of PC-relative addressing that can be done with SPU and ARM, so they need some help, but not much — the address of the constant needs to be loaded into a register before it can be used to load our 32 bit constant.
So all we need to do is to load the address of the constant.
Which is just a number. An arbitrary 32 bit value.
An arbitrary 32 bit constant value [1].
We need to load an arbitrary 32 bit constant value so we can load an arbitrary 32 bit constant value?
:\
If only there was another way…
Ingredients: ones and zeros. Transform until desired composition is achieved.
11:15, restate my assumptions:
- Mathematics is the language of nature.
Sorry, wrong list. Try again.
- There are instructions for loading constant values into registers
- There are instructions for changing — transforming — values that are already in registers.
Maybe, just maybe, there is some combination of these two things that will allow us to solve the problem. This requires thought. Analysis of instruction sets. Consideration.
Alternatively, let’s see what the compiler does.
For a little program like this:
int constant() { return 0x499602D2; /* 1234567890 */ }
GCC generates the following instruction sequences:
ARM:

    movw r0, #722
    movt r0, 18838

MIPS:

    li $2,1234567168
    ori $2,$2,722

PowerPC:

    lis 3,0x4996
    ori 3,3,722

SPU:

    ilhu $3,18838
    iohl $3,722

We see two instructions for each architecture. There’s an interesting mix of decimal and hexadecimal constants here (to keep you on your toes), combined in a couple of different ways.
MIPS, PowerPC and SPU have a similar combination of instructions: load the upper 16 bits with an immediate value (which zeros the lower 16 bits), and perform a logical OR with another immediate value. For this you can see the Load Immediate Shifted extended mnemonic for PowerPC (which is actually Add Immediate Shifted in disguise), the Immediate Load Halfword Upper instruction for SPU, and though it may look like MIPS somehow has support for 32 bit constants, what the assembler will generate for that li line is a Load Upper Immediate of 0x4996.
For the logical OR immediate, MIPS and PowerPC both have an Or Immediate instruction, and SPU has Immediate Or Halfword Lower (SPU registers are all 128 bits wide — when loading a 32 bit constant like this, the register ends up with four copies of the same 32 bits. Four times the bits at no extra cost!)
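Here is a small C model of that load-upper-then-OR pattern, using the halves of 0x499602D2 from the listings above. The function names `load_upper`, `or_immediate` and `build_constant` are mine, not mnemonics, and the model only covers a single 32 bit value (it ignores the SPU’s extra copies).

```c
#include <stdint.h>

/* Load a 16-bit immediate into the upper half of a 32-bit value;
   the lower half ends up as zero. */
uint32_t load_upper(uint16_t imm)
{
    return (uint32_t)imm << 16;
}

/* OR a zero-extended 16-bit immediate into the value. */
uint32_t or_immediate(uint32_t reg, uint16_t imm)
{
    return reg | imm;
}

uint32_t build_constant(void)
{
    uint32_t reg = load_upper(0x4996); /* 18838 << 16 == 1234567168 */
    return or_immediate(reg, 0x02D2);  /* | 722 gives 0x499602D2 */
}
```

This also explains the odd-looking 1234567168 in the MIPS listing: it is just 0x4996 shifted up by 16 bits, written in decimal.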
For ARM, the movw will Move the 16 bit immediate value into the lower 16 bits of the register zeroing the top 16 bits, and movt (being Move Top) will move the 16 bit immediate value into the top 16 bits without modifying the lower 16 bits.
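The same kind of sketch works for the ARM pair. The C functions below borrow the names `movw` and `movt` purely for readability; they model only the effect on the register value described above.

```c
#include <stdint.h>

/* movw: put the immediate in the low 16 bits, zero the top 16 bits. */
uint32_t movw(uint16_t imm)
{
    return imm;
}

/* movt: put the immediate in the top 16 bits, keep the low 16 bits. */
uint32_t movt(uint32_t reg, uint16_t imm)
{
    return (reg & 0x0000FFFFu) | ((uint32_t)imm << 16);
}
```

So `movt(movw(722), 18838)` gives 0x499602D2. The order matters: because movw clears the top half, it has to come first.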
In conclusion: we loaded that thing real good.
But why?
So if we can construct constants with two instructions, why would we load them? Particularly on an architecture like MIPS or PowerPC, where we have to construct a constant (address) to load a constant?
The answer is that you wouldn’t. Unless you had to. Or it was better in some other way.
For example, for older versions of the ARM architecture, GCC will generate a load from memory. When targeting a newer variant (e.g. Cortex-A8), GCC will construct the constant with the movw + movt combination shown above. Assuming the compiler knows what it’s doing, the best choice depends on the hardware.
And what if we try loading constants that aren’t 32 bit integers? How about 64 bit values? Single and double precision floating point values? Loads into vector registers? What then?
Rather than more code, here’s a summary of what GCC generates for some different types of data for each architecture:
Arch | 32 bit int | 64 bit int | 32 bit float | 64 bit float | 128 bit vector |
---|---|---|---|---|---|
ARM (default) | load: 1 insn | load: 2 insns | load: 1 insn | load: 2 insns | load: 2 insns |
ARM (Cortex-A8) | construct: 2 insns | construct: 4 insns | load: 1 insn | load: 1 insn | load: 2 insns |
MIPS | construct: 2 insns | construct: 6 insns | load: 2 insns | load: 2 insns | n/a |
PowerPC | construct: 2 insns | construct: 5 insns | load: 2 insns | load: 2 insns | load: 3 insns |
SPU | construct: 2 insns | load: 1 insn | construct: 2 insns | load: 1 insn | load: 1 insn |
x86_64 | construct: 1 insn | construct: 1 insn | load: 1 insn | load: 1 insn | load: 1 insn |
(Details on what I compiled and how are in this small zip file)
There’s a few things that stand out to me:
Construction can be cheaper than buying prefab if you’re a long way from the warehouse
Six instructions to construct a 64 bit constant for MIPS?? That’s 6×32 bits = 24 bytes of code to generate 8 bytes of constant. Even the five instructions for PowerPC and the four for ARM seem over the top.
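To see where a count like six can come from, here is one plausible sequence modelled in C. It is an assumption for illustration and not necessarily what GCC emits: each statement that changes `reg` stands in for one fixed-width instruction carrying at most a 16 bit immediate. The names `build64` and `INSN_BYTES` are made up.

```c
#include <stdint.h>

#define INSN_BYTES 4 /* fixed 32-bit instruction length */

/* Hypothetical sequence: load upper, OR, then (shift, OR) twice.
   part[0] is the most significant 16 bits, part[3] the least. */
uint64_t build64(const uint16_t part[4], unsigned *insns)
{
    uint64_t reg;
    unsigned n = 0;

    reg = (uint64_t)part[0] << 16; n++; /* load upper immediate */
    reg |= part[1];                n++; /* OR immediate */
    reg <<= 16;                    n++; /* shift left by 16 */
    reg |= part[2];                n++; /* OR immediate */
    reg <<= 16;                    n++; /* shift left by 16 */
    reg |= part[3];                n++; /* OR immediate */

    *insns = n; /* 6 steps, so 6 * INSN_BYTES = 24 bytes of code */
    return reg;
}
```

Four 16 bit immediates are the minimum for an arbitrary 64 bit value, and in this sketch the shifts account for the rest.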
I blame memory latency. A cache miss when loading data can cost a lot of time: around 400 cycles of navel gazing on Cell. A few extra instructions are no problem, especially when they were probably already in the instruction cache.
For ARM, the change from loading to generation between the default GCC output (for “.cpu arm10tdmi” — whatever that is) and Cortex-A8 output reflects changes in the design of the processor and memory speeds. I suspect.
Context is everything
For Cortex-A8, MIPS, PowerPC and x86_64, ints are constructed but floats and vectors are loaded. These architectures use different register sets for integers, floats and vector values, and have a different set of instructions for these registers.
Constructing values (particularly float values) directly into these registers is not as easy to do — it’s definitely the case that GCC didn’t know of a way to efficiently construct the constants I chose, and copying between different types of registers often requires data be copied to memory (L1 cache, probably) and back.
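A float constant is still just 32 bits, so in principle it could be built as an integer and then moved across. In portable C that move is written as a copy through memory, which loosely mirrors the round trip described above. The helper name `bits_to_float` is hypothetical, and the sketch assumes `float` is a 32 bit IEEE 754 single, which C itself does not guarantee. A compiler may still turn the copy into a direct register move where the target has one.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret 32 integer bits as a float by copying through memory.
   Assumes float is a 32-bit IEEE 754 single-precision value. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f); /* integer side to float side */
    return f;
}
```

Under that assumption, `bits_to_float(0x3F800000u)` is 1.0f: an integer register holding 0x3F800000 has the right bits, just in the wrong kind of register.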
You can also see that PowerPC requires one more instruction to load a vector than to load a float: the instruction set for vectors lacks a load instruction that permits an immediate offset.
For some constants, the compiler can be smarter (I’ve seen GCC do the occasional trick for SPU that surprised me). When the compiler can’t (or won’t), there’s things like this: @Sheredom’s original question.
When all you’ve got is nails, you don’t need a screwdriver
For SPU, a register is a register — int or float it can be constructed just as easily. Fetching from local store takes only six cycles, so loading makes sense for every other case.
The end is here
Constants. I’ve written about them. Hopefully, you have questions — grab your compiler and check my working :)
Leave your questions, comments, discoveries and/or corrections below — and let me know what you think of this post (and the whole series).
[0] That’s not to say that none of these architectures have identical load instructions, just that these particular ones are all slightly different.
[1] constantish.
[Photos by Karen Adamczewski]