Evolved compression in the x86 ISA

Instapaper Text

Evolved compression in the x86 ISA

Or, the post in which I say something nice about x86.

Everybody who deals directly with assembly loves to hate on x86. Including me. Compared to other CPU ISAs it’s clumsy, everything about the registers is obnoxious, the constant pushing and popping can make disassembly hard to follow, and don’t even get me started on little-endianness.

But there’s one thing that has always fascinated me about x86… and that’s the fact that code generated for it is often significantly smaller than the corresponding code for other CPUs of similar power.

Why is this so? Well, the x86 instruction set is, in a way, a sort of naturally evolved compression format.

Why does code size matter?

Before we start, you might reasonably ask: why do you even care about code size? Don’t we mostly care about performance?

Bigger isn't always better.

Well, code size can affect performance significantly. All else being equal, smaller code is sometimes preferable – it has less impact on the L1 cache, less time may be spent on instruction fetch, and so on.

Smaller code can also be important on game consoles – with a fixed amount of RAM and no virtual memory, every byte not spent on code is a byte you can use for something else. In studies I’ve seen on games, something as simple as compiling with -Os instead of -O3 generated code that was several megabytes smaller for some games.

Variable-length coding in compression

Okay, so first: one of the basic concepts of lossless data compression is Lempel-Ziv. Variable-length coding is at the heart of most popular compression formats including DEFLATE (zlib), LZMA, LZO, bzip2, JPEG, and MP3.

The basic idea behind variable-length coding in compression is that you want to use smaller representations for things that occur often, and larger representations for things that are uncommon.

That’s a key point. Keep that in mind as we move on…

Variable-width instructions in x86

As it happens, x86 instructions are also variable size. They can be anywhere from one to more than dozen bytes depending on the parameters and addressing modes.

Let’s take a little trip back in time. The x86 ISA got its start with the Intel 8086 in 1978. The 8086 supported 8-bit and 16-bit integer operations: math, memory access, compares, branches, and so on. All of the original 8086 opcodes are just one byte, with (in the typical case) 0-2 extra bytes for parameters.

The designers of x86 didn’t explicitly set out to make the ISA size-efficient: there are lots of useless or rarely used one-byte opcodes (such as HLT, used to stop the processor). Still, the fact that basic integer math and branching existed in the original 8086 ISA means that they sit in the prime slots in the opcode table.

Soon after the release of the 8086, floating-point instructions were added to the ISA for the 8087 coprocessor. The single-byte opcodes were mostly already used, so the FPU instructions were generally two-byte opcodes.

From then until now, the x86 ISA has been extended many times for 32 bit integers, 64 bit integers, SIMD vectors, and more. With all of those extensions, opcodes got even longer: reaching 3 bytes with AVX.

So you can see how the x86 opcode table evolved over time. The basic integer stuff dates back to the original 8086 and is generally quite small. Scalar floating-point is a bit larger, and vector and advanced algorithmic instructions are the largest.

Putting it all together: Instruction Mix

Here’s the last piece of it.

So, we know that the x86 ISA can represent some operations very efficiently (1 – 4 bytes), but other operations very inefficiently (5 – 15 bytes). Is it really a more efficient representation of code than other ISAs which used fixed-width instructions?

A big part of the answer depends on the instruction mix. By instruction mix, I mean the typical mix of different kinds of instructions in the system. What percentage are integer ops? What percentage are floating point? What percentage are vector?

If you’re a game programmer, you might be thinking “why, of course the majority of my code is floating point or vectors!”

But guess what: it’s not.

Even in the most heavily-optimized game code, quite a lot of it by weight will be function calls, boolean logic, memory loads and stores, lightweight integer math, string manipulation, and so on.

And here’s the key – these are all things which the x86 ISA can represent very efficiently in a small number of bytes!

The net result is that the x86 ISA, with its variable-width instructions and oft-extended opcode table, acts as a sort of naturally evolved data compression compared to fixed-width ISAs.

It’s clumsy and imperfect, as you might expect from the accidental result of evolution. Perhaps most ironically, a heavily-optimized inner loop that runs entirely vector operations will be the thing that suffers the most code-size bloat due to the ISA.

Head to Head FIGHT: x86 vs PowerPC

I’ll wrap up with a few representative code snippets comparing x86 and PowerPC on the same functions. (Why these two ISAs? Because they’re the high-end chips that we find in PCs and game consoles today, of course.)

Don't worry. These chickens are fighting over who's the most delicious.

To make the test as equal as possible, all were compiled with the same version of gcc, optimization set to -Os.

switch.c – A simple switch, lots of compares.
- switch.ppc.S = 200 bytes
- switch.x86.S = 134 bytes (33% smaller)
memcpy.c – A trivial implementation of libc memcpy.
- memcpy.ppc.S = 48 bytes
- memcpy.x86.S = 43 bytes (10% smaller)
memset.c – A trivial implementation of libc memset.
- memset.ppc.S = 32 bytes
- memset.x86.S = 32 bytes (same)
appendpath.c – String manipulation, including some function calls.
- appendpath.ppc.S = 244 bytes
- appendpath.x86.S = 154 bytes (37% smaller)
djb.c – Daniel J. Bernstein’s popular string hash
- djb.ppc.S = 44 bytes
- djb.x86.S = 39 bytes (11% smaller)
fsqrt.c – Paul Hsieh’s floating point square root estimate
- fsqrt.ppc.S = 128 bytes
- fsqrt.x86.S = 210 bytes (64% larger!)

Now, of course it gets more complicated than just these simple examples, but it’s quite instructive to see that the x86 code winds up being smaller or at worst equal in all of the integer-based examples.

Of course, more complex code like FPU and vector operations can wind up significantly larger, and that part kinda sucks. This pain can be lessened since you get the CPU to do more and more work with the complicated instructions, but you’re always going to be fighting against it a bit.

A parting thought

Wouldn’t it be cool to see a variable-width ISA where the opcode table is explicitly designed (rather than accidentally evolved) to yield the smallest code?

It doesn’t necessarily mean picking the most common instructions to be the smallest — although that’s part of it. But they might also give a bunch of the prime slots to common inner-loop operations like vector and FP math.

Of course, I don’t know if it would necessarily be worthwhile. Small code size definitely doesn’t imply performance; you might go through the whole exercise only to find that your ISA can’t be implemented efficiently. For example, ARM (and Thumb) can yield even smaller code sizes than the ones above, but these CPUs typically can’t be clocked nearly as high as x86 and PowerPC are.

But it’s a fun thought experiment!

#AltDevBlogADay

Drew Thaler