Instapaper Text

Windows x64 ABI: Stack Frames

In the previous post, x64 ABI: Intro to the Windows x64 Calling Convention, I explained the rules of the Windows x64 calling convention. Now we’ll take a look at what makes up a stack frame.

Disclaimer

All of the disassembly I’ll show you is from Visual Studio 2012 Express. Comments I make about MSVC are referring to that version as well. If you’re following along at home with 2010, you won’t see any differences with the compilation of my example code. I’ve seen 2012 store non-volatile registers in function prologues and take advantages of tail calls more often, but these are simple examples and result in the same disassembly so long as you use the same compiler options.

Stack Frame Anatomy

This is what the common Windows x64 stack frame looks like without optimizations. The stack grows upwards, and the numbers in the block are the sizes that it can be:

The space between the dotted lines is the stack frame, the memory a function owns on the stack. Each function is responsible for creating and destroying its own stack frame. They are also responsible for the memory used by parameters they pass to other functions, so parameters are part of the calling function’s stack frame. This means a function will have to access the memory below its own stack frame to get at its home space and parameters numbers 5 and up (read the last post for a refresher on what the home space is all about).

The Stack Pointer

The stack pointer, RSP, doesn’t move around very much. It will only change when functions that dynamically allocate stack space are called (like alloca) and during function prologues and epilogues which create and destroy the stack frame respectively (read this for a definition of prologue and epilogue, the body is everything in between). In the body of a function, RSP will usually be 16 byte aligned. The promise that needs to be kept is that the alignment of RSP will be known when you enter a function. If you don’t call another function, there’s no reason to keep that promise and the compiler can put RSP on whatever alignment it wants.

RSP won’t be on 16 byte alignment after the call into a function because calling a function pushes an 8 byte return address on the stack. But since it was aligned before the call, its value is still deterministic. RSP will always have a value ending in 0×8 in the prologue. Same goes for the epilogue when it restores RSP back to that value before returning (ret pops the 8 byte return address off of the stack). I like to call this a 16+8 byte alignment. There might be an official fancy term for it that already exists, but I don’t know it.

To summarize: functions that call other functions must always have a 16 byte aligned stack pointer in their body. Functions that don’t call any other functions don’t have to keep the stack pointer on any specific alignment.

Alloca makes 16 byte aligned allocations. The top of the local variable block is also aligned to 16 bytes. 16 is the number though shalt align to, and the number for the alignment shall be 16. That’s the reason for those orange blocks, the padding to keep these blocks of memory aligned. The one higher on the stack only exists if the maximum number of parameters passed to functions is greater than 4 and is an odd number. The existence of the one lower on the stack depends on the size of the non-volatile register save block (including the frame pointer) and the size of the local variables block. Its home is between those two blocks.

Sometimes that lower orange block is 16 bytes, and I don’t know why. The cases where I’ve seen it happen, no saved registers with 24 bytes of local variables is the simplest, the compiler could have had a 24 byte local variable block with the top aligned to 16 bytes and no padding at all (the last 8 bytes would be taken care of by the return address pushed on the stack). Maybe there is no purpose, and this is room for improvement for the compiler.

Aligning all of these pieces might be the best thing since sliced silicon! It lets the compiler easily align local variables. It can also easily determine if aligned SSE load/store instructions can be used or not. If RSP wasn’t aligned, the compiler would have to always use the unaligned instructions or add alignment code to the prologue of certain functions. Aligning every stack frame is an elegant solution, and I think the extra few bytes of padding is a small, worthwhile price to pay for the ability to use optimal SSE instructions. Those instructions can handle data in any format, not just floating point numbers. You might see the compiler use them to move 16 bytes of data quickly. Since the SSE registers are sure to exist and stack data can be aligned, there’s no reason to use the x87 FPU instructions (unless you REALLY want that precision). For x87 the bell tolls. Time marches on.

The Frame Pointer

The frame pointer is not always used. If the stack pointer doesn’t change within the body of a function, it’s not necessary to remember where it was previously. The function epilogue already knows how to restore it before returning. If you use alloca, RSP will jump around and RBP will be used as a fame pointer. Instead of remembering how far RSP moved during the body of the function, the compiler uses the frame pointer as an anchor point to restore RSP before returning. RBP is always the register chosen to be the frame pointer.

I’ve also seen MSVC use a frame pointer for reasons that I am admittedly uncertain of. From what I can tell, the frame pointer will be used if a function needs to pass enough struct parameters by-value and the amount of stack space needed for the temporary objects is within a certain range (the structs section of the last post explains those temporary objects). I think there’s a required ratio between the number of params and their size. IE: 6 structs of 8 bytes, but not 3 structs of 16 bytes. This magic is interesting, but the specifics about how it works are not important for this post. The lesson to take away here is that MSVC may decide a frame pointer is necessary for reasons other than the use of alloca.

**UPDATE 10-2-2012**
In the above situations, I think the compiler decides to use the frame pointer to avoid using longer MOV instructions.

 
  mov    rax,qword ptr [rsp+100h];7 byte instruction
 
  mov    rax,qword ptr [rbp-4h]  ;4 byte instruction

So the compiler decides that the benefit of using the frame pointer outweighs the benefits of having an extra register. Since there are a good number of regsiter, I can’t argue against that. Way to go MSVC compiler team!

I’ve noticed similar behavior when passing literal constants to a function, but didn’t know why it was doing it until now. Make a function with enough parameters in there so some have to go on the stack. Just by changing the values you pass, you can see the instructions change. You may find it move a constant directly to the stack, or move to a register first and then move the register to the stack. It may not always do so because the size of the instructions is smaller, but because it can use 2 separate instructions that can be piped together by the CPU.
**END UPDATE**

Predicting where the frame pointer will point to is a bit tricky. It often points to the top of the local variable block, but it can be anywhere within that block. If that block doesn’t exist, it would be pointing to the top of the saved non-volatile register block (which I guess you could argue is the top of the 0 sized local variable block). There’s no alignment requirement for the frame pointer, so don’t be surprised if you see it with a low byte of, say, 0×9.

Some folks may use the term base pointer instead of frame pointer, and since EBP might point somewhere in the middle of the stack frame I’d say the former term is more fitting for x64. I’m going to continue calling it a frame pointer though, because that’s what I’ve always called in and that’s what the MSDN calls it. That’s the way it is, and that’s the way I likes it, dagnabit!

Since the frame pointer isn’t always used, the linked list of frame pointer and return address pairs going up the callstack doesn’t exist like it did on x86. It sort of exists if every function used the frame pointer, but it still wouldn’t be as easy to follow since there’s a varying amount of space between the address it points to and the address where the last frame pointer is stored (if you have no local variables and only RBP is saved, you’ll get lucky and have the same setup as x86).

It’s still possible to follow the callstack yourself with or without a frame pointer, though. Once you find the prologue or epilogue for a function, you should be able to calculate where the return address is as well as the bounds of the stack frame. With the stack frame diagram in your hand and the song of debugging in your heart, you should be able to find everything on the stack.

Non-Volatile Register Block

The difference between volatile and non-volatile registers was briefly explained in the last post. The non-volatile registers need their values stored before a function uses them, and this block is where they go. It won’t store all the registers. Only the ones it will wind up using will be stored by the prologue and then restored by the epilogue.

The integer non-volatile registers are usually stored on the stack with a push instruction that happens before the stack grows (sometimes they’re moved the home space instead, and that can happen before or after the stack grows). If RBP is used as the frame pointer, it will always be pushed and it will always come before the others. That’s a handy rule to have because it keeps the old frame pointer right next to the return address. I’ve always seen these registers pushed in a particular order: RBP, RBX, RSI, RDI, then R12-15 sequentially.

The SSE registers can’t be pushed, so they will always get moved to this block after the stack grows. I’ve noticed that the order in which they’re moved follows the register numbering order as well.

Local Variables Block

This is where local variables go on the stack, but there are some other bits that go here, too. If you take a data type that is greater than 8 bytes (structs or SSE types) and pass them as by-value parameters, then the calling function will create temporary objects that’s placed in this block (getting that temporary to the called function was explained in the last post). You can think of these temporary objects as local variables that the compiler stuffs in there for you, which is one reason why there isn’t a separate block in the diagram for them. The same thing happens for temporary return objects that are greater than 8 bytes.

There can be alignment padding sprinkled throughout this area, too. Data types that are less than 8 bytes data types can be packed tightly together by the compiler, so padding should be put to good use when able. Don’t expect the order you declare your variables to dictate where they are on the stack. That’s another reason why there’s no “temp object” block. There would be a variable number of temp and local variable blocks weaved between each other, and that makes for a very messy picture.

I’ve yet to see a function preserve volatile registers before making a function call. If it needed to, my gut tells me that there would be space in this local variables block for saving those registers. Just like the trick for large pass-by-value data types, it’s another temporary variable added by the compiler. In my experience, the compiler has always been able to use one of the non-volatile registers to preserve values across a function call. It certainly could store volatile registers before a call if it wanted to. But in order to force that situation, there would need to be an instruction that operated on more than the number of non-volatile registers at once (forcing the use of a volatile register to store one of the required values). So long as such an instruction doesn’t exist, it’s possible for the compiler to avoid the situation entirely. The MSDN doesn’t explain how these registers are saved, which gives me more confidence that the situation won’t come up.

Home Space and Extra Parameter Block

These blocks are contiguous, so all the parameters are right next to each other in order. A function will always keep these at the top of the stack so that they are directly below a called function’s stack frame.

They’re added to the stack frame in the function prologue, and always exist at the top of the stack frame. So if alloca is used within a function they wind up moving. That means that the address of these blocks for one called function will be different than the address for another called function if there’s an alloca call in between. This isn’t a super important fact to remember about the stack frame, but it may help you on your debugging adventures.

Stack Frame Size

If you have a very boring function, it may not have a stack frame at all. It has to be VERY boring though: no local variables, no calls to other functions, no dynamic stack allocations, and no use of non-volatile registers. Each one of those requirements eliminates the need of one of the blocks in the stack frame, so if you satisfy all of them the whole frame can go away. Sure, you still have the return address, but is that enough to call it a stack frame? I would argue that it is not. These kinds of functions are called leaf functions. Break any of those conditions, and you got yourself a stack function.

Brief aside: The definition for a leaf function above is the definition you’ll find in the GNU documentation. This confused me at first, because I figured a leaf function would only have the attribute of not calling any other functions. I think the definition these compilers use is silly, but I will conform to their definition to be consistent.

The size of the stack frame varies, but it will only be made up of the parts in the diagram at the beginning of this post. Since RSP doesn’t often move in the body of a function, you can usually look at the prologue to see how much it adds to RSP and count the number of pushes before it to get the stack frame size. If you use alloca, add up the sizes and don’t forget to account for alignment padding.

The space for parameters number 5 and up for called functions, the lighter blue block in the diagram, is allocated up front in a calling function’s prologue. The compiler goes through a function, looks at all the function calls made, and takes the one with the most parameters. That’s the number it will use when determining how big this block should be. The home space must exist in a function’s stack frame if it calls another function. Otherwise, it can be removed entirely even in unoptimized builds.

If you recall, I had said that every function call requires at least 48 bytes on the stack. Let’s look at a function that does nothing but call another function with no parameters:

//important compiler options: /O2 /Ob0 /GS- /Gy- /GL-
 
  
 
  __int64 g = 0;
 
  void NoParams()
 
  {
 
    ++g;
 
  }
 
  
 
  int main()
 
  {
 
    NoParams();
 
    return 0;
 
  }

And the disassembly:

NoParams: 
 
    inc    qword ptr [g] 
 
    ret 
 
  
 
  main:
 
    sub    rsp,28h
 
    call   NoParams
 
    xor    eax,eax
 
    add    rsp,28h
 
    ret

We’ve taken out all of the optional stuff, and we’re almost able to call this function VERY boring based on our definition from earlier. But lo! It adds 40 bytes of stack just to make that call. 32 of those bytes are for NoParam’s home space. The other 8 bytes are padding to align RSP before the call to NoParams. As explained earlier, main knows that RSP was on a 16 byte boundary before main was called. Calling main pushed an 8 byte return address onto the stack, so adding 8 more bytes puts it right back on the 16 byte alignment.

Add all that up and you get our magic number: 8 bytes of padding to align RSP + 32 bytes home space + an 8 byte return address = 48 bytes. After the call into NoParams, RSP will be 48 bytes further up the stack than it was in the beginning of main. NoParams is a leaf function, so it doesn’t even bother creating a stack frame.

Optional Optimizations

Smart Use of the Home Space

If you turn on the optimization flags, the structure of the stack frame will look the same. The most important thing the optimizer will change is that a function might use its own home space for local variables and non-volatile register storage instead of saving the parameter passing register values. In the eyes of the optimizer, it’s 32 bytes of space that’s always there. It doesn’t even have to worry about cleaning it up. A great man once said, “can’t someone else do it?”

Frame Pointer Omission

When MSVC determines that a frame pointer is necessary, you can’t get rid of it. You can’t make it use a frame pointer, either. MSVC disregards your FPO optimization settings. GCC will also disregard your plea to remove it, but will add frame pointers to every function if you tell it to. I haven’t seen GCC use a frame pointer on its own unless you allocate from the stack.

Tail Calls

If the very last thing a function does is call another function, that call is referred to as a tail call. When the tail call returns, the calling function would only need to run its epilogue. Because the stack is structured so rigidly, an x64 compiler can take advantage of tail calls and make an optimization called tail call elimination more often than x86 compilers. Let’s change our previous example code a bit to see how it does that.

Please ignore the fact that the array is not initialized. I’ve only done that to reduce the disassembly we need to look through.

//important compiler options: /O2 /Ob0 /GS- /Gy- /GL-
 
  #include <stdio.h>
 
  
 
  __int64 g = 0;
 
  void NoParams()
 
  {
 
    ++g;
 
    printf( "%I64d", g );
 
  }
 
   
 
  void Shim()
 
  {
 
    volatile __int64 space[5];
 
     g = space[4];
 
     NoParams();
 
  }
 
   
 
  int main()
 
  {
 
    Shim();
 
    return 0;
 
  }

The disassembly:

NoParams: 
 
    mov    rdx,qword ptr [g]
 
    lea    rcx,[string "%I64d"]
 
    inc    rdx
 
    mov    qword ptr [g],rdx
 
    jmp    qword ptr [__imp_printf]
 
   
 
  Shim:
 
    sub    rsp,58h
 
    mov    rax,qword ptr [rsp+40h]
 
    inc    qword ptr [g],eax
 
    add    rsp,58h
 
    jmp    NoParams
 
   
 
  main:
 
    sub    rsp,28h
 
    call   Shim
 
    xor    eax,eax
 
    add    rsp,28h
 
    ret

Let’s focus on the Shim function: the call into NoParams was changed to a jump, and Shim cleans up its stack frame before the jump. How does this not break the promise of home space for NoParams? Think about where RSP will wind up. It’s going to be in the same spot as it was at the start of Shim, and it will point to the return address that will take us back into main. There’s also 32 bytes on the stack below RSP for Shim’s homes pace. NoParams can reuse that for its own home space. It’s as if Shim never even existed. Ta da…a nifty little magic trick!

NoParams takes this “I wasn’t here” trick to the next level. Not only does it jump to printf, it forgoes creation of its own stack frame entirely. I’ve only seen this happen when tail call elimination is used. For example: if you move the “++g” after the call to printf, tail call elimination cannot happen and NoParams will create a stack frame.

So long as the number of parameters for the tail call are less than or equal to the calling function’s, then this optimization can be used. Destructors don’t mess this up either, though the tail call will wind up being one of the destructors rather than the function you typed at the end. The compiler needs to keep function order the same, and the destructors are guaranteed to come last.

Tail call elimination is a wonderful optimization! The downside is that function that makes the tail call won’t show up in the callstack. Since there’s no return address into it, a debugger has no idea it was called. If you see a missing call, don’t suspect the callstack is corrupt. Double check your calls and you’ll probably find a call to the function that isn’t showing up in your callstack.

Go forth and compile!

I encourage you to experiment on your own to see how the compiler makes use of the stack frame. Try different data types, including structs. Watching it move blocks of data around with SSE instructions is kind of fun, and there are some interesting patterns in the way it chooses which variable goes where on the stack. Stepping through optimized x64 disassembly is almost soothing compared to doing the same with x86 disassembly. Data flows from the stack to registers and back so smoothly, and it’s always easy to find your footing with a stable, unmoving RSP register. If you find any particularly cases where the compiler does some interesting things, let me know!

#AltDevBlog

Rich Skorski