Hello interwebs! As the title suggests this is the 6th part of the C / C++ Low Level Curriculum series I’ve been doing. In this installment we’ll be starting to look at conditional statements, and what the code that you’re asking the compiler to generate when you use them looks like (at least before the optimiser gets to it…).
Just in case anyone is unclear about what they are, conditionals are the language features that allow us control over which parts of our code get executed. At face value, the subject of conditionals might seem a simple one, but it is precisely because it seems simple – and because so much else builds on top of it – that it is the first topic that I’ve chosen to look at in detail after function calls.
Though we won’t get around to all of them in this post, our look at conditionals will take us on a tour through a representative sample of x86 disassembly generated by if statements, the conditional operator (or “ternary operator”, or “question mark”), and switch statements; and whilst we look at all of these we’ll also be looking at disassembly generated by the (built in!) relational and logical operators that are used with them (i.e. ==, !=, <=, >=, >, <, !, &&, and ||).
Prologue
Firstly, I’d like to apologise to anyone who reads these posts regularly for the fact that my rate of posting has slowed down – I will hopefully speed up again to the regular 2 week posting cycle in the near future.
Secondly, here are the backlinks for anyone who wants to start from the beginning of the series (warning: it might take you a while, the first few are quite long):
- /2011/11/09/a-low-level-curriculum-for-c-and-c/
- /2011/11/24/c-c-low-level-curriculum-part-2-data-types/
- /2011/12/14/c-c-low-level-curriculum-part-3-the-stack/
- /2011/12/24/c-c-low-level-curriculum-part-4-more-stack/
- /2012/02/07/c-c-low-level-curriculum-part-5-even-more-stack/
Generally I will try to avoid too much assumed knowledge; but if something comes up that I’ve explained previously, or that I know another ADBAD author has covered already then I will just link to it; this implies that you, dear reader, should assume that I assume you will read anything I link to if you want to make complete sense of the article :)
Compiling and running code from this article
I assume that you are using Windows, are familiar with the VS2010 IDE, and comfortable writing, running, and debugging C++ programs.
As with the previous posts in this series, I’m using a win32 console application made by the “new project” wizard in VS2010 with the default options (VS2010 express edition is fine).
The only change I make from the default project setup is to turn off “Basic Runtime Checks” to make the generated assembler more legible (and significantly faster…); see this previous post for details on how to do this.
To run code from this article in a VS2010 project created this way, open the .cpp file that isn’t stdafx.cpp and replace everything in it with text copied and pasted from the code box.
The disassembly we look at is from the debug build configuration, which generates “vanilla” unoptimised win32 x86 code.
Instructions and Mnemonics: an aside
I’ve just realised that so far in this series I have typically been using the term instruction when referring to an assembler mnemonic.
I felt that I should point out that this isn’t 100% accurate, because whilst assembler mnemonics are normally thought of as having a 1:1 correspondence to binary CPU instructions, they are not actually instructions.
In fact, in x86 assembler, the mnemonics often actually have a 1:x relationship with the corresponding opcodes, because multiple variants of each mnemonic exist that differ in the types and sizes of their operands.
This is not something you should worry yourself about too much, as it’s a fairly harmless Kenobiism, but I still felt I should point it out if I was going to carry on doing it ;)
Conditionals
The best place to start is, as someone or other famously once remarked, at the beginning; so let’s start with the most basic form of the if statement.
Before anyone mentions it, I know I could have omitted the curly braces around iLocal = 1; on line 9. If you’re the kind of person who’s so lazy that you like to leave out curly braces in these situations then that’s up to you; but I would just like to point out that there is probably a special place in one of the deeper and less pleasant circles of the Hell I don’t believe in that is reserved for your sort – just a couple of floors up from those who do the same thing with loops.
Also, I’ve left the #include “stdafx.h” in the code box so that your line numbers match mine if you’re working through this yourself.
    #include "stdafx.h"

    int main(int argc, char* argv[])
    {
        int iLocal = 0;

        if( argc < 0 )
        {
            iLocal = 1;
        }

        return 0;
    }
Anyway, as usual if you’re looking at this in VS2010 then copy and paste the above code over whichever is your project’s main .cpp file, put a breakpoint on line 7, tell Visual Studio to compile and run, wait for the breakpoint to be hit, then right click in the source window and choose “Go To Disassembly”. You should now be seeing something like this:
As we already know, the assembler above int iLocal = 0; is the function prologue (or preamble) and the assembler after the closing brace of main() is the function epilogue.
The specific disassembly we’re interested in is between lines 7 and 13 of the source code that is shown inline with the disassembly, so here it is pasted into a code window (N.B. the addresses corresponding to the disassembly instructions will almost certainly differ on your screen if you’re running this yourself…):
     1      7:     if( argc < 0 )
     2  010D20B0  cmp         dword ptr [argc],0
     3  010D20B4  jge         main+1Dh (10D20BDh)
     4      8:     {
     5      9:         iLocal = 1;
     6  010D20B6  mov         dword ptr [iLocal],1
     7     10:     }
     8     11:
     9     12:     return 0;
    10  010D20BD  xor         eax,eax
    11     13: }
Straight away, there are a couple of new assembler mnemonics we’ve not come across so far in this series of posts. We’ll cover these as we come to them.
Line 2 compares argc against 0. The cmp instruction has no immediate effect on the flow of execution; it subtracts its second operand from its first, throws the result away, and records the outcome of the comparison in the status flags of an internal CPU register known as EFLAGS.
Line 3 uses the mnemonic jge, which means jump if greater or equal. It causes a jump to the address supplied as its operand (0x010D20BD) if the preceding cmp left EFLAGS indicating that its first operand was greater than or equal to its second, treating both as signed values – i.e. if argc is greater than or equal to 0 then execution will jump past the instructions generated by the block of code controlled by the if.
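If it helps to see the cmp / jxx pairing in more familiar terms, here is a rough C++ model of it. This is only a sketch: the Flags struct and the Cmp, Je, Jne, Jl, Jge, Jg, FirstExample and CheckFlagsModel names are all made up for illustration, and it models only the status flags that the signed and equality jumps in this post look at. It assumes the standard x86 behaviour that cmp does a subtraction and keeps only the flags, and that jge is taken when the sign flag equals the overflow flag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the EFLAGS status bits that cmp updates.
struct Flags
{
    bool zf; // zero flag: the result was zero
    bool sf; // sign flag: top bit of the result
    bool of; // overflow flag: signed overflow occurred
    bool cf; // carry flag: unsigned borrow occurred
};

// Models "cmp a, b": computes a - b, discards the result, keeps the flags.
Flags Cmp(std::int32_t a, std::int32_t b)
{
    const std::uint32_t ua = static_cast<std::uint32_t>(a);
    const std::uint32_t ub = static_cast<std::uint32_t>(b);
    const std::uint32_t diff = ua - ub; // wraps around like the hardware does

    Flags f;
    f.zf = (diff == 0);
    f.sf = ((diff >> 31) != 0);
    f.cf = (ua < ub);
    // Signed overflow: the operands had different signs and the sign of the
    // result differs from the sign of the first operand.
    f.of = ((((ua ^ ub) & (ua ^ diff)) >> 31) != 0);
    return f;
}

// Conditions tested by the conditional jumps that appear in this post.
bool Je (const Flags& f) { return f.zf; }
bool Jne(const Flags& f) { return !f.zf; }
bool Jl (const Flags& f) { return f.sf != f.of; }
bool Jge(const Flags& f) { return f.sf == f.of; }
bool Jg (const Flags& f) { return !f.zf && (f.sf == f.of); }

// Mirrors the first example, but returns the value iLocal ends up with.
int FirstExample(int argc)
{
    int iLocal = 0;
    if( !Jge(Cmp(argc, 0)) ) // jge skips the assignment when argc >= 0
    {
        iLocal = 1;
    }
    return iLocal;
}

// A few spot checks you could call from main().
void CheckFlagsModel()
{
    assert(FirstExample(1) == 0);  // jump taken, assignment skipped
    assert(FirstExample(0) == 0);  // equal also counts for jge
    assert(FirstExample(-1) == 1); // jump not taken, assignment runs
}
```

Note that the sign and overflow flags have to be looked at together; looking at the sign of the subtraction alone would give the wrong answer whenever the subtraction overflows.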
Hold on a minute…
So, we’ve only covered the most basic form of an if statement and we’ve already encountered a major difference between what we might think we’re asking the compiler to do, and the code it’s generating.
The intuitive way to think about an if block in a high level language is that if the condition of the if is met, then execution will step into the curly brace delimited block of code it controls.
However, the assembler is clearly testing the logical opposite of what we’ve asked it to, and if that condition is met then it is skipping over the code block controlled by the if.
This is because, at the assembler level, instructions are executed in sequential order unless a jump instruction tells it to do otherwise – and so assembler has no equivalent to the high level concept of a curly brace delimited “code block”. The upshot of this is that the high level notion of “stepping into” a code block is implemented at the assembler level by “not skipping over” the code the block has generated.
Clearly these two behaviours are logically isomorphic (i.e. produce the same output given the same input), but the high level version is easier for the human mind to cope with intuitively, and the version generated by the compiler better suits the sequential-execution-unless-tampered-with behaviour of the underlying machine.
Just for the sake of clarity let’s re-write the C++ code in a form that matches what the assembler we just looked at does, using the C++ keyword goto:
    #include "stdafx.h"

    int main(int argc, char* argv[])
    {
        int iLocal = 0;

        // corresponding original code in comments to the right...
        if( argc >= 0 ) goto GreaterEqualZero;  //if( argc < 0 )
                                                //{
        iLocal = 1;                             //    iLocal = 1;
                                                //}
    GreaterEqualZero:

        return 0;
    }
NOTE: Ironically (though unsurprisingly) this C++ code generates different assembler to the original code. Please don’t worry about this.
if … else if … else
So let’s take a look at a more complicated if construct:
    #include "stdafx.h"

    int main(int argc, char* argv[])
    {
        int iLocal = 0;

        if( argc == 0 )
        {
            iLocal = 13;
        }
        else if( argc != 42 )
        {
            iLocal = (6 * 9);
        }
        else
        {
            iLocal = 1066;
        }

        return 0;
    }
This code generates the following assembler, which given what we saw in the previous example is more or less exactly what you’d expect:
        7:     if( argc == 0 )
    002020B0  cmp         dword ptr [argc],0
    002020B4  jne         main+1Fh (2020BFh)
        8:     {
        9:         iLocal = 13;
    002020B6  mov         dword ptr [iLocal],0Dh
    002020BD  jmp         main+35h (2020D5h)
       10:     }
       11:     else if( argc != 42 )
    002020BF  cmp         dword ptr [argc],2Ah
    002020C3  je          main+2Eh (2020CEh)
       12:     {
       13:         iLocal = (6 * 9);
    002020C5  mov         dword ptr [iLocal],36h
       14:     }
       15:     else
    002020CC  jmp         main+35h (2020D5h)
       16:     {
       17:         iLocal = 1066;
    002020CE  mov         dword ptr [iLocal],42Ah
       18:     }
       19:
       20:     return 0;
    002020D5  xor         eax,eax
The main things to note about this code are:
- Each if and else if condition is implemented as a cmp followed by a jxx – there are two new ones in here: je (jump equal) and jne (jump not equal)
- As in the first example, each if and else if condition is causing the compiler to generate the logically opposite test to the high level language, and skipping the assembler generated by the controlled block of code if it succeeds
- The test for the first if jumps to the condition of the else if when its condition is not met. If there were more chained else if statements then this pattern would continue through them.
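- Each controlled block except the last ends with an unconditional jmp that takes execution past the rest of the if … else if … else construct; the final else block needs no jump because it simply falls through to the code that follows it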
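As with the first example, it can help to see the points above written back out in C++ using goto. The following is a sketch only: the function and label names are invented, and I’ve made the functions return iLocal (rather than 0, as main() does) so that the two versions can be compared.

```cpp
#include <cassert>

// The original if ... else if ... else chain from the example above.
int ChainStructured(int argc)
{
    int iLocal = 0;
    if( argc == 0 )       { iLocal = 13; }
    else if( argc != 42 ) { iLocal = (6 * 9); }
    else                  { iLocal = 1066; }
    return iLocal;
}

// The same logic laid out the way the disassembly does it.
// Label names are invented for illustration.
int ChainWithGotos(int argc)
{
    int iLocal = 0;

    if( argc != 0 ) goto TestElseIf;  // cmp [argc],0   / jne (opposite test)
    iLocal = 13;
    goto EndOfChain;                  // jmp past the rest of the chain

TestElseIf:
    if( argc == 42 ) goto ElseBlock;  // cmp [argc],2Ah / je  (opposite test)
    iLocal = (6 * 9);
    goto EndOfChain;                  // jmp past the else block

ElseBlock:
    iLocal = 1066;                    // last block: just falls through

EndOfChain:
    return iLocal;
}

// A few spot checks you could call from main().
void CheckChainSketch()
{
    assert(ChainWithGotos(0) == 13);
    assert(ChainWithGotos(7) == 54);
    assert(ChainWithGotos(42) == 1066);
}
```

Each label corresponds to one of the jump targets in the disassembly, which is all a “code block” really amounts to once the compiler has finished with it.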
That was all pretty straightforward for once. Joy.
Next, let’s take a look at the effects of the && and || operators:
    #include "stdafx.h"

    int main(int argc, char* argv[])
    {
        int iLocal = 0;

        if( ( argc >= 7 ) && ( argc <= 13 ) )
        {
            iLocal = 1024;
        }
        else if( argc || ( !argc ) || ( argc == 69 ) ) // deliberately nonsensical test
        {
            iLocal = 666;
        }

        return 0;
    }
This generates the following assembler, which is much more interesting than the first if … else if example:
     1      7:     if( ( argc >= 7 ) && ( argc <= 13 ) )
     2  00F120B0  cmp         dword ptr [argc],7
     3  00F120B4  jl          main+25h (0F120C5h)
     4  00F120B6  cmp         dword ptr [argc],0Dh
     5  00F120BA  jg          main+25h (0F120C5h)
     6      8:     {
     7      9:         iLocal = 1024;
     8  00F120BC  mov         dword ptr [iLocal],400h
     9  00F120C3  jmp         main+3Eh (0F120DEh)
    10     10:     }
    11     11:     else if( argc || ( !argc ) || ( argc == 69 ) )
    12  00F120C5  cmp         dword ptr [argc],0
    13  00F120C9  jne         main+37h (0F120D7h)
    14  00F120CB  cmp         dword ptr [argc],0
    15  00F120CF  je          main+37h (0F120D7h)
    16  00F120D1  cmp         dword ptr [argc],45h
    17  00F120D5  jne         main+3Eh (0F120DEh)
    18     12:     {
    19     13:         iLocal = 666;
    20  00F120D7  mov         dword ptr [iLocal],29Ah
    21     14:     }
    22     15:
    23     16:     return 0;
    24  00F120DE  xor         eax,eax
    25     17: }
Now, I don’t know about you but the first time I saw assembler generated by using && and || I was amazed by the sheer simplistic audacity of it – I think it’s because I’m not an assembler programmer, but I expected it to be a little more complicated and fiddly than this.
Looking in detail at the code generated for the if statement using && (lines 2 to 5), we can see that it is using another two conditional jump instructions we’ve not yet seen: jl (jump less) and jg (jump greater), and as before it is testing the logically opposite condition to that specified by the high level code.
More interestingly, in order to implement &&, the compiler simply concatenates the separate tests – if either of these tests fails they will cause execution to jump past the block of code controlled by the if statement. This means that the block of code controlled by the if will only be executed if both tests are passed, which clearly implements a logical AND.
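Written back out as C++ with goto, the && half of the example looks something like the sketch below. The function and label names are invented for illustration, and I’ve left out the else if so that only the && behaviour is on show; the functions return iLocal so there is something to check.

```cpp
#include <cassert>

// The && test from the example above, in its original structured form.
int AndStructured(int argc)
{
    int iLocal = 0;
    if( ( argc >= 7 ) && ( argc <= 13 ) )
    {
        iLocal = 1024;
    }
    return iLocal;
}

// The same test laid out like its disassembly: two opposite tests in a row,
// both jumping to the same place just past the controlled block.
int AndWithGotos(int argc)
{
    int iLocal = 0;

    if( argc < 7 )  goto SkipBlock;  // cmp [argc],7   / jl
    if( argc > 13 ) goto SkipBlock;  // cmp [argc],0Dh / jg
    iLocal = 1024;                   // only reached if neither jump was taken

SkipBlock:
    return iLocal;
}

// A few spot checks you could call from main().
void CheckAndSketch()
{
    assert(AndWithGotos(6) == 0);
    assert(AndWithGotos(7) == 1024);
    assert(AndWithGotos(13) == 1024);
    assert(AndWithGotos(14) == 0);
}
```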
If we now turn our attention to the code generated by the if statement using || (lines 12 to 17) we see a similar pattern of consecutive conditional tests, though clearly it must be different since it implements conditions joined by ||.
The first thing to notice is that the first two tests done by the assembler are logically the same as their high level equivalents. This bucks the trend we have seen so far, but why?
Well, the address passed as the operand to each of the conditional jumps on lines 13 and 15 will move execution past the rest of the tests, to the start of the controlled code block. Unsurprisingly though, the last test of the || if statement (lines 16 & 17) follows the standard test-the-opposite-and-jump-past idiom we’ve come to expect from an if statement.
The jump-into-controlled-block behaviour of all but the last || conditional means that as soon as any one of the tests is passed the controlled code will be executed, which clearly implements a logical OR.
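Here is the || half of the example given the same goto treatment, followed by the same shape for three arbitrary conditions. Again this is just a sketch with invented function and label names, and the functions return a value purely so the result can be checked.

```cpp
#include <cassert>

// The (deliberately nonsensical) || test from the example above,
// laid out like its disassembly.
int OrWithGotos(int argc)
{
    int iLocal = 0;

    if( argc != 0 )  goto EnterBlock; // cmp [argc],0   / jne : same sense as the C++
    if( argc == 0 )  goto EnterBlock; // cmp [argc],0   / je  : same sense as the C++
    if( argc != 69 ) goto SkipBlock;  // cmp [argc],45h / jne : last test is inverted

EnterBlock:
    iLocal = 666;

SkipBlock:
    return iLocal;
}

// The same shape for three arbitrary conditions: ( a || b || c ).
bool OrShape(bool a, bool b, bool c)
{
    bool bEntered = false;

    if( a )  goto EnterBlock;  // any test but the last jumps INTO the block
    if( b )  goto EnterBlock;
    if( !c ) goto SkipBlock;   // the last test jumps PAST the block

EnterBlock:
    bEntered = true;

SkipBlock:
    return bEntered;
}

// A few spot checks you could call from main().
void CheckOrSketch()
{
    assert(OrWithGotos(0) == 666);  // caught by the second test
    assert(OrWithGotos(1) == 666);  // caught by the first test
    assert(OrShape(false, false, false) == false);
    assert(OrShape(false, false, true) == true);
}
```

Because one of the first two tests in OrWithGotos() always succeeds, the third is never reached, which is exactly why the original test was described as nonsensical.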
Aside: Lazy Evaluation
I’m sure that most – if not all – of you will have heard that C++ has “lazy evaluation” of && and || (more formally known as short-circuit evaluation). If you’ve never been 100% sure of what this means, you’ve just seen it in action in this block of assembler!
The && will fail if either of its operands fails; so if the first test fails it will never do the second (or third, or fourth …).
Similarly the || will succeed if either of its operands succeeds; so if the first test passes it will never do the second (or third, or fourth …).
Since neither necessarily evaluates all of its operands this makes them technically “lazy”; which in this circumstance you can read as awesome, elegant, and efficient (for certain definitions of efficient).
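If you’d like to watch this happen from the C++ side rather than in the disassembly, something along the lines of the sketch below will do it. The Operand() helper, the g_evaluations counter and the other function names are all hypothetical; the helper simply counts how many times it gets called before handing back the value it was given.

```cpp
#include <cassert>

// Hypothetical counter: how many operands have been evaluated so far.
static int g_evaluations = 0;

// Hypothetical helper: counts each call, then returns the value it was given.
bool Operand(bool bValue)
{
    ++g_evaluations;
    return bValue;
}

// Returns how many operands of ( a && b && c ) were actually evaluated.
int CountAndEvaluations(bool a, bool b, bool c)
{
    g_evaluations = 0;
    const bool bResult = Operand(a) && Operand(b) && Operand(c);
    (void)bResult; // only the side effect (the count) is of interest here
    return g_evaluations;
}

// Returns how many operands of ( a || b || c ) were actually evaluated.
int CountOrEvaluations(bool a, bool b, bool c)
{
    g_evaluations = 0;
    const bool bResult = Operand(a) || Operand(b) || Operand(c);
    (void)bResult; // only the side effect (the count) is of interest here
    return g_evaluations;
}

// A few spot checks you could call from main().
void CheckLazyEvaluation()
{
    assert(CountAndEvaluations(false, true, true) == 1);  // && stops at the first failure
    assert(CountAndEvaluations(true, true, true) == 3);   // all pass, so all are evaluated
    assert(CountOrEvaluations(true, false, false) == 1);  // || stops at the first success
    assert(CountOrEvaluations(false, false, false) == 3); // all fail, so all are evaluated
}
```

This is also the behaviour that makes it safe to test a pointer and then use it in the same condition, since the right hand side of the && is never evaluated when the pointer test fails.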
Summary
The main points to take away from the assembler we’ve looked at in this post are that:
- The conditional test that you see in the disassembly is likely to be the logical opposite of the test the high level code is asking for…
- …and the conditional jump will typically be jumping over the assembler that is generated by the “code block” controlled by the conditional in the high level code.
- This is because there is no concept of a “code block” at the level of assembler.
More or less all control code boils down to various combinations of conditionals and jumps at the assembly level; and being familiar with the assembler mnemonics that are used to implement these C / C++ features, and the various ways that they are used will almost certainly prove invaluable when you find yourself in the unenviable situation of a crash deep within some library code that you don’t have symbols for (or that your debugger can’t find symbols for).
Incidentally if you find yourself lost in code that you should have symbols for but your machine refuses to find them, you might try this post by Bruce Dawson to see if it helps ;)
Next time we’ll continue looking at conditionals with the conditional operator (also known as the “ternary operator” or more commonly the question mark), and the switch statement.
Also, thanks to Fabian and Bruce for giving this a once-over and offering sage advice on content.
Disclaimers
I am pretty sure that the code in this article doesn’t demonstrate all the relational operators; so I’m leaving it to you, dear reader, to try out the ones I left out to see what they do :)
I also avoided writing any conditions for the if statements that contained function calls. Clearly this would make the assembler generated by the test code significantly more complex, but assuming that you have read the previous posts on the assembler generated when calling functions, you should be able to make sense of it by yourself. I have to admit that I also partly avoided doing this so I could steer clear of operator overloading. That’s for later. Probably.