I love C++. I’m a big fan of its expressiveness. I think carefully constructed abstractions can be enormously powerful, and that C++ gives you the tools to construct those abstractions efficiently.
There are, however, some problems that C/C++ simply can’t solve well. Sometimes the best tool for the job isn’t going to be your favourite tool, and rather than try to twist the tool you like to fit the job at hand, you should leave your comfort zone and try something new.
This article explores one such problem, and what happened when I left my comfort zone.
This problem came up on a C++ mailing list I subscribe to:
I’m working on a problem where integer multiplicative overflow is plausible and I want to check for it.
The poster had already constructed a solution that looked like this:
overflow = (a!=0 && (a*b)/a!=b)
Given that, mathematically, (a*b)/a == b, the poster was concerned that the optimiser would spot that and eliminate the test. He also wondered if there were better ways of doing it.
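To make that concrete, here’s a tiny sketch (the values are arbitrary, and it assumes a 32-bit two’s-complement int where the multiplication simply wraps, which is what typically happens in practice even though signed overflow is formally undefined behaviour):

#include <stdio.h>

int main(void)
{
    int a = 100000;
    int b = 100000;
    int c = a * b;                 // mathematically 10,000,000,000, but wraps to 1,410,065,408
    printf("a*b   = %d\n", c);
    printf("c / a = %d\n", c / a); // 14100, which is != b, so the test reports overflow
    return 0;
}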
The question prompted a fair amount of discussion. The general consensus was that the test worked, and that the compiler would not optimise it away.
Some other solutions were proposed, none of them unambiguously better.
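For illustration only (I’m not claiming this exact variant came up in the thread), one well-known alternative is to do the multiplication at 64-bit width and range-check the result:

#include <stdint.h>
#include <limits.h>

// Hypothetical alternative, not the poster's code: widen to 64 bits, then
// check that the full product fits back into an int.
int SafeMultiplyWide( int a, int b, int* result )
{
    int64_t wide = (int64_t)a * (int64_t)b;
    *result = (int)wide;
    return (wide >= INT_MIN && wide <= INT_MAX);
}

It avoids the divide, but trades it for a 64-bit multiply and two compares, so it isn’t obviously a win either.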
But it got me thinking. Given that every CPU architecture I have any experience with has an overflow bit, shouldn’t we be able to check that? A divide is relatively expensive, and the test includes a compare and a branch to set a boolean that will then almost certainly be used in another compare and branch.
Wouldn’t it be better to just check the overflow bit?
So I started with this as my baseline:
// returns true (non-zero) if the result of a*b is within the valid range of an int
// the value of the multiplication is stored in the address pointed to by result
int SafeMultiply( int a, int b, int* result )
{
    int c = a * b;
    *result = c;
    return (a==0 || c/a==b);
}
Which can be used like this:
if (SafeMultiply(a, b, &c))
{
    // do something with c
}
else
{
    // report an error
}
I then spent about an hour relearning some assembly code. I knew it could be done, but the specific opcodes, mnemonics, and inline-asm syntax were fuzzy. It’s probably been 16 years since I actually wrote any assembler. Debugging it, sure; looking at the generated code, yes. But writing assembler just hasn’t come up in my work for a very long time.
I wrote the first version on my lunch break at work, so I used MSVC, and thus MASM-syntax inline assembler. What I wrote looked something like this:
int SafeMultiply( int a, int b, int* result )
{
    char good;
    __asm{
        mov eax, a               // load a
        imul eax, b              // eax = a * b, sets the overflow flag on overflow
        setno good               // good = 1 if the overflow flag is NOT set
        mov ebx, result
        mov dword ptr [ebx], eax // *result = a * b
    }
    return good;
}
MASM made it easy. I was pleased. I should have stopped there. What happened next, however, was that I decided to write this article for #AltDevBlogADay.
When I write for #AltDevBlogADay, I do it at home. On my laptop. Which doesn’t have MSVC. The #AltDevBlogADay readers deserve, I thought, more than just MASM. They deserve a GNUC-compatible version. They deserve that, and a comparison of the generated code for both the inline assembler and the pure C, compiled with -O3.
What I learnt, my friends, is that the GNUC inline assembler syntax is evil. Pure evil. Evil genius, perhaps. Remarkably powerful in ways I hadn’t even considered possible. But still evil.
It is not for the faint-hearted, and might not be the optimal solution, but here it is:
int SafeMultiply( int a, int b, int* result )
{
    char good;
    __asm__(
        "imull %2, %3;\n\t"  /* multiply, setting the overflow flag on overflow */
        "setno %0;\n\t"      /* good = 1 if the overflow flag is NOT set */
        "movl %3, %1;"       /* store the product in *result */
        : "=r" (good), "=r" (*result)
        : "r" (a), "r" (b)
    );
    return good;
}
Note that I’ve shown each version here on its own. In reality it’s all mashed together with appropriate _MSC_VER and __GNUC__ #ifdefs.
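For what it’s worth, the combined function ends up looking something like this (a sketch of the arrangement rather than the exact file):

int SafeMultiply( int a, int b, int* result )
{
    char good;
#if defined(_MSC_VER)
    __asm{
        mov eax, a
        imul eax, b
        setno good
        mov ebx, result
        mov dword ptr [ebx], eax
    }
#elif defined(__GNUC__)
    __asm__(
        "imull %2, %3;\n\t"
        "setno %0;\n\t"
        "movl %3, %1;"
        : "=r" (good), "=r" (*result)
        : "r" (a), "r" (b)
    );
#endif
    return good;
}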
I shan’t waste your time trying to explain the syntax here. If you’d like to get into it (and I recommend you do; it is fascinating), the two pages I give at the bottom of this article could be a good place to start.
Anyway, I promised a look at the generated code. This was compiled on OS X with GCC, using -O3 and the -save-temps command-line option. I have stripped out the function-call header code, and pared it down to only what is relevant: the different bodies of the functions.
The C implementation generates this:
movl    %esi, %ecx
imull   %edi, %ecx
movl    %ecx, (%rdx)
movl    $1, %eax
testl   %edi, %edi
je      L4
movl    %ecx, %edx
movl    %ecx, %eax
sarl    $31, %edx
idivl   %edi
cmpl    %esi, %eax
sete    %al
movzbl  %al, %eax
L4:
leave
ret
And the inline assembler (GNUC version), generates this:
imull   %edi, %esi;
setno   %dil;
movl    %esi, %eax;
movl    %eax, (%rdx)
movsbl  %dil, %eax
leave
ret
Now I’m no expert, and I’ve not timed it, but I can see that the latter is smaller (and has neither the divide nor the branch), and I think it’s a safe bet that it’s also faster.
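If you want to check the “faster” part for yourself, a rough and admittedly naive harness along these lines would do; it assumes you link it against whichever SafeMultiply you compiled in, and just reads the C clock around a big loop:

#include <stdio.h>
#include <time.h>

int SafeMultiply( int a, int b, int* result );  /* either implementation */

int main(void)
{
    int result = 0;
    long overflows = 0;
    clock_t start = clock();
    /* This range straddles INT_MAX, so both the overflow and
       non-overflow paths get exercised. */
    for (int a = 40000; a < 50000; ++a)
        for (int b = 40000; b < 50000; ++b)
            if (!SafeMultiply(a, b, &result))
                ++overflows;
    clock_t end = clock();
    printf("overflows: %ld, seconds: %f\n",
           overflows, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}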
Thanks for reading!
With thanks to everyone who gave me feedback on my initial draft, and especially to Fabian Giesen and Don Olmstead for their constructive criticism, and for the title suggestion.