Prologue
Hello and welcome to the 2nd part of the C / C++ low level curriculum series of posts that I’m currently doing.
Here’s a link to the first one if you missed it: /2011/11/09/a-low-level-curriculum-for-c-and-c/
This post is going to be a little lighter than most of the other posts in the series, primarily because this post is vying for my spare time with my urge to save a blonde girl with pointy ears from the skinny androgynous Demon Lord of extended monologue in a virtual universe powered by three equilateral triangles.
Before we continue, I’d like to quickly bring to public note a book that has now been recommended to me many times as a result of the first post: http://www1.idc.ac.il/tecs
I can’t personally vouch for it, but I fully intend to buy it and grok its face off as soon as I get some spare time in my schedule. This book looks awesome, and if it is half as good as it looks to be then reading it should be an extremely worthwhile investment of your time…
Assumptions
The next thing on my agenda is to discuss assumptions.
Assumptions are dangerous. Even by writing this I am making many assumptions – that you have a computer, that you can read and understand The Queen’s English, and that on some level you care about understanding the low-level of C++ to name but a few.
Consequently, dear reader, I feel that it’s worth mentioning what I assume about you before I go any further.
The important thing, I guess, that I should mention is that I assume that you are already familiar with and comfortable using C and/or C++. If you’re not, then I’d advise you to go and get comfortable before you read any more of this :)
Data Types?
So, again, I find myself almost instantly qualifying the title of the post and explaining what I mean when I say data types.
What I am talking about is the “Fundamental” types of C++ and what you should know about how they relate to the machine level – even this seemingly straightforward aspect of C++ is not necessarily what you would expect; especially when dealing with multiple target platforms.
Whilst this isn’t the kind of information that will suddenly improve your code by an order of magnitude, it is (in my opinion) one of the key building blocks of understanding C / C++ at the low level; as it has tonnes of potential knock on effects in terms of speed of execution, memory layout of complex types etc.
Certainly, no-one ever sat me down and explained this to me, I just sort of absorbed it or looked it up over the years.
Fundamental and Intrinsic Types
The fundamental types of C/C++ are all the types that have a language keyword.
These are not to be confused with the intrinsic types which are the types that are natively handled by some given CPU (i.e. the data types that the machine instructions of that CPU operate on).
Whenever you use new hardware you should check how the compiler for your platform is representing your fundamental types. The best way to do this is (can you guess?) to look at the disassembly window.
These days all fundamental types of C++ can be represented by an intrinsic type on most platforms; but you definitely shouldn’t take this for granted, it has only really been the case since the current console hardware generation.
There are 3 categories of fundamental type: integer, floating, and void.
As we all know, the void type cannot be used to store values. It is used to specify “no type”.
For both integral and floating point types there is a progression of types that can hold larger values and/or have more numerical precision.
For integers this progression is (from least to most precision) char, short, int, long, long long; and for floats: float, double, long double.
Clearly, the numerical value limits that a given type must be able to store mandate a certain minimum data size for that type (i.e. number of bits needed to store the prescribed values when stored in binary).
Sizes of Fundamental types
As far as I have been able to discover, the C and C++ standards make no explicit guarantee about the specific size of any of the Fundamental types
There are, however, several key rules about the sizes of the various types which I have paraphrased below:
- A char must be a minimum of 8 bits.
- sizeof( char ) == 1.
- If a pointer of type char* points at the very first address of a contiguous block of memory, then every single address in that block of memory must be traversable by simply incrementing that pointer.
- The C standard specifies a value that each of the integer types must be able to represent (see page 33 in this .pdf of the C standard if you want the values - see the header of a standard conformant C++ implementation for details of the values used by your compiler).
- The C++ standard says nothing about size, only that “There are five standard signed integer types : “signed char”, “short int”, “int”, “long int”, and “long long int”. In this list, each type provides at least as much storage as those preceding it in the list.” (see page 75 in this .pdf of the latest C++ standard I could find).
- 4 & 5 have similar rules in the C and C++ standard for the progression of floats.
Helpfully, MSDN has a useful summary of this information (though it’s partly MSVC specific, it’s a good starting point).
Despite all this leeway in the standard, the size of the fundamental types across PC and current gen console platforms is (to the best of my knowledge) relatively consistent.
The C++ standard also defines bool as an integral type. It has two values, true and false, which can be implicitly converted to and from the integer values 1 and 0 respectively; and is the return type of all the logical operators (==, !=, >, < etc.).
As far as I have been able to ascertain, the standard only specifies that bool must be able to represent a binary state. Consequently, the size of bool can vary dramatically according to compiler implementation, and even within code generated by the same compiler – I have seen it vary between 1 and 4 bytes on platforms I’ve used – I have always assumed that this was down to speed of execution vs. storage size tradeoffs.
This ‘size of bool’ issue resulted in the use of bool being banned from use in complex data structures at least one company that I have worked at. I should clarify that this was a ‘proactive’ banning based on the fact that it might cause trouble rather than one that resulted from trouble actually having been caused.
We should also mention enums at this point (thanks John!) – the standard gives the storage value of an enumerated type the liberty to vary in size depending on the range of values represented by each specific enum – even within the same codebase – so an enum with values < 255 (or <= 256 members with no values assigned) may well have sizeof() == 1, and one which has to represent 32 bit values would typically have sizeof() == 4.
This brings us onto pointers. Strictly speaking pointers are not defined as one of the fundamental types, but the value of a pointer clearly has a corresponding data size so we’re covering them here.
The first thing to note about pointers is that the numeric limits required for a pointer on any given platform are determined by the size of the addressable memory on that platform.
If you have 1 GB of memory that must be accessible in 1 byte increments, then a pointer needs to be able to hold values up to ((1024 * 1024 * 1024) – 1), which is (2^30 -1) or 30 bits. 4GB is the most that can be addressed with a 32 bit value – which is why win32 systems can’t make use of more than 4GB.
For example, when compiling for win32 with VS2010, pointers are 32 bit (i.e. sizeof() ==4), and when compiling for OSX with XCode (on the Macbook Pro I use at work for iOS development) pointers are 42 bit (sizeof() ==6).
One thing that is definitely worth noting is that all data pointers produced by a given compiler will be the same size (n.b. this is not true of function pointers). The type of a pointer is, after all, a language level abstraction – under the hood they are all just a memory address. This is also why they can all be happily converted to and from void* – void* being a ‘typeless pointer’ (n.b. function pointers cannot be converted to or from void*).
That said, knowing the type of the pointer is absolutely crucial to the low level of many of the higher level language mechanisms – as we shall see in later posts.
Addendum
So, following on from a couple of the comments, I need to cover function pointers as separate from data pointers.
I made an incorrect assertion that all pointers were the same size. This is only true of data pointers.
Function pointers can be of different sizes precisely because they are not necessarily just memory addresses – in the case of multiply inherited functions or virtual functions they are typically structures.
I recommend the blog that Bryan Robertson linked me to, as it gives a concrete example of why pointer to member functions often need to be more than a memory address: