C / C++ Low Level Curriculum part 2: Data Types

Prologue

Hello and welcome to the 2nd part of the C / C++ low level curriculum series of posts that I’m currently doing.

Here’s a link to the first one if you missed it: /2011/11/09/a-low-level-curriculum-for-c-and-c/

This post is going to be a little lighter than most of the other posts in the series, primarily because this post is vying for my spare time with my urge to save a blonde girl with pointy ears from the skinny androgynous Demon Lord of extended monologue in a virtual universe powered by three equilateral triangles.

Before we continue, I’d like to quickly bring to public note a book that has now been recommended to me many times as a result of the first post: http://www1.idc.ac.il/tecs

I can’t personally vouch for it, but I fully intend to buy it and grok its face off as soon as I get some spare time in my schedule. This book looks awesome, and if it is half as good as it looks to be then reading it should be an extremely worthwhile investment of your time…

Assumptions

The next thing on my agenda is to discuss assumptions.

Assumptions are dangerous. Even by writing this I am making many assumptions – that you have a computer, that you can read and understand The Queen’s English, and that on some level you care about understanding the low-level of C++ to name but a few.

Consequently, dear reader, I feel that it’s worth mentioning what I assume about you before I go any further.

The important thing, I guess, that I should mention is that I assume that you are already familiar with and comfortable using C and/or C++. If you’re not, then I’d advise you to go and get comfortable before you read any more of this :)

Data Types?

So, again, I find myself almost instantly qualifying the title of the post and explaining what I mean when I say data types.

What I am talking about is the “Fundamental” types of C++ and what you should know about how they relate to the machine level – even this seemingly straightforward aspect of C++ is not necessarily what you would expect; especially when dealing with multiple target platforms.

Whilst this isn’t the kind of information that will suddenly improve your code by an order of magnitude, it is (in my opinion) one of the key building blocks of understanding C / C++ at the low level; as it has tonnes of potential knock on effects in terms of speed of execution, memory layout of complex types etc.

Certainly, no-one ever sat me down and explained this to me; I just sort of absorbed it or looked it up over the years.

Fundamental and Intrinsic Types

The fundamental types of C/C++ are all the types that have a language keyword.

These are not to be confused with the intrinsic types which are the types that are natively handled by some given CPU (i.e. the data types that the machine instructions of that CPU operate on).

Whenever you use new hardware you should check how the compiler for your platform is representing your fundamental types. The best way to do this is (can you guess?) to look at the disassembly window.

These days all fundamental types of C++ can be represented by an intrinsic type on most platforms, but you definitely shouldn’t take this for granted: it has only really been the case since the current console hardware generation.

There are 3 categories of fundamental type: integral, floating point, and void.

As we all know, the void type cannot be used to store values. It is used to specify “no type”.

For both integral and floating point types there is a progression of types that can hold larger values and/or have more numerical precision.

For integers this progression is (from least to most precision) char, short, int, long, long long; and for floats: float, double, long double.

Clearly, the numerical value limits that a given type must be able to store mandate a certain minimum data size for that type (i.e. number of bits needed to store the prescribed values when stored in binary).
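To make the link between value limits and data size concrete, here is a minimal sketch that works out how many bits each integer type’s range implies and prints that next to the storage the compiler actually gives it. It assumes a C++11 compiler; valueBits, printBits and printIntegerProgression are names made up for this illustration.

```cpp
#include <climits>   // CHAR_BIT
#include <cstdio>    // std::printf
#include <limits>    // std::numeric_limits

// Bits needed to hold every value of integer type T, sign bit included.
// (valueBits is a hypothetical helper name, not something from the standard.)
template <typename T>
constexpr int valueBits()
{
    return std::numeric_limits<T>::digits +
           (std::numeric_limits<T>::is_signed ? 1 : 0);
}

// Prints the bits implied by the value range next to the bits actually stored.
template <typename T>
void printBits(const char* name)
{
    std::printf("%-10s range needs %2d bits, storage is %2d bits\n",
                name, valueBits<T>(), static_cast<int>(sizeof(T) * CHAR_BIT));
}

// Walks the integer progression from the text.
void printIntegerProgression()
{
    printBits<signed char>("char");
    printBits<short>("short");
    printBits<int>("int");
    printBits<long>("long");
    printBits<long long>("long long");
}
```

Calling printIntegerProgression() from main on each platform you target is a quick way to see where the numbers differ.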

Sizes of Fundamental types

As far as I have been able to discover, the C and C++ standards make no explicit guarantee about the specific size of any of the Fundamental types.

There are, however, several key rules about the sizes of the various types which I have paraphrased below:

1. A char must be a minimum of 8 bits.
2. sizeof( char ) == 1.
3. If a pointer of type char* points at the very first address of a contiguous block of memory, then every single address in that block of memory must be traversable by simply incrementing that pointer.
4. The C standard specifies a value that each of the integer types must be able to represent (see page 33 in this .pdf of the C standard if you want the values - see the header of a standard conformant C++ implementation for details of the values used by your compiler).
5. The C++ standard says nothing about size, only that “There are five standard signed integer types : “signed char”, “short int”, “int”, “long int”, and “long long int”. In this list, each type provides at least as much storage as those preceding it in the list.” (see page 75 in this .pdf of the latest C++ standard I could find).
6. Rules 4 and 5 have similar counterparts in the C and C++ standards for the progression of floating point types.
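The rules above can be written down as compile time checks, so that a build fails loudly if a platform ever breaks one of them. This is only a sketch under the assumption of a C++11 compiler (static_assert); countBytes is a hypothetical helper, and the floating point check compares precision rather than storage because that is the part I am sure the standard promises.

```cpp
#include <climits>  // CHAR_BIT, SHRT_MAX, INT_MAX, LONG_MAX, LLONG_MAX
#include <cstddef>  // std::size_t
#include <limits>   // std::numeric_limits

// Rules 1 and 2: a char has at least 8 bits and is the unit sizeof counts in.
static_assert(CHAR_BIT >= 8, "a char must have at least 8 bits");
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

// Rule 3: every byte of an object can be reached by incrementing a char pointer.
// Hypothetical helper that counts the bytes it steps over.
template <typename T>
std::size_t countBytes(const T& object)
{
    const char* first = reinterpret_cast<const char*>(&object);
    const char* last = first + sizeof(T);
    std::size_t count = 0;
    for (const char* p = first; p != last; ++p)
    {
        ++count;
    }
    return count;
}

// Rule 4: minimum values each integer type must be able to represent.
static_assert(SHRT_MAX >= 32767, "short holds at least 16 bit values");
static_assert(INT_MAX >= 32767, "int holds at least 16 bit values");
static_assert(LONG_MAX >= 2147483647L, "long holds at least 32 bit values");
static_assert(LLONG_MAX >= 9223372036854775807LL, "long long holds at least 64 bit values");

// Rule 5: each type provides at least as much storage as the one before it.
static_assert(sizeof(signed char) <= sizeof(short), "char <= short");
static_assert(sizeof(short) <= sizeof(int), "short <= int");
static_assert(sizeof(int) <= sizeof(long), "int <= long");
static_assert(sizeof(long) <= sizeof(long long), "long <= long long");

// Rule 6: the floating point progression never loses precision.
static_assert(std::numeric_limits<float>::digits <= std::numeric_limits<double>::digits,
              "double is at least as precise as float");
static_assert(std::numeric_limits<double>::digits <= std::numeric_limits<long double>::digits,
              "long double is at least as precise as double");
```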

Helpfully, MSDN has a useful summary of this information (though it’s partly MSVC specific, it’s a good starting point).

Despite all this leeway in the standard, the size of the fundamental types across PC and current gen console platforms is (to the best of my knowledge) relatively consistent.

The C++ standard also defines bool as an integral type. It has two values, true and false, which can be implicitly converted to and from the integer values 1 and 0 respectively; and it is the result type of the comparison and logical operators (==, !=, >, <, &&, || etc.).

As far as I have been able to ascertain, the standard only specifies that bool must be able to represent a binary state. Consequently, the size of bool can vary dramatically according to compiler implementation, and even within code generated by the same compiler – I have seen it vary between 1 and 4 bytes on platforms I’ve used – I have always assumed that this was down to speed of execution vs. storage size tradeoffs.
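A quick way to see what your own compiler does with bool is to print its size, both on its own and inside a struct. The sketch below assumes a C++11 compiler; WithBools, boolToInt, intToBool and printBoolSizes are hypothetical names used only for this example, and the sizes printed will depend on your platform.

```cpp
#include <cstdio>  // std::printf

// Implicit conversions between bool and int, as described in the text.
constexpr int boolToInt(bool b) { return b; }   // true -> 1, false -> 0
constexpr bool intToBool(int i) { return i; }   // any non-zero value -> true

// Hypothetical struct showing how bool members interact with padding.
struct WithBools
{
    bool visible;
    int id;
    bool enabled;
};

// Prints the sizes chosen by this compiler; expect them to vary by platform.
void printBoolSizes()
{
    std::printf("sizeof(bool)      = %u\n", static_cast<unsigned>(sizeof(bool)));
    std::printf("sizeof(WithBools) = %u\n", static_cast<unsigned>(sizeof(WithBools)));
}
```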

This ‘size of bool’ issue resulted in bool being banned from use in complex data structures at at least one company that I have worked at. I should clarify that this was a ‘proactive’ banning based on the fact that it might cause trouble rather than one that resulted from trouble actually having been caused.

We should also mention enums at this point (thanks John!) – the standard gives the storage value of an enumerated type the liberty to vary in size depending on the range of values represented by each specific enum – even within the same codebase – so an enum with values <= 255 (or <= 256 members with no values assigned) may well have sizeof() == 1, and one which has to represent 32 bit values would typically have sizeof() == 4.
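If the size of an enum matters to you, C++11 lets you pin the underlying type yourself instead of leaving it to the compiler. The following is a sketch assuming a C++11 compiler and that the optional std::uint8_t and std::uint32_t types exist on your platform; all of the enum names are invented for the example.

```cpp
#include <cstdint>      // std::uint8_t, std::uint32_t
#include <cstdio>       // std::printf
#include <type_traits>  // std::underlying_type

// Plain enums: the compiler picks an underlying type that fits the values.
enum Colour { kRed, kGreen, kBlue };
enum Mask { kAllBits = 0xFFFFFFFF };

// C++11 enums with an explicitly chosen underlying type.
enum class Small : std::uint8_t { kFirst, kSecond };
enum class Wide : std::uint32_t { kEverything = 0xFFFFFFFFu };

// Prints what this compiler chose; the first two lines may differ by platform.
void printEnumSizes()
{
    std::printf("sizeof(Colour) = %u\n", static_cast<unsigned>(sizeof(Colour)));
    std::printf("sizeof(Mask)   = %u\n", static_cast<unsigned>(sizeof(Mask)));
    std::printf("sizeof(Small)  = %u\n", static_cast<unsigned>(sizeof(Small)));
    std::printf("sizeof(Wide)   = %u\n", static_cast<unsigned>(sizeof(Wide)));
}
```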

This brings us onto pointers. Strictly speaking pointers are not defined as one of the fundamental types, but the value of a pointer clearly has a corresponding data size so we’re covering them here.

The first thing to note about pointers is that the numeric limits required for a pointer on any given platform are determined by the size of the addressable memory on that platform.

If you have 1 GB of memory that must be accessible in 1 byte increments, then a pointer needs to be able to hold values up to ((1024 * 1024 * 1024) – 1), which is (2^30 -1) or 30 bits. 4GB is the most that can be addressed with a 32 bit value – which is why win32 systems can’t make use of more than 4GB.
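The arithmetic above is easy to check in code. This sketch (assuming a C++11 compiler; addressBits is a hypothetical helper name) computes the smallest number of bits that can hold every byte address in a block of memory of a given size.

```cpp
#include <cstdint>  // std::uint64_t

// Smallest number of bits able to hold every address in [0, byteCount - 1].
// Each step halves the range (rounding up) and counts one more bit.
constexpr int addressBits(std::uint64_t byteCount)
{
    return (byteCount <= 1) ? 0
                            : 1 + addressBits(byteCount / 2 + byteCount % 2);
}

// 1 GB needs 30 bits, 4 GB needs exactly 32, and one byte more needs 33.
static_assert(addressBits(1024ull * 1024 * 1024) == 30, "1 GB fits in 30 bits");
static_assert(addressBits(4ull * 1024 * 1024 * 1024) == 32, "4 GB fits in 32 bits");
static_assert(addressBits(4ull * 1024 * 1024 * 1024 + 1) == 33, "more than 4 GB does not");
```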

For example, when compiling for win32 with VS2010, pointers are 32 bit (i.e. sizeof() == 4), and when compiling for OSX with XCode (on the Macbook Pro I use at work for iOS development) pointers are 64 bit (sizeof() == 8).

One thing that is definitely worth noting is that all data pointers produced by a given compiler will be the same size (n.b. this is not true of function pointers). The type of a pointer is, after all, a language level abstraction – under the hood they are all just a memory address. This is also why they can all be happily converted to and from void* – void* being a ‘typeless pointer’ (n.b. function pointers cannot be converted to or from void*).
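Here is a small sketch of both points: data pointers to very different types reporting the same size, and a pointer surviving a trip through void*. Matrix, readThroughVoid and printDataPointerSizes are hypothetical names; the sizes printed are whatever your compiler uses, and equal sizes are what I would expect on mainstream platforms rather than something I can promise everywhere.

```cpp
#include <cassert>  // assert
#include <cstdio>   // std::printf

// Hypothetical large type, to show that pointee size does not affect pointer size.
struct Matrix
{
    float m[16];
};

// Converts an int* to void* and back, then reads the value through it.
int readThroughVoid(int* typed)
{
    void* typeless = typed;                         // implicit conversion to void*
    int* typedAgain = static_cast<int*>(typeless);  // explicit conversion back
    assert(typedAgain == typed);                    // still the same address
    return *typedAgain;
}

// Prints the size of a few data pointer types for comparison.
void printDataPointerSizes()
{
    std::printf("sizeof(char*)   = %u\n", static_cast<unsigned>(sizeof(char*)));
    std::printf("sizeof(Matrix*) = %u\n", static_cast<unsigned>(sizeof(Matrix*)));
    std::printf("sizeof(void*)   = %u\n", static_cast<unsigned>(sizeof(void*)));
}
```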

That said, knowing the type of the pointer is absolutely crucial to the low level of many of the higher level language mechanisms – as we shall see in later posts.

Addendum

So, following on from a couple of the comments, I need to cover function pointers as separate from data pointers.

I made an incorrect assertion that all pointers were the same size. This is only true of data pointers.

Function pointers can be of different sizes precisely because they are not necessarily just memory addresses – pointers to member functions, particularly where multiple inheritance or virtual functions are involved, are typically small structures.
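You can see this for yourself by comparing the size of a plain function pointer with that of a pointer to a member function of a multiply inherited class. This is a sketch with invented class names (Left, Right, Both); the exact sizes printed depend on your compiler and its ABI.

```cpp
#include <cassert>  // assert
#include <cstdio>   // std::printf

// Two unrelated bases with virtual functions, and a class inheriting from both.
struct Left
{
    virtual ~Left() {}
    virtual int left() { return 1; }
};

struct Right
{
    virtual ~Right() {}
    virtual int right() { return 2; }
};

struct Both : Left, Right
{
};

typedef int (*FreeFn)();          // pointer to an ordinary function
typedef int (Both::*MemberFn)();  // pointer to a member function of Both

// Calls Right::right through a Both member pointer. To do this the pointer
// has to carry enough information to find the Right part of a Both object.
int callRightThroughBoth()
{
    MemberFn fn = &Both::right;
    Both object;
    return (object.*fn)();
}

// Prints the sizes for comparison; MemberFn is often larger than the others.
void printFunctionPointerSizes()
{
    assert(callRightThroughBoth() == 2);
    std::printf("sizeof(void*)    = %u\n", static_cast<unsigned>(sizeof(void*)));
    std::printf("sizeof(FreeFn)   = %u\n", static_cast<unsigned>(sizeof(FreeFn)));
    std::printf("sizeof(MemberFn) = %u\n", static_cast<unsigned>(sizeof(MemberFn)));
}
```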

I recommend the blog that Bryan Robertson linked me to, as it gives a concrete example of why pointer to member functions often need to be more than a memory address:

C++ Fundamental types (win32 compiled on Windows 7 with VS2010)

My home machine is a 64 bit intel thing of some description, about a year old.

Since the processor is 64 bit, I’d hope that all of these sizes correspond to intrinsic types (8 bytes being the size of a 64 bit CPU register), however since I’m compiling for win32 (which can only fit 4 bytes in a standard CPU register) I’m guessing that it won’t be using intrinsics for types > 32 bit.

[Figure: disassembly of adding 2 long long values and storing the result in a 3rd long long]

In any event, I can’t be sure without looking at the disassembly.

<…pause to add some simple test code with long long and run it…>

Sure enough, these 8 byte long long values are being handled as 2 32bit values.

Ignoring the actual addition, you can clearly see this because the code initialising llTest and llTest2 is setting them in two separate steps for the upper and lower 32 bits of the 64 bit values.
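If you want to repeat the experiment, something along these lines should do. To be clear, this is a sketch of the sort of test code described and not the original: llTest and llTest2 are the names mentioned in the text, while llResult, lowHalf and highHalf are made up here. The volatile qualifiers are there so the optimiser leaves the loads and stores in place for you to look at.

```cpp
#include <cstdint>  // std::uint32_t, std::uint64_t

// Adds two long long values and stores the result in a third.
// Put a breakpoint in here and look at the disassembly window.
long long addLongLongs()
{
    volatile long long llTest = 0x1111111122222222LL;
    volatile long long llTest2 = 0x3333333344444444LL;
    volatile long long llResult = llTest + llTest2;
    return llResult;
}

// The two 32 bit halves that a 32 bit target has to handle separately.
constexpr std::uint32_t lowHalf(long long value)
{
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(value) & 0xFFFFFFFFu);
}

constexpr std::uint32_t highHalf(long long value)
{
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(value) >> 32);
}
```

The constants are chosen so that the upper and lower halves are easy to tell apart when they show up as immediate values in the disassembly.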

So now I know, and it wasn’t even scary – really I should go and check the rest of them…

Fancy Intrinsics

Most modern CPUs have fancy intrinsics – e.g. 128 bit vector registers that can store and operate on four-32bit-floats-in-one-value sort of stuff.

In theory these sorts of extra intrinsics can provide big wins in certain situations – e.g. heavy duty chunks of vector maths, or non vector maths that can be parallelised into vectors.

The chances are that your compiler won’t ever use these without you asking it nicely. There are plenty of good reasons why this is the case (apparently), but you should find that support for these hardware specific intrinsics will be mentioned in your hardware / compiler manuals.
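As an illustration of asking nicely, here is a sketch that adds four floats in one go using the SSE intrinsics from xmmintrin.h when they are available, and falls back to a plain loop when they are not. It assumes an x86 style target for the fast path; EXAMPLE_HAVE_SSE, addFour and checkAddFour are names invented for this example, and other hardware will have its own header and intrinsic names.

```cpp
#include <cassert>  // assert

// Use SSE where the compiler says it is available (GCC/Clang or MSVC macros).
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define EXAMPLE_HAVE_SSE 1
#else
#define EXAMPLE_HAVE_SSE 0
#endif

// Adds four floats from a to four floats from b and writes the sums to out.
void addFour(const float* a, const float* b, float* out)
{
#if EXAMPLE_HAVE_SSE
    const __m128 va = _mm_loadu_ps(a);       // load 4 floats (no alignment needed)
    const __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  // one instruction adds all 4 lanes
#else
    for (int i = 0; i < 4; ++i)              // scalar fallback for other targets
    {
        out[i] = a[i] + b[i];
    }
#endif
}

// Checks the result against ordinary scalar addition.
void checkAddFour()
{
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    addFour(a, b, out);
    for (int i = 0; i < 4; ++i)
    {
        assert(out[i] == a[i] + b[i]);
    }
}
```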

Summary

So, what would I like you to take away from this?

Firstly, that there is a difference between the data types of the C++ language and the hardware data types.

Secondly, don’t just trust that your compiler is doing what would intuitively seem sensible to you. Check its work.

Thirdly, it’s not rocket science! You can find out by just modifying one of the sample programs for your new hardware and then looking at the disassembly in the debugger.

Finally, thought I might insert a few points of note here:

Almost all CPUs have 8 bit bytes. Any CPU with more than 8 bits per byte was probably designed by a maniac / genius (n.b. I find that there is a particularly fine line between the two in Computer Science circles).
One thing you need to watch out for with numerical types is that in the C standard, int and short have the same minimum numerical limits (unsigned int and unsigned short must both be able to hold at least 0xFFFF, i.e. 16 bits). I’ve never had a problem with it, but an int could legitimately be represented as 16 bit.
If you want to know the size of any given type just use the sizeof operator. Your compiler knows these things.
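Where the exact width matters (file formats, network packets and so on), one option is to use the fixed width types from cstdint and let sizeof confirm your assumptions at compile time. This is a sketch assuming a C++11 compiler on a platform that provides the optional exact width types; bitsOf and PackedRecord are hypothetical names.

```cpp
#include <climits>  // CHAR_BIT
#include <cstdint>  // std::int16_t, std::int32_t, std::uint32_t

// Storage size of T in bits, straight from the sizeof operator.
template <typename T>
constexpr int bitsOf()
{
    return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// int is only promised to be at least 16 bits; the cstdint types are exact.
static_assert(bitsOf<int>() >= 16, "int may be as small as 16 bits");
static_assert(bitsOf<std::int16_t>() == 16, "int16_t is exactly 16 bits");
static_assert(bitsOf<std::int32_t>() == 32, "int32_t is exactly 32 bits");

// Hypothetical record whose members have the same width on every platform.
struct PackedRecord
{
    std::uint32_t id;
    std::int16_t x;
    std::int16_t y;
};

// The members alone account for 8 bytes; padding could only ever add to that.
static_assert(sizeof(PackedRecord) * CHAR_BIT >= 64, "members need at least 64 bits");
```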

Epilogue

If you are hungry for more information on this level (i.e. fundamental and intrinsic types) I recommend searching #AltDevBlogADay, because there are loads to choose from…

Here are a few of the articles I found when doing a quick search (apologies to those whose articles I missed as a result of less than thorough searching!):

/2011/08/21/practical-flt-point-tricks/

/2011/08/06/demise-low-level-programmer/

/2011/11/10/optimisation_lessons/

Alex Darby (@darbotron)