Tricks With the Floating-Point Format

Years ago I wrote an article about how to do epsilon floating-point comparisons by using integer comparisons. That article has been quite popular (it is frequently cited, and the code samples have been used by a number of companies) and this worries me a bit, because the article has some flaws. I’m not going to link to the article because I want to replace it, not send people looking for it.

Today I am going to start laying the groundwork for explaining how and why this trick works, while also exploring the weird and wonderful world of floating-point math.

There are lots of references that explain the layout and decoding of floating-point numbers. In this post I am going to supply the layout, and then show how to reverse engineer the decoding process through experimentation.

The layout comes from the IEEE 754 standard. The 2008 version of the standard adds new formats but doesn’t change the existing ones, which have been standardized for over 25 years.

A 32-bit float consists of a one-bit sign field, an eight-bit exponent field, and a twenty-three-bit mantissa field. The union below shows the layout of a 32-bit float. This union is very useful for exploring and working with the internals of floating-point numbers. I don’t recommend using this union for production coding (it is a violation of the aliasing rules for some compilers, and will probably generate inefficient code), but it is useful for learning.

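#include <stdint.h>

union Float_t
{
    int32_t i;    // the raw bits, viewed as a signed integer
    float f;      // the same bits, viewed as a float
    struct
    {
        // Lowest bits first; this field order matches Visual C++ on x86 and x64.
        uint32_t mantissa : 23;
        uint32_t exponent : 8;
        uint32_t sign : 1;
    } parts;
};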

The format for 32-bit float numbers was carefully designed to allow them to be put in a union with an integer, and the aliasing of ‘i’ and ‘f’ should work on all platforms whose compilers allow aliasing through unions (gcc and VC++ both do), with the sign bit of the integer and the float occupying the same location.

The layout of bitfields is compiler dependent so the bitfield struct that is also in the union may not work on all platforms. However it works on Visual C++ on x86 and x64, which is good enough for my exploratory purposes.
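If the union or the bitfields don’t behave on your compiler, the same fields can be pulled out with memcpy plus shifts and masks. This is only a minimal sketch, assuming a 32-bit float; the names FloatFields, float_to_bits and split_float are hypothetical helpers invented for this illustration and are not part of the union above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a 32-bit float. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, exactly as stored in the field */
    uint32_t mantissa;  /* 23 bits, without any implied leading one */
} FloatFields;

/* Copy the float's bytes into an integer without any aliasing tricks. */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* assumption: float is 32 bits wide */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Split the representation with shifts and masks instead of bitfields. */
static FloatFields split_float(float f)
{
    uint32_t bits = float_to_bits(f);
    FloatFields parts;
    parts.sign     = bits >> 31;
    parts.exponent = (bits >> 23) & 0xFF;
    parts.mantissa = bits & 0x7FFFFF;
    return parts;
}
```

Calling split_float(0.2f) should give an exponent field of 124 and a mantissa field of 0x4CCCCD. Because the shifts work on the integer value rather than on memory layout, this version does not depend on how a compiler orders bitfields.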

In order to really understand floats, it is important to explore and experiment. One way to explore is to write code like this, in a debug build so that the compiler doesn’t optimize it away:

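// Needs <stdio.h> and the Float_t union; put these statements inside a function.
Float_t num;
num.f = 1.0f;
num.i -= 1;    // step down to the float just below 1.0
printf("Float value, representation, sign, exponent, mantissa\n");
for (;;)       // loops forever on purpose: break on the printf and edit num
{
    printf("%1.8e, 0x%08X, %d, %d, 0x%06X\n",
                num.f, (unsigned)num.i,
                (int)num.parts.sign, (int)num.parts.exponent,
                (unsigned)num.parts.mantissa);
}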

Put a breakpoint on the ‘printf’ statement and then add the various components of num to your debugger’s watch window and examine them.

You can then start trying interactive experiments, such as incrementing the mantissa or exponent fields, incrementing num.i, or toggling the value of the sign field. As you do this you should watch num.f to see how it changes. Or, assign various floating-point values to num.f and see how the other fields change. You can either view the results in the debugger’s watch window, or hit ‘Run’ after each change so that the printf statement executes and prints some nicely formatted results.
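If you would rather script one of these experiments than poke at it in the debugger, here is a rough sketch of the “increment num.i” experiment. It assumes a 32-bit float and a positive, finite starting value; step_representation and the two conversion helpers are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy a float's bytes into an integer (assumes a 32-bit float). */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Copy an integer's bytes back into a float. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Add delta to the integer representation, as incrementing num.i would.
   Only meaningful here for positive, finite inputs. */
static float step_representation(float f, int32_t delta)
{
    uint32_t bits = float_to_bits(f);
    bits += (uint32_t)delta;   /* unsigned math wraps, so negative deltas subtract */
    return bits_to_float(bits);
}
```

Stepping 2.0f by -1 should land on the representation 0x3FFFFFFF, and stepping 0.0f by +1 on 0x00000001. Stepping the largest finite float (0x7F7FFFFF) by +1 gives 0x7F800000.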

Go ahead. Put Float_t and the sample code into a project and play around with it for a few minutes. Discover the minimum and maximum float values. Experiment with the minimum and maximum mantissa values in various combinations. Think about the implications. This is the best way to learn. I’ll wait.

I’ve put some of the results that you might encounter during this experimentation into the table below:

Float value	Integer representation	Sign	Exponent field	Mantissa field
0.0	0x00000000	0	0	0
1.40129846e-45	0x00000001	0	0	1
1.17549435e-38	0x00800000	0	1	0
0.2	0x3E4CCCCD	0	124	0x4CCCCD
1.0	0x3F800000	0	127	0
1.5	0x3FC00000	0	127	0x400000
1.75	0x3FE00000	0	127	0x600000
1.99999988	0x3FFFFFFF	0	127	0x7FFFFF
2.0	0x40000000	0	128	0
16,777,215	0x4B7FFFFF	0	150	0x7FFFFF
3.40282347e+38	0x7F7FFFFF	0	254	0x7FFFFF
Positive infinity	0x7F800000	0	255	0

With this information we can begin to understand the decoding of floats. Floats use a base-two exponential format so we would expect the decoding to be mantissa * 2^exponent. However in the encodings for 1.0 and 2.0 the mantissa is zero, so how can this work? It works because of a clever trick. Normalized numbers in base-two scientific notation are always of the form 1.xxxx*2^exp, so storing the leading one is not necessary. By omitting the leading one we get an extra bit of precision: the 23-bit field of a float actually manages to hold 24 bits of precision because there is an implied ‘one’ bit with a value of 0x800000.

The exponent for 1.0 should be zero but the exponent field is 127. That’s because the exponent is stored in excess 127 form. To convert from the value in the exponent field to the value of the exponent you simply subtract 127.

The two exceptions to this exponent rule are when the exponent field is 255 or zero. 255 is a special exponent value that indicates that the float is either infinity or a NaN (not-a-number), with a zero mantissa indicating infinity. Zero is a special exponent value that indicates that there is no implied leading one, meaning that these numbers are not normalized. This is necessary in order to exactly represent zero. The exponent value in that case is -126, which is the same as when the exponent field is one.

To clarify the exponent rules I’ve added an “Exponent value” column which shows the actual binary exponent implied by the exponent field:

Float value	Integer representation	Sign	Exponent field	Exponent value	Mantissa field
0.0	0x00000000	0	0	-126	0
1.40129846e-45	0x00000001	0	0	-126	1
1.17549435e-38	0x00800000	0	1	-126	0
0.2	0x3E4CCCCD	0	124	-3	0x4CCCCD
1.0	0x3F800000	0	127	0	0
1.5	0x3FC00000	0	127	0	0x400000
1.75	0x3FE00000	0	127	0	0x600000
1.99999988	0x3FFFFFFF	0	127	0	0x7FFFFF
2.0	0x40000000	0	128	1	0
16,777,215	0x4B7FFFFF	0	150	23	0x7FFFFF
3.40282347e+38	0x7F7FFFFF	0	254	127	0x7FFFFF
Positive infinity	0x7F800000	0	255	Infinite!	0

Although these examples don’t show it, negative numbers are dealt with by setting the sign field to 1, which is called sign-and-magnitude form. All numbers, even zero and infinity, have negative versions.
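To see sign-and-magnitude in action, you can flip just the top bit and watch what happens. The sketch below assumes a 32-bit float; flip_sign and the conversion helpers are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy a float's bytes into an integer (assumes a 32-bit float). */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Copy an integer's bytes back into a float. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Toggle only the sign bit; the exponent and mantissa fields are untouched. */
static float flip_sign(float f)
{
    return bits_to_float(float_to_bits(f) ^ 0x80000000u);
}
```

flip_sign(1.5f) compares equal to -1.5f, and flip_sign(0.0f) has the representation 0x80000000 yet still compares equal to 0.0f. Flipping the sign of positive infinity (0x7F800000) gives 0xFF800000, which is negative infinity.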

The numbers in this chart were chosen in order to demonstrate various things:

0.0: It’s handy that zero is represented by all zeroes. However there is also a negative zero which has the sign bit set. Negative zero is equal to positive zero.
1.40129846e-45: This is the smallest positive float, and its integer representation is the smallest positive integer.
1.17549435e-38: This is the smallest float with an implied leading one, the smallest number with a non-zero exponent, the smallest normalized float. This number is also FLT_MIN. Note that FLT_MIN is not the smallest float. There are actually about 8 million positive floats smaller than FLT_MIN.
0.2: This is an example of one of the many decimal numbers that cannot be precisely represented with a binary floating-point format. That mantissa wants to repeat ‘C’ forever.
1.0: Note the exponent and the mantissa, and memorize the integer representation in case you see it in hex dumps.
1.5, 1.75: Just a couple of slightly larger numbers to show the mantissa changing while the exponent stays the same.
1.99999988: This is the largest float that has the same exponent as 1.0, and the largest float that is smaller than 2.0.
2.0: Notice that the exponent is one higher than for 1.0, and the integer representation and exponent are one higher than for 1.99999988.
16,777,215: This is the largest odd float. The next larger float has an exponent value of 24, which means the mantissa is shifted enough left that odd numbers are impossible. Note that this means that above 16,777,216 a float has less precision than an int.
3.40282347e+38: FLT_MAX. The largest finite float, with the maximum finite exponent and the maximum mantissa.
Positive infinity: The papa bear of floats.
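A couple of the claims in this list are easy to check in code. The following is only a sketch under the assumption of a 32-bit IEEE float and the default round-to-nearest mode; positive_floats_below and through_float are hypothetical helpers invented for this example.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Copy a float's bytes into an integer (assumes a 32-bit float). */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* For a positive float, each smaller positive float has a smaller integer
   representation, so the count of them is the representation minus one
   (the minus one excludes zero itself). */
static uint32_t positive_floats_below(float f)
{
    assert(f > 0.0f);
    return float_to_bits(f) - 1u;
}

/* Convert an integer to float and back to see whether it survives.
   Keep n small enough that the result still fits in an int32_t. */
static int32_t through_float(int32_t n)
{
    volatile float f = (float)n;   /* volatile discourages extra precision */
    return (int32_t)f;
}
```

positive_floats_below(FLT_MIN) comes out as 8,388,607, which is the “about 8 million” mentioned above. through_float(16777215) returns its input unchanged, while through_float(16777217) returns 16777216 because odd values can no longer be represented at that size.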

We can now describe how to decode the float format:

If the exponent field is 255 then the number is infinity (if the mantissa is zero) or a NaN (if the mantissa is non-zero)
If the exponent field is from 1 to 254 then the exponent is between -126 and 127, there is an implied leading one, and the float’s value is:

(1.0 + mantissa-field / 0x800000) * 2^(exponent-field-127)

If the exponent field is zero then the exponent is -126, there is no implied leading one, and the float’s value is:

(mantissa-field / 0x800000) * 2^-126

If the sign bit is set then negate the value of the float
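The decoding rules above can be sketched as a small function that works only on the integer representation. This is an illustration rather than production code: decode_float_bits and decoder_agrees are hypothetical names, the result is returned as a double (which can hold every float value exactly), and math.h is included only for the INFINITY and NAN macros.

```c
#include <assert.h>
#include <math.h>     /* INFINITY and NAN macros only */
#include <stdint.h>
#include <string.h>

/* Decode a 32-bit float representation by following the rules in the text. */
static double decode_float_bits(uint32_t bits)
{
    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF;
    uint32_t mantissa = bits & 0x7FFFFF;
    double value;
    int power;

    if (exponent == 255) {
        /* Zero mantissa means infinity, anything else is a NaN. */
        value = (mantissa == 0) ? INFINITY : NAN;
        return sign ? -value : value;
    }
    if (exponent == 0) {
        value = mantissa / 8388608.0;        /* 0x800000, no implied one */
        power = -126;
    } else {
        value = 1.0 + mantissa / 8388608.0;  /* implied leading one */
        power = (int)exponent - 127;         /* remove the excess-127 bias */
    }
    /* Scale by 2^power one step at a time; every step is exact in a double. */
    for (; power > 0; --power) value *= 2.0;
    for (; power < 0; ++power) value *= 0.5;
    return sign ? -value : value;
}

/* Compare the hand decoding against the value the hardware reports. */
static int decoder_agrees(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* assumption: float is 32 bits wide */
    memcpy(&bits, &f, sizeof bits);
    return decode_float_bits(bits) == (double)f;
}
```

decode_float_bits(0x3FE00000) gives 1.75 and decode_float_bits(0x4B7FFFFF) gives 16777215.0, matching the table. decoder_agrees returns 1 for any finite float where the hand decoding matches the hardware, and always returns 0 for a NaN because NaNs never compare equal.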

The excess-127 exponent and the omitted leading one lead to some very convenient characteristics of floats, but I’ve rambled on too long so those must be saved for the next post, in a fortnight*1.0714.

#AltDevBlog

Bruce-Dawson
