Float Precision–From Zero to 100+ Digits

Instapaper Text

Float Precision–From Zero to 100+ Digits

How much precision does a float have? It depends on the float, and it depends on what you mean by precision. Typical reasonable answers range from 6-9 decimal digits, but it turns out that you can make a case for anything from zero to over one hundred digits.

In all cases in this article when I talk about precision, or about how many digits it takes to represent a float, I am talking about mantissa digits. When printing floating-point numbers you often also need a couple of +/- characters, an ‘e’, and a few digits for the exponent, but I’m just going to focus on the mantissa.

Previously on this channel…

If you’re just joining us then you may find it helpful to read some of the earlier posts in this series. The first one is the most important since it gives an overview of the layout and interpretation of floats, which is helpful to understand this post.

1: Tricks With the Floating-Point Format – an overview of the float format
2: Stupid Float Tricks – incrementing the integer representation
3: Don’t Store That in a Float – a cautionary tale about time
3b: They sure look equal… – special bonus post (not on altdevblogaday), also about precision
4: Comparing Floating Point Numbers, 2012 Edition
5: Float Precision—From Zero to 100+ Digits (return *this;)

What precision means

For most of our purposes when we say that a format has n-digit precision we mean that over some range, typically [10^k, 10^(k+1)), where k is an integer, all n-digit numbers can be uniquely identified. For instance, from 1.000000e6 to 9.999999e6 if your number format can represent all numbers with seven digits of mantissa precision then you can say that your number format has seven digit precision over that range, where ‘k’ is 6.

Similarly, from 1.000e-1 to 9.999e-1 if your number format can represent all the numbers with four digits of precision then you can say that your number format has four digit precision over that range, where ‘k’ is -1.

Your number format may not always be able to represent each number in such a range precisely (0.1 being a tired example of a number that cannot be exactly represented as a float) but to have n-digit precision we must have a number that is closer to each number than to either of its n-digit neighbors.

This definition of precision is similar to the concept of significant figures in science. The number 1.03e4 and 9.87e9 are both presumed to have three significant figures, or three digits of precision.

Wasted digits and wobble

The “significant figures” definition of precision is sometimes necessary, but it’s not great for numerical analysis where we are more concerned about relative error. The relative error in, let’s say a three digit decimal number, varies widely. If you add one to a three digit number then, depending on whether the number is 100 or 998, it may increase by 1%, or by barely 0.1%.

If you take an arbitrary real number from 99.500… to 999.500… and assign it to a three digit decimal number then you will be forced to round the number up or down by up to half a unit in the last place, or 0.5 decimal the wobble.

Wobble also affects binary numbers, but to a lesser degree. The relative precision available from a fixed number of binary digits varies depending on whether the leading digits are 10000 or 11111. Unlike base ten where the relative precision can be almost ten times lower for numbers that start with 10000, the relative precision for base two only varies by a factor of two (again, the base).

In more concrete terms, the wobble in the float format means that the relative precision of a float is usually between 1/8388608 and 1/16777216.

The minimized wobble of binary floating-point numbers and the more consistent accuracy this leads to is one of the significant advantages of binary floating-point numbers over larger bases.

This variation in relative precision is important later on and can mean that we ‘waste’ almost an entire digit, binary or decimal, when converting numbers from the other base.

Subnormal precision: 0-5 digits

Float numbers normally have fairly consistent precision, but in some cases their precision is significantly lower – as little as zero digits. This happens with denormalized, or ‘subnormal’, numbers. Most float numbers have an implied leading one that gives them 24 bits of mantissa. However, as discussed in my first post, floats with the exponent set to zero necessarily have no implied leading one. This means that their mantissa has just 23 bits, they are not normalized, and hence they are called subnormals. If enough of the leading bits are zero then we have as little as one bit of precision.

As an example consider the smallest positive non-zero float. This number’s integer representation is 0×00000001 and its value is 2^-149, or approximately 1.401e-45f. This value comes from its exponent (-126) and the fact that its one non-zero bit in its mantissa is 23 bits to the right of the mantissa’s binary point. All subnormal numbers have the same exponent (-126) so they are all multiples of this number.

The binary exponent in a float varies from -126 to 127

Since the floats in the range with decimal exponent -45 (subnormals all of them) are all multiples of this number their mantissas are (roughly) 1.4, 2.8, 4.2, 5.6, 7.0, 8.4, and 9.8. If we print them to one-digit of precision then we get (ignoring the exponent, which is -45) 1, 3, 4, 6, 7, 8, and 10. Since 2, 5, and 9 are missing that means that we don’t even have one digit of precision!

Since all subnormal numbers are multiples of 1.401e-45f, subsequent ranges each have one additional digit of precision. Therefore the ranges with decimal exponents -45, -44, -43, -42, -41, and -40 have 0, 1, 2, 3, 4, and 5 digits of precision.

Normal precision

Normal floats have a 24-bit mantissa and greater precision than subnormals. We can easily calculate how many decimal digits the 24-bit mantissa of a float is equivalent to: 24*LOG(2)/LOG(10) which is equal to about 7.225. But what does 7.225 digits actually mean? It depends whether you are concerned about how many digits you can rely on, or how many digits you need.

Representing decimals: 6-7 digits

Our definition of n-digit precision is being able to represent all n-digit numbers over a range [10^k, 10^(k+1)). There are about 28 million floats in any such (normalized) range, which is more than enough for seven digits of precision, but they are not evenly distributed, with the density being much higher at the bottom of the range. Sometimes there are not enough of them near the top of the range to uniquely identify all seven digit numbers.

In some ranges the exponent lines up such that we may (due to the wobble issues mentioned at the top) waste almost a full bit of precision, which is equivalent to ~0.301 decimal digits (log(2)/log(10)), and therefore we have only ~6.924 digits. In these cases we don’t quite have seven digits of precision.

I wrote some quick-and dirty code that scans through various ranges with ‘k’ varying from -37 to 37 to look for these cases.

FLT_MIN (the smallest normalized float) is about 1.175e-38F, FLT_MAX is about 3.402e+38F

My test code calculates the desired 7-digit number using double precision math, assigns it to a float, and then prints the float and the double to 7 digits of precision. The printing is assumed to use correct rounding, and if the results from the float and the double don’t match then we know we have a number that cannot be uniquely identified/represented as a float.

Across all two billion or so positive floats tested I measured 784,757 seven-digit numbers that could not be uniquely identified, or about 0.04% of the total. For instance, from 1.000000e9 to 8.589972e9 was fine, but from there to 9.999999e9 there were 33,048 7-digit numbers that could not be represented. It’s a bit subtle, but we can see what is happening if we type some adjacent 7-digit numbers into the watch window, cast them to floats, and then cast them to double so that the debugger will print their values more precisely:

One thing to notice (in the Value column) is that none of the numbers can be exactly represented as a float. We would like the last three digits before the decimal point to all be zeroes, but that isn’t possible because at this range all floats are a multiple of 1,024. So, the compiler/debugger/IEEE-float does the best it can. In order to get seven digits of precision at this range we need a new float every 1,000 or better, but the floats are actually spaced out every 1,024. Therefore we end up missing 24 floats for each set of 1,024. In the ‘Value’ column we can see that the third and fourth numbers actually map to the same float, shown circled below:

One was rounded down, and the other was rounded up, but they were both rounded to the closest float available.

At 8.589930e9 a float’s relative precision is 1/16777216 but at 8.589974e9 it is just 1/8388608

This issue doesn’t happen earlier in this range because below 8,589,934,592 (2^33) the float exponent is smaller and therefore the precision is greater – immediately below 2^33 the representable floats are spaced just 512 units apart. The loss of decimal precision always happens late in the range because of this.

My test code showed me that this same sort of thing happens any time that the effective exponent of the last bit of the float (which is the exponent of the float minus 23) is -136, -126, -93, -83, -73, -63, -53, -43, -33, 10, 20, 30, 40, 50, 60, 70, or 103. Calculate two to those powers if you really want to see the pattern. This corresponds to just six digit precision in the ranges with decimal exponents -35, -32, -22, -19, -16, -13, -10, -7, -4, 9, 12, 15, 18, 21, 24, 27, and 37.

Therefore, over most ranges a float has (just barely) seven decimal digits of precision, but over 17 of the 75 ranges tested a float only has six.

Representing floats: 8-9 digits

The flip side of this question is figuring out how many decimal digits it takes to uniquely identify a float. Again, we aren’t concerned here with converting the exact value of the float to a decimal (we’ll get to that), but merely having enough digits to uniquely identify a particular float.

In this case it is the fuzziness of the decimal representation that can bite us. For some exponent ranges we may waste almost a full decimal digit. That means that instead of requiring ~7.225 digits to represent all floats we would expect that sometimes we would actually need ~8.225. Since we can’t use fractional digits we actually need nine in these cases. As explained in a previous post this happens about 30% of the time, which seems totally reasonable given our calculations. The rest of the time we need eight digits to uniquely identify a particular float. Use 9 to play it safe.

printf(“%1.8e”, f); ensures that a float will round-trip to decimal and back

Precisely printing float: 10-112 digits

There is one final possible meaning of precision that we can apply. It turns out that while not all decimal numbers can be exactly represented in binary (0.1 is an infinitely repeating binary number) we can exactly represent all binary numbers in decimal. That’s because 1/2 can be represented easily as 5/10, but 1/10 cannot be represented in binary.

It’s interesting to see what happens to the decimal representation of binary numbers as powers of two get smaller:

Binary_exponent	Decimal_value
-1	0.5
-2	0.25
-3	0.125
-4	0.0625
-5	0.03125
-6	0.015625
-7	0.0078125
-8	0.00390625
-9	0.001953125

Each time we decrease the exponent by one we have to add a digit one place farther along. We gradually acquire some leading zeroes, so the explosion in digits isn’t quite one-for-one, but it’s close. The number of mantissa digits needed to exactly print the value of a negative power of two is about N-floor(N*log(2)/log(10)), or ceil(N*(1-log(2)/log(10))) where N is an integer representing how negative our exponent is. That’s about 0.699 digits each time we decrement the binary exponent. The smallest power-of-two we can represent with a float is 2^-149. That comes from having just the bottom bit set in a subnormal. The exponent of subnormals floats is -126 and the position of the bit means it is 23 additional spots to the right and 126-23 = 149. We should therefore expect it to take about 105 digits to print that smallest possible float. Let’s see:

1.401,298,464,324,817,070,923,729,583,289,916,131,280,261,941,876,515,771,757,068,283,889,

791,082,685,860,601,486,638,188,362,121,582,031,25e-45

For those of you counting at home that is exactly 105 digits. It’s a triumph of theory over practice.

That’s not quite the longest number I could find. A subnormal with a mantissa filled up with ones will have seven fewer leading zeroes leading to a whopping 112 digit decimal mantissa:

1.175,494,210,692,441,075,487,029,444,849,287,348,827,052,428,745,893,333,857,174,530,571,

588,870,475,618,904,265,502,351,336,181,163,787,841,796,875e-38

Pow! Bam!

While working on this I found a bug in the VC++ CRT. pow(2.0, -149) fits perfectly in a float – albeit just barely – it is the smallest float possible. However if I pass 2.0f instead of 2.0 I find that pow(2.0f, -149) gives an answer of zero. So does pow(2.0f, -128). If you go (float)pow(2.0, -149), invoking the double precision version of the function and then casting to float, then it works. So does pow(0.5, 149).

Perversely enough powf(2.0f, -149) works. That’s because it expands out to (float)pow(double(2.0f), double(-149)).

Conveniently enough the version of pow that takes a float and an int is in the math.h header file so it’s easy enough to find the bug. The function calculates pow(float, -N) as 1/powf(float, N). The denominator overflows when N is greater than 127, giving an infinite result whose reciprocal is zero. It’s easy enough to work around, and will be noticed by few, but is still unfortunate. pow() is one of the messier functions to make both fast and accurate.

How do you print that?

The VC++ CRT, regrettably, refuses to print floats or doubles with more than sin(pi) trick explained last time.. So, we’ll need to roll our own.

Printing binary floating-point numbers efficiently and accurately is hard. In fact, when the IEEE spec was first ratified it was not yet a solved problem. But for expository purposes we don’t care about efficiency, so the problem is greatly simplified.

It turns out that any float can be represented as a fixed-point number with 128 bits in the integer part and 149 bits in the fractional part, which we can summarize as 128.149 format. We can determine this by noting that a float’s mantissa is a 1.23 fixed-point number. The maximum float exponent is 127, which is equivalent to shifting the mantissa left 127 positions. The minimum float exponent is -126, which is equivalent to shifting the mantissa right 126 positions.

shift up to 127 positions this way <— 1.000 000 000 000 000 000 000 00 —> shift up to 126 positions this way

Those shift amounts of our 1.23 mantissa mean that all floats can fit into a 128.149 fixed-point number, for a total of 277 bits.

All we need to do is create this number, by pasting the mantissa (with or without the implied leading one) into the correct location, and then convert the large fixed-point number to decimal.

Converting to decimal is done by two main steps. The integer portion is converted by repeatedly dividing it by ten and accumulating the remainders as digits (which must be reversed before using). The fractional part is converted by repeatedly multiply by ten and accumulating the overflow as digits. Simple. All you need is a simple high-precision math library and we’re sorted. There’s also some special-case checks for infinity, NaNs (print them however you want), negatives, and denormals, but it’s mostly quite straightforward. Here’s the conversion code:

/* See
 
  http://randomascii.wordpress.com/2012/01/11/tricks-with-the-floating-point-format/
 
  for the potential portability problems with the union and bit-fields below.
 
  */
 
  union Float_t
 
  {
 
      Float_t(float num = 0.0f) : f(num) {}
 
      // Portable extraction of components.
 
      bool Negative() const { return (i >> 31) != 0; }
 
      int32_t RawMantissa() const { return i & ((1 << 23) - 1); }
 
      int32_t RawExponent() const { return (i >> 23) & 0xFF; }
 
   
 
      int32_t i;
 
      float f;
 
  #ifdef _DEBUG
 
      struct
 
      {   // Bitfields for exploration. Do not use in production code.
 
          uint32_t mantissa : 23;
 
          uint32_t exponent : 8;
 
          uint32_t sign : 1;
 
      } parts;
 
  #endif
 
  };
 
   
 
  // Simple code to print a float. Any float can be represented as a fixed point
 
  // number with 128 bits in the integer part and 149 bits in the fractional part.
 
  // We can convert these two parts to decimal with repeated division/multiplication
 
  // by 10.
 
  std::string PrintFloat(float f)
 
  {
 
      // Put the float in our magic union so we can grab the components.
 
      union Float_t num(f);
 
   
 
      // Get the character that represents the sign.
 
      const std::string sign = num.Negative() ? "-" : "+";
 
   
 
      // Check for NaNs or infinity.
 
      if (num.RawExponent() == 255)
 
      {
 
          // Check for infinity
 
          if (num.RawMantissa() == 0)
 
              return sign + "infinity";
 
   
 
          // Otherwise it's a NaN.
 
          // Print the mantissa field of the NaN.
 
          char buffer[30];
 
          sprintf_s(buffer, "NaN%06X", num.RawMantissa());
 
          return sign + buffer;
 
      }
 
   
 
      // Adjust for the exponent bias.
 
      int exponentValue = num.RawExponent() - 127;
 
      // Add the implied one to the mantissa.
 
      int mantissaValue = (1 << 23) + num.RawMantissa();
 
      // Special-case for denormals - no special exponent value and
 
      // no implied one.
 
      if (num.RawExponent() == 0)
 
      {
 
          exponentValue = -126;
 
          mantissaValue = num.RawMantissa();
 
      }
 
   
 
      // The first bit of the mantissa has an implied value of one and this can
 
      // be shifted 127 positions to the left, so that is 128 bits to the left
 
      // of the binary point, or four 32-bit words for the integer part.
 
      HighPrec<4> intPart;
 
      // When our exponentValue is zero (a number in the 1.0 to 2.0 range)
 
      // we have a 24-bit mantissa and the implied value of the highest bit
 
      // is 1. We need to shift 9 bits in from the bottom to get that 24th bit
 
      // into the ones spot in the integral portion, plus the shift from the exponent.
 
      intPart.InsertLowBits(mantissaValue, 9 + exponentValue);
 
   
 
      std::string result;
 
      // Always iterate at least once, to get a leading zero.
 
      do
 
      {
 
          int remainder = intPart.DivReturnRemainder(10);
 
          result += '0' + remainder;
 
      } while (!intPart.IsZero());
 
   
 
      // Put the digits in the correct order.
 
      std::reverse(result.begin(), result.end());
 
   
 
      // Add on the sign and the decimal point.
 
      result = sign + result + '.';
 
   
 
      // We have a 23-bit mantissa to the right of the binary point and this
 
      // can be shifted 126 positions to the right so that's 149 bits, or
 
      // five 32-bit words.
 
      HighPrec<5> frac;
 
      // When exponentValue is zero we want to shift 23 bits of mantissa into
 
      // the fractional part.
 
      frac.InsertTopBits(mantissaValue, 23 - exponentValue);
 
      while (!frac.IsZero())
 
      {
 
          int overflow = frac.MulReturnOverflow(10);
 
          result += '0' + overflow;
 
      }
 
   
 
      return result;
 
  }

Converting to scientific notation and adding digit grouping is left as an exercise for the reader. A Visual C++ project that includes the missing HighPrec class and code for printing doubles can be obtained at:

ftp://ftp.cygnus-software.com/pub/PrintFullFloats.zip

Practical Implications

The reduced precision of subnormals is just another reason to avoid doing significant calculations with numbers in that range. Subnormals exist to allow gradual underflow and should only occur rarely.

Printing the full 100+ digit value of a number is rarely needed. It’s interesting to understand how it works, but that’s about it.

It is important to know how many mantissa digits it takes to uniquely identify a float. If you want to round-trip from float to decimal and back to float (saving a float to an XML file for instance) then it is important to understand that nine mantissa digits are required. I recommend printf(“%1.8e”, f). This is also important in debugging tools, and I’m told that VS 11 will fix this bug.

It can also be important to know what decimal numbers can be uniquely represented with a float. If all of your numbers are between 1e-3 and 8.58e9 then you can represent all seven digit numbers, but beyond that there are some ranges where six is all that you can get. If you want to round-trip from decimal to float and then back then you need to keep this limitation in mind.

Until next time

Next time I might cover effective use of Not A Numbers and floating-point exceptions, or general floating-point weirdness, or why float math is faster in 64-bit processes than in 32-bit /arch:SSE2 projects. Let me know what you want. I’m having fun, it’s a big topic, and I see no reason to stop now.

#AltDevBlog

Bruce-Dawson