Floating point is ubiquitous in programming these days. Hardware has improved to the point where in many environments it is actually faster to use floating point as opposed to integer (this wasn’t the case a decade ago). This post will attempt to educate the reader on various “tricks” that can help push floating point performance and relative accuracy even further, and allow the programmer to avoid some of the pitfalls I have fallen into over the years.

This article will assume basic knowledge of floating point numbers.

So we’ve all used them, likely for all and sundry within our games/tools, so here’s a question to ask yourself… what does the code below print out for the “test” variable?

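{
    float small_value= 1.0f;
    float smaller_value= 1 / 100000000.0f;
    float test= small_value + smaller_value;

    // copy the raw bits out with memcpy (needs <string.h>);
    // casting a float* to an int* breaks strict aliasing rules
    unsigned int small_bits, test_bits;
    memcpy(&small_bits, &small_value, sizeof small_bits);
    memcpy(&test_bits, &test, sizeof test_bits);

    printf("{%g,%g} => {0x%08x,0x%08x}\n", small_value, test, small_bits, test_bits);
}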
 
  

Almost all programmers I spoke to assumed the value of test to be 1.00000001f, and of course mathematically it should be. However it is not; the printout shows:

{1,1} => {0x3f800000,0x3f800000}
 
  

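Single precision floating point simply cannot represent the accuracy the result requires: a float carries a 24-bit significand (roughly 7 decimal digits), so the nearest representable value above 1.0f is about 1.00000012f, and as such the result is kept at 1.0f.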

Now if you are mathematically minded this will likely poke at your OCD gene and “force” you into using doubles. This is a perfectly palatable solution in many situations where one does not require speed, however double still suffers the same fate if you double the number of zeros in the denominator (1 / 10000000000000000.0).
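To see where the cut-off sits for both types, here is a minimal sketch. The helper names gap_above, is_absorbed and check_absorption are my own for illustration and are not from any library; the sketch assumes IEEE 754 float and double with the default round-to-nearest mode.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Distance from x to the next representable value above it (float or double).
template <typename T>
T gap_above(T x)
{
    return std::nextafter(x, std::numeric_limits<T>::infinity()) - x;
}

// True when big + small rounds straight back to big.
template <typename T>
bool is_absorbed(T big, T small)
{
    volatile T sum = big + small; // volatile keeps the sum stored as type T
    return sum == big;
}

// The two cases from the text, plus the matching cases that do survive.
void check_absorption()
{
    assert(is_absorbed(1.0f, 1.0f / 100000000.0f));       // float swallows 1e-8
    assert(!is_absorbed(1.0, 1.0 / 100000000.0));         // double keeps 1e-8
    assert(is_absorbed(1.0, 1.0 / 10000000000000000.0));  // double swallows 1e-16
}
```

The gap above 1.0f is 2^-23 (about 1.19e-7), so anything smaller than half of that simply rounds away. The same reasoning applies to double with a gap of 2^-52.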

For those of us who work in games, math is important, but it’s far less important than simulation determinism and getting things done expediently whilst executing within performance budgets. This forces us to make decisions that might otherwise be seen as somewhat mathematically incorrect. The above is one such circumstance. In most games applications we cannot afford to switch math to use doubles, and as such the inherent limitations of single precision floating point math are deemed acceptable, even becoming “normal”. I’ve been working with them so long that it is now natural to apply these limitations in all circumstances as par for the course.

As was mentioned in the link I posted above, comparing floating point numbers using == is only sensible in a purely deterministic (read: constant) form. If math is being used to generate the values to be compared then great care has to be taken; great care is rarely taken. The general rule in most studios is simple: don’t compare floats using == on pain of death or public embarrassment. The usual method is to apply some form of epsilon to the comparison:

{
    float a= 1.0f;
    float b= 1.0005f;
    static const float epsilon= 0.0001f;

    // absolute difference, using the single precision fabsf from <math.h>
    float temp= fabsf(b - a);

    if(temp > epsilon)
    {
        // not the same
    }
}
 
  

Within games this is one method used to enforce determinism; apply a decent epsilon and most math “behaves” (this doesn’t mean it’s accurate, mind). One complication, however, is that the epsilon is not always obvious, nor can it always be constant, especially within helper classes such as Vector Math.

A good epsilon is usually dependent upon the data you’re representing. If you are dealing in world coordinates, for instance, where 1.0f = 1m, then:

0.01    = 1cm
0.001   = 1mm
0.0001  = 100µm
0.00001 = 10µm
 
  

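Many of the games I’ve worked on use 1mm as their positional epsilon (0.001f) and 10µm for directional/rotational epsilon (0.00001f). For the positional epsilon this provides an effective range of approximately +/- 10000.001, and for the rotational epsilon approximately +/- 100.00001 (at magnitudes near 1000 adjacent floats are already about 0.00006 apart, so 0.00001 cannot be resolved there).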
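As a rough illustration of picking the epsilon per kind of data rather than hard-coding a single constant, here is a minimal sketch. The names k_positional_epsilon, k_rotational_epsilon, nearly_equal and check_epsilons are hypothetical, chosen for this example only; the values are the ones quoted above and assume 1.0f = 1m.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical per-domain tolerances (assumes world units of 1.0f = 1m).
const float k_positional_epsilon = 0.001f;   // 1mm
const float k_rotational_epsilon = 0.00001f; // 10µm

// Absolute-epsilon comparison; the caller supplies the tolerance that suits the data.
// Any NaN operand makes the comparison false, so NaN never compares "equal".
inline bool nearly_equal(float a, float b, float epsilon)
{
    return std::fabs(a - b) <= epsilon;
}

void check_epsilons()
{
    // Half a millimeter apart: the same position, but not the same direction component.
    assert(nearly_equal(1.0f, 1.0005f, k_positional_epsilon));
    assert(!nearly_equal(1.0f, 1.0005f, k_rotational_epsilon));
}
```

Passing the tolerance in as an argument keeps helper code such as vector math free of any assumption about what the numbers mean.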

The general rule of thumb I use is 8 decimal places between your highest and lowest accuracy requirements. If you feel you need a larger range than this but still want accuracy, then consider using a reference frame: one value to represent low accuracy high values (say kilometers) and another to represent high accuracy low values (meters down to µm).
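One possible shape for such a reference frame is sketched below. This is only an assumption about how it could be laid out, not a description of any particular engine; the type framed_coordinate and the functions renormalize, difference_in_meters and check_reference_frame are all hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical two-part coordinate: a coarse kilometer count plus a fine offset in meters.
struct framed_coordinate
{
    std::int32_t kilometers; // low accuracy, large range
    float meters;            // high accuracy, nominally kept in [0, 1000)
};

// Move whole kilometers out of the fine part so it stays small and precise.
framed_coordinate renormalize(framed_coordinate c)
{
    float whole_km = std::floor(c.meters / 1000.0f);
    c.kilometers += static_cast<std::int32_t>(whole_km);
    c.meters -= whole_km * 1000.0f;
    return c;
}

// Difference a - b in meters; accurate when the two points are close together.
float difference_in_meters(const framed_coordinate & a, const framed_coordinate & b)
{
    return static_cast<float>(a.kilometers - b.kilometers) * 1000.0f + (a.meters - b.meters);
}

void check_reference_frame()
{
    framed_coordinate c = renormalize(framed_coordinate{ 2, 1500.25f });
    assert(c.kilometers == 3 && c.meters == 500.25f);

    framed_coordinate d = framed_coordinate{ 3, 500.0f };
    assert(difference_in_meters(c, d) == 0.25f);
}
```

The point of the split is that the float part never grows large, so its spacing stays fine, while the integer part carries the range.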

Now one area that seems to catch a lot of people out (myself included on many occasions) is the Infinity & NaN rules… so here is a handy table borrowed from here; his looks prettier. In each table the row label is the left operand and the column label is the right operand.

   + |  -Inf    -1    -0     0     1   Inf   NaN
-Inf |  -Inf  -Inf  -Inf  -Inf  -Inf   NaN   NaN
  -1 |  -Inf    -2    -1    -1     0   Inf   NaN
  -0 |  -Inf    -1    -0     0     1   Inf   NaN
   0 |  -Inf    -1     0     0     1   Inf   NaN
   1 |  -Inf     0     1     1     2   Inf   NaN
 Inf |   NaN   Inf   Inf   Inf   Inf   Inf   NaN
 NaN |   NaN   NaN   NaN   NaN   NaN   NaN   NaN

   - |  -Inf    -1    -0     0     1   Inf   NaN
-Inf |   NaN  -Inf  -Inf  -Inf  -Inf  -Inf   NaN
  -1 |   Inf     0    -1    -1    -2  -Inf   NaN
  -0 |   Inf     1     0    -0    -1  -Inf   NaN
   0 |   Inf     1     0     0    -1  -Inf   NaN
   1 |   Inf     2     1     1     0  -Inf   NaN
 Inf |   Inf   Inf   Inf   Inf   Inf   NaN   NaN
 NaN |   NaN   NaN   NaN   NaN   NaN   NaN   NaN

   * |  -Inf    -1    -0     0     1   Inf   NaN
-Inf |   Inf   Inf   NaN   NaN  -Inf  -Inf   NaN
  -1 |   Inf     1     0    -0    -1  -Inf   NaN
  -0 |   NaN     0     0    -0    -0   NaN   NaN
   0 |   NaN    -0    -0     0     0   NaN   NaN
   1 |  -Inf    -1    -0     0     1   Inf   NaN
 Inf |  -Inf  -Inf   NaN   NaN   Inf   Inf   NaN
 NaN |   NaN   NaN   NaN   NaN   NaN   NaN   NaN

   / |  -Inf    -1    -0     0     1   Inf   NaN
-Inf |   NaN   Inf   Inf  -Inf  -Inf   NaN   NaN
  -1 |     0     1   Inf  -Inf    -1    -0   NaN
  -0 |     0     0   NaN   NaN    -0    -0   NaN
   0 |    -0    -0   NaN   NaN     0     0   NaN
   1 |    -0    -1  -Inf   Inf     1     0   NaN
 Inf |   NaN  -Inf  -Inf   Inf   Inf   NaN   NaN
 NaN |   NaN   NaN   NaN   NaN   NaN   NaN   NaN
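A few of the entries above can be spot-checked in code. The sketch below assumes IEEE 754 arithmetic and a compiler that is not running in a fast-math style mode (those modes are allowed to assume Inf and NaN never occur). The names k_inf, divide, is_negative_zero and check_table_entries are hypothetical helpers written for this example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

const float k_inf = std::numeric_limits<float>::infinity();

// Division kept in a function so the operands are passed as ordinary values.
float divide(float numerator, float denominator)
{
    return numerator / denominator;
}

// -0 compares equal to 0, so the sign bit has to be inspected separately.
bool is_negative_zero(float x)
{
    return x == 0.0f && std::signbit(x);
}

// Spot-checks a handful of entries from the tables above.
void check_table_entries()
{
    assert(std::isnan(k_inf + (-k_inf)));           // Inf + -Inf = NaN
    assert(std::isnan(k_inf - k_inf));              // Inf - Inf  = NaN
    assert(std::isnan(0.0f * k_inf));               // 0 * Inf    = NaN
    assert(std::isnan(divide(0.0f, 0.0f)));         // 0 / 0      = NaN
    assert(divide(1.0f, 0.0f) == k_inf);            // 1 / 0      = Inf
    assert(divide(1.0f, -0.0f) == -k_inf);          // 1 / -0     = -Inf
    assert(is_negative_zero(divide(1.0f, -k_inf))); // 1 / -Inf   = -0
    assert(!is_negative_zero(-0.0f + 0.0f));        // -0 + 0     = 0
}
```

The rows that tend to bite in practice are the ones where two perfectly finite-looking operations meet, such as 0 * Inf, because the NaN then propagates through everything downstream.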

On the subject of error handling, consider the standard method of normalizing a 3D vector.

vector3 normalize_vector(const vector3 & vec)
{
    float length= vector_length(vec);
    vector3 result=vec;

    // only divide when the length is non-zero; a zero vector is returned unchanged
    if(length > 0.0f)
    {
        float reciprocal= 1.0f / length;

        result.x *= reciprocal;
        result.y *= reciprocal;
        result.z *= reciprocal;
    }

    return result;
}
 
  

We want to avoid introducing a problem by way of an INF=>NAN in the returned data, or throwing an exception, so a branch is inserted. This effectively removes the divide by zero problem, however at a cost: on many platforms a mispredicted branch stalls or flushes the instruction pipeline, resulting in significant performance issues in hot code. The problem is that there really isn’t another way to achieve the same avoidance mathematically.

There is, however, a method of avoiding it if you rely upon floating point math. We’ve established that a large value remains the same when a small value is added. We’ve also discussed that the effective range of floating point values for use in games is limited, both for determinism and to avoid issues with the math itself. Combining the two provides a rather elegant method of avoiding the divide by zero issue under normalization and in many other well known situations.

const float very_small_float= 1.0e-37f;

vector3 normalize_vector_2(const vector3 & vec)
{
    // the tiny bias keeps the divisor non-zero without needing a branch
    float length= very_small_float + vector_length(vec);

    float reciprocal= 1.0f / length;

    vector3 result;

    result.x = vec.x * reciprocal;
    result.y = vec.y * reciprocal;
    result.z = vec.z * reciprocal;

    return result;
}
 
  

By limiting the allowed range of vector3 components, and understanding that addition of the very small value to any value larger than 1.0e-29 has zero effect, we have effectively removed the possibility of receiving INF, and thus NAN, as the result. (A denormal length is still possible, however it is much less likely.)
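To convince yourself of the behaviour, the branch-free version can be exercised with a zero vector and an ordinary one. In the sketch below the vector3 layout and the vector_length implementation are my assumptions (the article does not show them), and check_normalize is a hypothetical test helper.

```cpp
#include <cassert>
#include <cmath>

// Assumed layout; the real vector3 is not shown in the article.
struct vector3 { float x, y, z; };

// Assumed implementation: plain Euclidean length.
float vector_length(const vector3 & vec)
{
    return std::sqrt(vec.x * vec.x + vec.y * vec.y + vec.z * vec.z);
}

const float very_small_float = 1.0e-37f;

vector3 normalize_vector_2(const vector3 & vec)
{
    // The tiny bias keeps the divisor non-zero, so no branch is needed.
    float reciprocal = 1.0f / (very_small_float + vector_length(vec));
    return vector3{ vec.x * reciprocal, vec.y * reciprocal, vec.z * reciprocal };
}

void check_normalize()
{
    // Zero vector: the divisor is 1e-37, the reciprocal is large but finite, 0 * finite = 0.
    vector3 zero = normalize_vector_2(vector3{ 0.0f, 0.0f, 0.0f });
    assert(zero.x == 0.0f && zero.y == 0.0f && zero.z == 0.0f);

    // Ordinary vector: the bias is absorbed, so the result has unit length.
    vector3 unit = normalize_vector_2(vector3{ 3.0f, 0.0f, 4.0f });
    assert(std::fabs(vector_length(unit) - 1.0f) < 1.0e-6f);
}
```

Note that a zero vector comes back as a zero vector rather than a unit vector, the same as the branching version, so callers that need a valid direction still have to handle that case themselves.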

There is a plethora of information out there on floating point, both practical and theoretical. The above is my attempt to represent those cases not covered, or not well highlighted, and hopefully to make people think about their implementations more in terms of the specific requirements and less in terms of “floating point numbers handle everything”.