## July 20, 2010

### About floating point arithmetic

In general, to represent numbers with fractional parts, computers use a “floating point” binary representation. Floating point arithmetic is also used by AmiBroker for AFL calculations. For some more information about floating point representation in general see the following article, here we will only discuss some practical aspects.

http://en.wikipedia.org/wiki/Floating_point

Floating point calculations are performed in **hardware** by FPU (Floating Point Unit), which today is a part of your computer’s processor (CPU). The calculations are performed according to IEEE754 standard (see below) that all CPU manufacturers follow.

Internally in computers all numbers are represented in **binary** system.

This fact has some important consequences in practice. One of it is that some fractions that have finite representation in decimal system are not finite in binary.

For example, 0.1 is **endless** (infinite) fraction in binary system, as 1/3 or 2/3 fractions are in decimal system.

The mantissa of 0.1 fraction in binary system is cyclical:

1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 (1001)* …….

* represents endless cycle.

Since computers operate on limited word length, 32 bit binary representation of 0.1 is:

01111011 10011001100110011001100 (not rounded) (in decimal it is 0.09999999403953552)

or

01111011 10011001100110011001101 (rounded) (in decimal it is 0.10000000149011612)

Later is used by FPU (floating point processor) in your CPU because it has smaller relative error. If you add it nine times you will end up with 0.9000000134110449 which is of course higher than 0.9.

It will be easier for you to understand when explaining on decimal numbers. For example 2/3 represented in decimal is:

0.666666667

Now add THREE times this number (0.666666667 + 0.666666667 + 0.666666667) and what you will get?

2.000000001 and that is greater than 2.

Therefore, it is rule in programming, NEVER use fractions for loop counters. For that reason you should not use fractions in Optimize(), or if you need to use them use them in WISE way by adding HALF of “step” value to the “max” value.

```
step = 0.1;
```

x = Optimize("x", 0.5, 0.1, 0.9 + step/2, step);

Another thing to keep in mind is that 32-bit floating point number has only 7 significant digits (those digits that carry meaning contributing to its precision). So in 123.4567 all digits are significant and accurate, but in 123.456789, last two digits (‘8’ and ‘9’) are not significant and subject to floating point rounding – see links below).

See also:

About significant figures:

http://en.wikipedia.org/wiki/Significant_figures

IEEE754 conversion calculators:

http://babbage.cs.qc.edu/IEEE-754/Decimal.html

http://babbage.cs.qc.edu/IEEE-754/32bit.html

IEEE754 standard description:

http://en.wikipedia.org/wiki/IEEE_754-1985

Essay about comparing floatin point numbers:

http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm

Microsoft Knowledge Base: “Precision and Accuracy in Floating-Point Calculations”

http://support.microsoft.com/kb/125056

Using 32-bit floating point (as opposed to 64-bit double precision) has two main advantages:

a) consumes HALF of memory required for doubles (this *is* important, more important that you think, because if you have for example an array of 500000 elements, in floats it is 2MB and it fits into CPU cache, while in doubles it would be 4MB and may not fit into CPU cache).

b) 32-bit floating point numbers can be computed much faster. It simply takes less processor cycles to compute 32-bit float than 64-bit float. 64-bit version of AmiBroker is going even further and is using SSE2 instructions. This means that vectors of 4 single-precision numbers are processed in parallel using single SSE2 instruction on SINGLE processor. This gives more speed on single core than achievable using multiple cores.

Note also AmiBroker does use double precision (64bit/SSE2) or extended precision (80bit/x87) where it is necessary (for certain internal calculations).

Note also that due to the architectural differences between compilers for 32-bit and 64-bit programs,

small numerical differences may exist between 32-bit and 64-bit version of AmiBroker due to the fact that 32-bit version uses x87 FPU code while 64-bit version uses SSE2 code and the underlying floating point hardware is different and operate slightly differently.

x87 FPU code internally offers more (80bit) precision than SSE2 (64-bit).

For in-depth discussion see:

https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/

Filed by Tomasz Janeczko at 3:37 am under AFL

Comments Off on About floating point arithmetic