```
4-1  FLOATING-POINT NUMBERS - GENERAL VIEW
******************************************

The real number system
----------------------
Scientific and engineering calculations are performed in the REAL
NUMBER SYSTEM, a highly abstract mathematical construct.

A real number is by definition a special infinite set of rational
numbers (integer fractions) - the so-called Dedekind cuts or an
equivalent formulation.  The arithmetical operations are defined
between such sets and are a natural extension of the arithmetic of
rational numbers.

The real numbers have wonderful properties:

1) There is no lower or upper bound, in simple language
they go from minus infinity to plus infinity.

2) Infinite density - there is a real number between
any two real numbers.

3) A lot of algebraic axioms are satisfied, e.g. the
'field axioms'.

4) Completeness - they contain all their 'limit points'
(the limit of every converging sequence is also 'real').

5) They are ordered.

Many of these properties are not satisfied by computer arithmetic;
see the chapter on errors in floating-point computations for a short
review of the properties that stay true in floating-point arithmetic.

To crunch a lot of numbers quickly, computers need a fixed-size
representation of real numbers, so that the hardware can perform
the arithmetical operations efficiently.

The problems arising from using a fixed size representation are the
subject of the following chapters.

Finite number systems are discrete
----------------------------------
If you use a fixed size representation, let's say N binary digits (BITS)
long, you have at most  2**N  bit-patterns, and so at most 2**N
representable numbers.

Such a finite set has to be bounded: it has a largest number and a
smallest number. This already gives us one problem: our computations
must not exceed these bounds.

In every bounded segment, there are infinitely many real numbers, but we
have at most 2**N available bit-patterns, so many real numbers will have
to be represented by one bit-pattern.

Of course one bit-pattern can't represent many numbers equally well: it
will represent one of them exactly and misrepresent the others.

Numbers that can be represented exactly are called FLOATING-POINT NUMBERS
(FPN); the term 'real numbers' will be reserved for the mathematical
constructs.

Roundoff errors are unavoidable
-------------------------------
Before we begin to study actual representations of real numbers,
let us develop a little further an idea mentioned in the previous section.

We said that in a finite number system, many real numbers will have
to be represented by one bit-pattern, and that bit-pattern will
represent exactly only one of them. In other words many real numbers
will be 'rounded off' to that one bit-pattern.

This 'rounding off' may occur whenever we enter a real number into
the computer (except in the rare case that we enter an exactly
representable number).

The same 'rounding off' may occur whenever we perform an arithmetical
operation. The result of an arithmetical operation usually will have
more binary digits than its operands, and will have to be converted
to one of the 'allowed' bit-patterns.

To make this more concrete, let's take an example using base-10
numbers, and suppose that only two-digit mantissas are allowed
(the fractional part may have only 2 decimal digits):

0.12E+02 + 0.34E+00 = 12.00E+00 + 0.34E+00 = 12.34E+00 ==> 0.12E+02

This example is a bit artificial and incompletely defined (in our fixed
representation, only the size of the fractional part was specified,
the exponents were left unspecified), but the idea is clear: computer
arithmetic has to replace almost every number and temporary result by
a rounded form. Instead of computing:

X + Y

we will really compute:

round(round(X) + round(Y))

The function 'round' can't be specified in general; it depends on the
representation and on the floating-point arithmetical algorithms we use.

A possible implementation of round() for decimal floating-point numbers
is:

e = INT(LOG10(X) + 1.0)               (number of decimal digits in X)

           INT(X * (10**(p-e)) + 0.5)
round(X) = --------------------------
                   10**(p-e)

The parameter p is the number of decimal digits in the representation.
Note that multiplying and dividing by (10**n) are just shifts of the
decimal point, and not error generating arithmetic operations.
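
As a check, take X = 12.34 and p = 2 (the sum from the example above):

e        = INT(LOG10(12.34) + 1.0) = INT(2.0913...) = 2

round(X) = INT(12.34 * (10**0) + 0.5) / (10**0) = INT(12.84) = 12

which agrees with the 0.12E+02 obtained before.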

Such seemingly complicated formulas can be implemented efficiently
(in radix 2) in hardware or reduced to a very small micro-code program
executed by the CPU.

In the following sections we will see that roundoff is an endless
source of errors, some of them unexpectedly large.

By the way, the distinction between real and floating-point numbers can
be summarized symbolically in our new notation by:

FPN = round(REAL)

A little basic theory
---------------------
Every real number x can be written in the form:

x  =  f  *  (2 ** e)

Where 'e' is an integer called the EXPONENT, and 'f' is a binary
fraction called the MANTISSA. The mantissa may satisfy one of the
normalization conditions:

1     <=  |f|   <  2          (IEEE)

1/2   <=  |f|   <  1          (DEC)

The mantissa is then said to be NORMALIZED.

The IEEE normalization condition is equivalent to the requirement
that the MOST SIGNIFICANT BIT (MSB) in the mantissa = 1.

The DEC condition requires the two most significant bits to be 0,1.
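
For example (an illustration, not tied to any particular machine),
the decimal number 6.5 is 110.1 in binary, and can be written as:

1.101  * 2**2        (IEEE normalization, f = 1.625)

0.1101 * 2**3        (DEC normalization,  f = 0.8125)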

On IBM 360, IBM 370 and Nova (Data General) computers, the base of the
exponent was 16 (it gives a larger range at the cost of precision):

x  =  f  *  (16 ** e)

The normalization condition was that the first HEX digit of the fraction
was not equal to 0, i.e. not all first 4 binary digits were 0.

The advantages of normalizing floating-point numbers are:

1) The representation is unique, there is exactly one way to
write a real number in such a form.

2) It's easy to compare two normalized numbers, you separately
test the sign, exponent and mantissa.

3) In a normalized form, a fixed size mantissa will use all
the 'digit cells' to store significant digits.

4) Under the IEEE and DEC normalization conditions the leading
significant bit of the mantissa is always 1, so it can
be omitted, and its place used for data. The omitted
bit is called the "hidden bit".

The normalized representation is used in almost all floating-point
implementations; 'denormalized numbers' are used only to minimize
accuracy loss due to underflow (see next chapter).

Just as with rounding, we will have to normalize after arithmetical
operations, because the result will not be normalized in general.

Floating-point numbers in practice
----------------------------------
In our finite machines, we can keep only a finite number of the binary
digits of 'f' and 'e', let's say 'm' and 'n' digits respectively.

The vendor predetermines a few combinations of 'm' and 'n', usually one
or two combinations that the hardware executes efficiently, and maybe
one more that gives better precision.

The following table compares some floats used in practice. The REAL*n
notation is a common extension to FORTRAN, where 'n' is the number of
bytes used in the representation. The table gives the representation
radix, the size (in bits) of the various parts composing the
floating-point number, and the exponent bias.

The number of bits in the fraction part is counted without the
"hidden bit", if normalized mantissas are used, so the sizes here
are "physical" rather than "logical".

Table of float types (incomplete)
=================================

Float name          Radix  Sign  Exponent  Fraction   Bias
----------          -----  ----  --------  --------  -----
IBM 370:
*  REAL*4             16     1        7        24        64  0.f * 16**(e-64)
*  REAL*8             16     1        7        56        64

VAX:
*  REAL*4 (F_FLOAT)    2     1        8        23       128  0.1f * 2**(e-128)
*  REAL*8 (D_FLOAT)    2     1        8        55       128  0.1f * 2**(e-128)
*  REAL*8 (G_FLOAT)    2     1       11        52      1024  0.1f * 2**(e-1024)
*  REAL*16(H_FLOAT)    2     1       15       112     16384  0.1f * 2**(e-16384)

Cray:
   Single precision    2     1       15        48     16384
   Double precision    2     1       15        96     16384

IEEE:
*  REAL*4              2     1        8        23       127  1.f * 2**(e-127)
   extended            2     1       11+       31+
*  REAL*8              2     1       11        52      1023  1.f * 2**(e-1023)
   extended            2     1       15+       63+
   REAL*10             2     1       15        64     16383

Intel (IEEE):
*  Short real          2     1        8        23       127  1.f * 2**(e-127)
*  Long real           2     1       11        52      1023  1.f * 2**(e-1023)
   Temp real           2     1       15        64     16383  0.f * 2**(e-16382)

MIL 1750A:
   REAL*4                   None      8        24      None  f * 2**e
   REAL*8                   None      ?        ??      None

HP 21MX:
Varian:
Honeywell:

Remarks:

1) Formats that use a sign bit (all except MIL 1750A),
use the sign convention:   0 = +, 1 = -
MIL 1750A uses a 2's complement mantissa with a
2's complement exponent.

2) '*' in the first column means that normalized mantissas
are used. Note that on IBM 370 the first hexadecimal
digit of the fraction (4 bits) couldn't be zero.
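
As a worked illustration of the IEEE REAL*4 row (the value 6.5 is
just a convenient example, not taken from any vendor manual):

6.5  =  1.101 (binary) * 2**2

sign      =  0                          (positive)
exponent  =  2 + 127 = 129 = 10000001   (biased, 8 bits)
fraction  =  10100000000000000000000    (23 bits, hidden bit omitted)

The complete 32-bit pattern is 0 10000001 10100000000000000000000,
or 40D00000 in hexadecimal.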

An important note
-----------------
The next chapter will provide a detailed example that will make the
abstract concepts clearer. To simplify our discussion, we will
give an incomplete treatment of this highly technical subject,
with no proofs.

Readers interested in a deeper treatment of these subjects are
referred to:

Goldberg, David
"What Every Computer Scientist Should Know About
Floating-Point Arithmetic"
ACM Computing Surveys
Vol. 23 #1  March 1991, pp. 5-48

+---------------------------------------------------------------------+
|     SUMMARY                                                         |
|     =======                                                         |
|     1) x = f * (2 ** e)     2 > |f| >= 1    e is an integer         |
|     2) There are a lot of float types                               |
|     3) IEEE/REAL*4 = 1 Sign bit, 8 exponent bits, 23 mantissa bits  |
+---------------------------------------------------------------------+

```
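
The round() formula from the section on roundoff errors and the IEEE REAL*4 layout from the table can be tried out with a short script. The sketch below is in Python rather than FORTRAN, and the names round_decimal and ieee_single_fields are hypothetical helpers introduced only for this illustration. Two assumptions go beyond the chapter's formula: floor() replaces INT() so that values below 0.1 are handled, and the sign and zero are treated explicitly. Because the script itself runs in binary floating point, it only imitates decimal rounding and should not be read as a definitive implementation.

```python
import math
import struct


def round_decimal(x, p):
    """Round x to p significant decimal digits (halves round away from zero).

    Sketch of the chapter's formula round(X) = INT(X*10**(p-e) + 0.5) / 10**(p-e).
    Assumptions added here: floor() instead of INT(), explicit sign and zero handling.
    """
    if x == 0.0:
        return 0.0
    # e = number of decimal digits in the integer part of |x|
    e = math.floor(math.log10(abs(x))) + 1
    scale = 10.0 ** (p - e)
    magnitude = math.floor(abs(x) * scale + 0.5) / scale
    return math.copysign(magnitude, x)


def ieee_single_fields(x):
    """Return (sign, biased exponent, fraction) of x stored as IEEE REAL*4."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, bias 127
    fraction = bits & 0x7FFFFF        # 23 bits, hidden bit not stored
    return sign, exponent, fraction


if __name__ == "__main__":
    # The chapter's two-digit example: round(round(X) + round(Y))
    x, y = 12.0, 0.34
    print(round_decimal(round_decimal(x, 2) + round_decimal(y, 2), 2))
    # math.frexp normalizes with 0.5 <= |f| < 1, i.e. the DEC condition
    print(math.frexp(6.5))
    print(ieee_single_fields(6.5))
```

With these definitions the two-digit sum 0.12E+02 + 0.34E+00 again comes out as 12.0, math.frexp(6.5) gives the DEC-style pair (0.8125, 3), and ieee_single_fields(6.5) returns sign 0, biased exponent 129 and fraction 5242880 (hexadecimal 500000).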