Floating Point Arithmetic
I am going to review a little bit the idea of floating point arithmetic, these ideas can be found in any elementary textbook on Numerical Analysis, or Knuth’s The Art of Computer Programming Volume II. When necessary I will introduce algorithms if I find them relevant…
OK, so the basic idea behind floating point is that we use a sort of generalization of the rational numbers. A rational number is a number of the form where
are integers with
being nonzero. With our fancy pants base 10 number system, we can use scientific notation to represent numbers like
, or
.
Now here’s a bit of a thing we do in practice with calculations: we keep only so many digits. If we keep only 3 digits, the above two numbers become (in scientific notation) and
.
What we can also do is write these numbers as a sort of “ordered pair” where
is the fractional part and
is the exponent part. We can translate it back to a number by the following relation
So we can translate our two numbers into this “order pair” form as:
Now, in general, if we are working on a computer, we will be working with numbers in base 2. What does this mean? Well, observe we have written 2.14 and 1.21 for the fractional parts. We cannot do this on a computer, we have to translate it to base 2. I don’t know about the reader, but when I was first told this in numerical analysis, I thought “Great, how the in heck do you propose to do this?!” It’s actually quite simple. Lets translate into base 2. We have
.
We’ll go step by step. Step One translate 3 into base 2. It is 11, because where
means we’re using base b, and 3-2=1 so we need to add the equivalent of 1 in base 2 which is 1. Thus 10+1=11.
Step Two The fractional part is always the hard part. I alway forget what to do, but this is the basic trick: we have 0.14159, so we simply multiply by 2:
so the first bit is 0, we multiply by 2 again
so the second bit is 0
so the third bit is 1
the fourth bit is 0
the fifth bit is 0
the sixth bit is 1
the seventh bit is 0
the eighth bit is 0
the ninth bit is 0
the tenth bit is 0
the eleventh bit is 1
the twelfth bit is 1
the thirteenth bit is 1
And so on…as you can tell, this won’t be represented precisely in floating point. Most of the time, we get round off errors. Now, we can represent the number 3.14159 approximately in floating point as
We “normalize” so the first digit is always 1, which means we introduce an exponent
and this can be represented in our form as an ordered pair as
.
Of course it should be noted that because we gave up at the thirteenth bit after the decimal, our number 11.0010010000111 is really translated in base 10 to 3.141479492187500 instead of 3.14159, there is an error of if we just take the difference between the two numbers. And of course, after normalization we find
in base 10.
So there is a bit of error here…If the reader is as unsatisfied with my explanation as I am, then the reader may refer to A Better Tutorial on Converting Decimal to Floating-Point.
To sum up, how can we talk mathematically about these things? Well, the ordered pair has the fractional part which consists of some fixed number of bits, and an exponential part which consists of a fixed number of bits. Why not represent this as the direct sum of two vector spaces over
? That’s what I am going to do for all practical purposes (if this bothers you, and it may possibly bother me later, then just treat it as two tuples).
There is floating point addition/subtraction, and multiplication/division operations. We would like our floating point to be closed under both operations (in other words, if we think of these things as functions which eat in two elements in our set of floating point numbers, it should spit something out that’s a floating point number!). It’s mathematically nice that way! I won’t go into detail about floating point arithmetic, as it’s far too long for me to discuss in any effective detail. The interested reader may refer to A Book on the Subject freely available online.
Floating Point Special Numbers
There are several special numbers we include in the set of floating point numbers, in addition to the set as constructed by the ordered pair of two vectors, we have and
. These are three special “numbers”. Why?
Well, suppose we divide by zero, what happens? We go to hell, but additionally we end up with . If we end up multiplying two “really huge numbers” that can no longer be meaningfully represented by floating point, since we have only a finite number of bits, we then have
. If one of those numbers is negative, we get
. There is a natural question that arises: if we use
bits for the fractional part, what’s the smallest number greater than one that we can represent that? The answer is surprisingly simple: it’s
.
So for a floating point type with 53 bits in the fractional part, we have our be precisely
.
We will use half-precision in base 10 for all example purposes which has 11 digits for the fractional part and 5 digits for the exponent. Note that we can represent only numbers with this precisely, everything else will have roundoff error.
The Two Floating Point Operations
There are two floating point operations we are concerned with: addition and
multiplication, so lets consider both of these operations and what mathematical
structure is formed by the set of floating point numbers with a fixed number of
bits for the fractional part and the exponent part.
Addition
Let the set of floating point numbers equipped with floating point addition be denoted by .
Proposition The double is closed under addition (it may generate infinity or
as a result).
Proposition The double is commutative.
Proposition The double nonassociative.
Proposition There is a unique number such that for any
we have
(Existence of Additive Identity).
Remark We almost have an additive inverse for every element, except infinity and …which is a pity. So it cannot be an Abelian group, since
is not associative, nor is there always an inverse element.
Multiplication
Let us now consider the set of floating point numbers with a fixed number of bits equipped with the floating point multiplication operator .
Proposition The double is closed under the
operator.
Proposition The is commutative.
Proposition The is not necessarily associative due to use of finite bits in the fractional and exponent parts of the floating point number.
Remark The double cannot be a monoid.
Proposition There is an element such that for any
we have
(i.e. we have the multiplicative identity).
Remark The double is unital.
Proposition The floating point multiplication operation does not distribute over the floating point addition operation, due to the finiteness of bits used in the floating point number.
Remark Sadly, due to the lack of distributivity, we cannot even have this form a nonassociative Ring.
Remark If we get rid of , and insisted on the following rules to hold:
We can then have the double be an Abelian group. Further, the double
has a multiplicative inverse. So it forms an Abelian group too. The lack of distributivity is what’s preventing this from being a nonassociative ring! Wo is me!