"); p.document.close(); } function p2(){ p.window.close(); } function tip(){ bn=window.open('','','toolbar=no,width=600,height=450,scrollbars=yes,top=0,left=0'); bn.document.open(); bn.document.writeln("binary numbers"); bn.document.writeln("Decimal to Binary                                                      Binary to Decimal
Consider the Decimal Number 324                         Consider the Binary Number 101000100"); bn.document.writeln("
      2 | 324 | 0                                                                Decimal Equivalent = "); bn.document.writeln("
      2 | 162 | 0                                                                    1 * 2 8 + 0 * 2 7 + 1 * 2 6 + "); bn.document.writeln("
      2 |   81 | 1                                                                    0 * 2 5 + 0 * 2 4 + 0 * 2 3 + "); bn.document.writeln("
      2 |   40 | 0                                                                    1 * 2 2 + 0 * 2 1 + 0 * 2 0"); bn.document.writeln("
      2 |   20 | 0                                                                  =1 * 2 8 + 1 * 2 6 + 1 * 2 2"); bn.document.writeln("
      2 |   10 | 0                                                                  =256 + 64 + 4"); bn.document.writeln("
      2 |     5 | 1                                                                  =324"); bn.document.writeln("
      2 |     2 | 0"); bn.document.writeln("
             1"); bn.document.writeln("
   Binary Equivalent = 101000100"); bn.document.writeln("
"); bn.document.writeln("Binary conversion for Decimal fraction                 Decimal conversion for Binary fraction
Consider the Decimal Number 0. 359375                Consider the Binary Number . 010111"); bn.document.writeln("
         | . 359375 × 2 |                                                      Decimal Equivalent = "); bn.document.writeln("
      0 | . 718750 × 2 |                                                          0 * 2 -1 + 1 * 2 -2 + 0 * 2 -3 + "); bn.document.writeln("
      1 | . 437500 × 2 |                                                          1 * 2 -4 + 1 * 2 -5 + 1 * 2 -6"); bn.document.writeln("
      0 | . 875000 × 2 |                                                        =1 * 2 -2 + 1 * 2 -4 + 1 * 2 -5 + 1 * 2 -6"); bn.document.writeln("
      1 | . 750000 × 2 |                                                        =0. 25 + 0. 0625 + 0. 03125 + 0. 015625"); bn.document.writeln("
      1 | . 500000 × 2 |                                                        =. 359375"); bn.document.writeln("
      1 | . 000000 × 2 |                                                                   "); bn.document.writeln("
   Binary Equivalent = . 010111"); bn.document.writeln("

Similarly for Octal, Decimal & Hexadecimal substitute 8, 10 & 16 respectively instead of 2
"); bn.document.writeln("
"); bn.document.writeln("
"); bn.document.close(); }
Truncation Errors

These are the errors that arise from using approximate formulae in computations.

Example :

Assume that a function f and all its higher order derivatives with respect to the independent variable x are known at a point, say x = x0. To find the value of the function at a neighbouring point, say x = x0 + dx, one can use the Taylor series expansion of f about x0:

f(x0 + dx) = f(x0) + dx * f'(x0) + (dx^2 / 2!) * f''(x0) + . . .

The right-hand side of the above equation is an infinite series, and one has to truncate it after some finite number of terms to calculate f(x0 + dx), whether by computer or by hand. Hence the value obtained is only an approximation to f(x0 + dx).

Numerical Example :

  • Let f(x) = x + e^x.  At x = 0.5, f(x) = 2.14872127070013
  • The Taylor series expansion for f(0.6) = f(0.5 + 0.1) gives

        f(0.5 + 0.1) =

        f(0.5) + (0.1)*f'(0.5)                                                                    = 2.41359339777014
        f(0.5) + (0.1)*f'(0.5) + (0.1)^2 * f''(0.5)/2!                                            = 2.42183700412364
        f(0.5) + (0.1)*f'(0.5) + (0.1)^2 * f''(0.5)/2! + (0.1)^3 * f'''(0.5)/3!                   = 2.42211179100209
        f(0.5) + (0.1)*f'(0.5) + (0.1)^2 * f''(0.5)/2! + (0.1)^3 * f'''(0.5)/3! + (0.1)^4 * f''''(0.5)/4! = 2.42211866067405

    The exact value, correct up to 14 decimal places, is 2.42211880039051
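These partial sums can be reproduced with a short script (a sketch; note that for f(x) = x + e^x every derivative of order two and higher is simply e^x):

```python
import math

x0, dx = 0.5, 0.1

def deriv(k, x):
    """k-th derivative of f(x) = x + e^x."""
    if k == 0:
        return x + math.exp(x)
    if k == 1:
        return 1 + math.exp(x)
    return math.exp(x)          # all higher derivatives are e^x

exact = deriv(0, x0 + dx)       # f(0.6), computed directly
s = deriv(0, x0)                # zeroth-order term f(x0)
for k in range(1, 5):
    s += dx**k / math.factorial(k) * deriv(k, x0)
    print(f"order {k}: {s:.14f}   truncation error = {exact - s:.2e}")
```

Each added term shrinks the truncation error by roughly a factor of dx, which is why the printed approximations agree with the exact value to more and more digits.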
    Round off Errors

    These are the errors which arise as a result of rounding or chopping of numbers by the computer. To understand this rounding or chopping one needs some knowledge of the representation of real numbers in computers. In general, real numbers are stored as floating point quantities; e.g., the fixed number 39.428 is stored as 0.39428 * 10^2 (normalized form) and 0.00248 is stored as 0.248 * 10^-2.
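Python exposes the analogous normalized decomposition for binary floating point through math.frexp, which returns a mantissa m with 0.5 <= m < 1 and an integer exponent e such that the value equals m * 2^e:

```python
import math

for x in (39.428, 0.00248):
    m, e = math.frexp(x)          # x == m * 2**e, with 0.5 <= m < 1
    print(f"{x} = {m} * 2^{e}")
    assert x == math.ldexp(m, e)  # ldexp reassembles the number exactly
```

The base here is 2 rather than 10, which is why the mantissa printed for 39.428 is not simply 0.39428.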

    Though different computers use slightly different techniques, the general procedure is similar. A floating point number has four parts to understand:

    1. base
    2. sign (requires one bit)
    3. fractional part or mantissa (consumes most of the available 32, 64 or 80 bits)
    4. exponent part or characteristic

    The first of these identifies the number system (binary, octal, decimal or hexadecimal) and does not need any storage space. The other three parts have a fixed total length, typically 32, 64 or 80 (or more) bits, depending on the precision of the representation. Symbolically the number is represented as

    +/- .d1 d2 ... dp * B^e

    where the di are digits or bits with values from zero to B-1, and
    B = the base that is used
    p = the number of significant digits (bits)
    e = an integer exponent

    The significant bits constitute the fractional part of the number. In normalized floating point with a binary base, the first digit of the fractional part is always 1. Some systems take advantage of this fact and do not store that first bit, thereby gaining one bit of precision. This suppressed first bit is called the hidden bit.

    IEEE, VAX and IBM standards are some of the methods followed by computers to store these floating point numbers. IEEE and VAX use the binary number system, whereas IBM uses the hexadecimal system. Let us now see the IEEE system in detail.

    IEEE Method   Total    # bits   # bits   Bias**   Max.      Min.      Largest     Smallest     Approx. Prec.
                  # bits   for p    for e    value    Exponent  Exponent  Dec. No.    Dec. No.     (digits)
    Single          32       23*       8       127       127      -126    1.701E38    1.175E-38         7
    Double          64       52*      11      1023      1023     -1022    8.988E307   2.225E-308       16
    Extended        80       64       15     16383     16383    -16382    6E4931      3E-4931          19

    *  plus one hidden bit
    ** the bias value is added to the exponent so that it can be stored as an unsigned number
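The single precision layout in the table (1 sign bit, 8 exponent bits stored with bias 127, and 23 fraction bits plus a hidden bit) can be checked by unpacking the raw bits of a float. The following is a sketch using Python's struct module; the helper name is our own:

```python
import struct

def fields32(x):
    """Split an IEEE single precision number into its sign bit,
    biased exponent, and 23 stored fraction bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF        # 8 exponent bits, bias 127
    frac = bits & 0x7FFFFF           # 23 stored fraction bits
    return sign, exp, frac

# 1.5 = +1.1 (binary) * 2^0: hidden bit 1, stored fraction .100...,
# and exponent 0 stored as 0 + 127.
sign, exp, frac = fields32(1.5)
print(sign, exp - 127, bin(frac))
```

The hidden bit never appears in frac; only the digits after it are stored, which is how single precision achieves 24 significant bits in 23 stored ones.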

    It is clear from the above table that any computer adopting the IEEE notation to store floating point numbers has its own bounds. For a single precision variable these bounds are -(1 - 2^-24) * 2^127 to -2^-126 for negative numbers, and 2^-126 to (1 - 2^-24) * 2^127 on the positive side. Also, there exist gaps between the numbers in these two intervals. To understand this, consider a small number system in binary, i.e., B = 2, with p = 2 and -1 <= e <= 2. For this system the possible normalized numbers are either +/-.10_2 * 2^e or +/-.11_2 * 2^e, with -1 <= e <= 2. The smallest and largest numbers in this system are respectively

              -.11_2 * 2^2 = -(1 * 2^-1 + 1 * 2^-2) * 2^2 = -(1/2 + 1/4) * 4 = -3
               .11_2 * 2^2 = 3.

    The list of all the +ve numbers is

              .10_2 * 2^-1 = 1/4          .11_2 * 2^-1 = 3/8
              .10_2 * 2^0  = 1/2          .11_2 * 2^0  = 3/4
              .10_2 * 2^1  = 1            .11_2 * 2^1  = 3/2
              .10_2 * 2^2  = 2            .11_2 * 2^2  = 3

    The -ve numbers are distributed similarly, mirroring the list above. Together these give the complete list of possible numbers in this system.
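The full set can be enumerated directly (a small sketch of the toy system B = 2, p = 2, -1 <= e <= 2):

```python
# Enumerate every normalized number +/- .f1f2 * 2^e. Since the system is
# normalized, the first fraction bit is 1, so the fraction is either
# .10 (= 1/2) or .11 (= 3/4), and the exponent e runs from -1 to 2.
values = sorted(
    sign * frac * 2**e
    for sign in (1, -1)
    for frac in (0.5, 0.75)
    for e in (-1, 0, 1, 2)
)
print(values)
# Note the gap at the top: no representable number lies strictly
# between 2 and 3, and none between -1/4 and 1/4 except the gap around 0.
```

All sixteen values land in [-3, -1/4] and [1/4, 3], and the spacing between neighbours doubles each time the exponent increases.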
    So for the considered number system, all the numbers lie in the intervals [-3, -1/4] and [1/4, 3], and the above are the only possible numbers. That is, any number between 2 and 3 will be chopped or rounded to either 2 or 3. Similarly, in the IEEE method there are also gaps in the two intervals, though these gaps are very small compared to the ones presented above.

    It is also not possible to convert some numbers, say 0.6, into binary form without some round off error.
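This can be seen directly in Python: constructing a Decimal from the float 0.6 reveals the exact binary value that was actually stored, which differs slightly from 0.6.

```python
from decimal import Decimal

# 0.6 has an infinite repeating binary expansion (.1001 1001 1001 ...),
# so the stored double is only the nearest representable value.
print(Decimal(0.6) == Decimal("0.6"))  # False: stored value differs from 0.6
print(Decimal(0.6))                    # prints the exact stored value

# By contrast 1/4 = .01 in binary is exactly representable.
print(Decimal(0.25) == Decimal("0.25"))  # True
```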

    Because of the above-mentioned gaps in representation and the rounding or chopping in conversions, round off errors are inevitable in computer arithmetic.