FORTRAN Performance Tuning co-guide

Copyright (C) 1998 Timothy C. Prince

Freely distributable with acknowledgment


Chapter 3
Some Local Optimizations

First, there are some optimizations which a good compiler should make automatically, to save us from having to make minor changes to suit the architecture of the moment.

A common question is whether to use an expression such as x*2 or (x+x).  Many compilers automatically convert one to the other. With typical Reduced Instruction Set Architectures, even the constant 2 occupies a register which might be wanted for other purposes, so the most common choice is to replace all floating point multiplications by 2 with additions. Integer multiplication by a constant is automatically replaced by a combination of shifts and adds when that saves time or register use; compilers often have tables to determine which constants are to be replaced by shift-add expansion.
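
As an illustration (hand expansion of this kind is rarely worthwhile, since the compiler applies its own tables), a minimal sketch of what shift-add expansion of i*10 amounts to, using the decomposition 10 = 8 + 2:

program shift_add
  implicit none
  integer :: i, j
  i = 123
  ! 10 = 8 + 2, so i*10 becomes ISHFT(i,3) + ISHFT(i,1);
  ! a compiler selects such sequences from its own tables
  j = ishft(i,3) + ishft(i,1)
  print *, j, i*10          ! both print 1230
end program shift_add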

Compilers also should replace division by a constant with multiplication, when that can be determined to produce identical results (as when the divisor is an exact power of 2 and there is no overflow). It may be adequate to check 1/const*const - 1 == 0, using whatever extra precision the system has.
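
A minimal sketch of that test, with DOUBLE PRECISION standing in for whatever extra precision the system offers (the divisor 8. is chosen so the check passes; 3. would fail it):

program recip_check
  implicit none
  real :: const, recip
  const = 8.
  recip = 1./const            ! reciprocal in working precision
  ! verify, in extra precision, that multiplication by the
  ! reciprocal reproduces the division exactly
  if (dble(recip)*dble(const) - 1d0 == 0d0) then
    print *, 'x/const may be replaced by x*recip'
  else
    print *, 'replacement would change results'
  end if
end program recip_check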

Small integer powers should be expanded in line, with advantage being taken of common factors; typically, powers up to 64 are treated this way, though not all compilers do it (a sketch of the expansion appears below). When the compiler falls short there is not always a source-level remedy, which increases the importance of writing polynomial expressions correctly.
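
The expansion meant here, shown for x**7, takes 4 multiplications by reusing common factors where a naive product would take 6:

program power_expand
  implicit none
  real :: x, x2, x3, x6, y
  x = 1.7
  x2 = x*x          ! 1st multiplication
  x3 = x2*x         ! 2nd
  x6 = x3*x3        ! 3rd: reuses the common factor x**3
  y  = x6*x         ! 4th
  print *, y, x**7
end program power_expand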

Horner's rule is often recommended for decreasing the number of operations required to evaluate a polynomial. In the common situation where the low order terms are the largest (as with a rapidly converging Taylor series), this also improves accuracy:

Rather than evaluating

y = A + B*x + C*x**2 + D*x**3

Horner's rule requires

y = A + x*(B + x*(C + x*D))

A good compiler will generate code to evaluate the first alternative with 5 multiplications, reusing x**2 in forming x**3. A poor one may use 8 multiplications and some function calls, and its order of additions is likely to be unfavorable for accuracy. Horner's rule gives 3 multiplications and fixes the order of evaluation to be that which is desired for accuracy, in the case of the rapidly converging series. Modern computer architectures are optimized for pairs of multiply and add operations, so the time required is nearly proportional to the number of multiplications, so long as the number of additions is no greater.

If any of the coefficients is 1., the corresponding multiplication should be distributed (so that no multiplication by 1. is actually performed), as this improves accuracy without increasing the number of operations. Setting up series and rational approximations with a leading x term is an important technique for preserving accuracy in math functions.
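
As a sketch of the leading-x technique, consider a truncated series for EXP(x)-1 (chosen here purely for illustration); distributing the leading multiplication by 1. leaves the dominant term exactly x:

program leading_x
  implicit none
  real :: x, y
  x = 1.e-4
  ! x + x**2/2 + x**3/6, written so that the exact term x leads
  ! and rounding affects only the small correction
  y = x + x*(x*(0.5 + x*(1./6.)))
  print *, y
end program leading_x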

Horner's rule is unnecessarily slow on any modern architecture with Instruction Level Parallelism. The expression

y = A + B*x + x**2*(C + D*x)

involves 4 multiplications, but executes up to 50% faster when there is hardware parallelism. When the terms are of random magnitudes, this expression is likely to be more accurate than the preceding two.

Following this pattern of grouping in pairs (of pairs...) of terms, powers of 3, 4, and 8 are fairly common, so it is important for compilers to expand them in line, and for the arithmetic to have enough range that at least EPSILON(x)**5 does not underflow. This is one of the properties mandated by IEEE 754 and 854. On older architectures, where EPSILON(x)**3 might underflow, it was not advisable to depart from Horner's rule, nor was there Instruction Level Parallelism to consider. While the IEEE standards require satisfactory behavior upon underflow, there is typically a time-consuming trap to a software handler.
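
Carried one level further, a degree-7 polynomial groups into pairs of pairs. In the sketch below (coefficients A through H are placeholders), the powers 2 and 4 are each computed once, and the four innermost pairs may all proceed in parallel:

program pairs_of_pairs
  implicit none
  real, parameter :: A=1., B=2., C=3., D=4., E=5., F=6., G=7., H=8.
  real :: x, x2, x4, y
  x  = 0.3
  x2 = x*x
  x4 = x2*x2
  y  = (A + B*x) + x2*(C + D*x) + x4*((E + F*x) + x2*(G + H*x))
  print *, y
end program pairs_of_pairs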

Parallel Comparison

Other situations arise where an expression may be written with more available parallelism. Often, in an expression such as

temp=max(a,min(b,x))

b is known to be larger than a, so that

temp=b
if(b>x)temp=max(a,x)

produces the same result, without making one comparison dependent on the other. This is an example (not a very interesting one) where speculative assignment may be preferred. A statement which is simple enough to be implemented in one instruction should take less time and code space if it is executed and overridden than if it is placed in an else clause.
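
Where Fortran 90 is available, the MERGE intrinsic expresses the same speculative form in a single statement; both alternatives are evaluated, and the unwanted one is discarded:

temp = merge(max(a,x), b, b > x)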

This advice is not always good; in fact, the original form probably is better when there is plenty of other work to be done in parallel.

Subexpression Parallelism

Most compilers use left-to-right evaluation of arithmetic, so

a/b*(c+d)

allows parallelism, while

a*(c+d)/b

does not. Such rearrangements may change results slightly, unless a or b is a power of 2, but not nearly as much as the pre-inversion tactic

a*(c+d)*(1/b)

The suggested arrangement is more likely to gain on architectures which allow division to proceed in parallel with other floating point operations.  Similar considerations apply to other groupings for parallelism, e.g.

a - b + (d - c)

which should, when possible, be set up so that a and b, and c and d, are of similar magnitude, so that this arrangement preserves as much precision as possible. For instance, (a-b) cannot lose precision when a/2 <= b <= a*2, if the arithmetic conforms to IEEE 754 or 854.
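
A sketch of the favorable case: each inner difference pairs operands of similar magnitude, so both differences are exact and may proceed in parallel, leaving only the final addition to round:

program grouped_sum
  implicit none
  real :: a, b, c, d, y
  a = 1.5e8
  b = 1.0e8        ! a/2 <= b <= a*2, so a-b is exact
  c = 1.75
  d = 3.25         ! likewise, d-c is exact
  y = (a - b) + (d - c)
  print *, y
end program grouped_sum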

Multiply-add Chaining

As multiply and add sequences are the most common pair of floating point operations, many architectures have special acceleration for them. This may take the form of a compound multiply-add instruction, or other hardware bypasses to skip cycles in this sequence. In order to take advantage of this, expressions of 4 or more operations should be arranged with leading addition:

a+b*c+d*e
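
A sketch of the chain this arrangement yields: two multiply-add pairs, each of which such hardware may execute as one compound operation:

t = a + b*c        ! first multiply-add
y = t + d*e        ! second multiply-add, dependent only on t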
 
Integer Shift Arithmetic
 
Compilers are expected to replace multiplication by a constant integer with appropriate sequences involving shifts, according to processor-dependent tables of timings.  ISHFT(i,1) has the same effect as i*2, with the latter being at least as fast.  ISHFT(i,-1), however, is likely to be faster than i/2, as the shift matches the division only for non-negative i, and the compiler can't assume non-negative i except possibly in the context of a DO index.
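
A sketch contrasting the two (ISHFT performs a zero-fill shift, so it parts company with division for negative arguments):

program shift_div
  implicit none
  integer :: i
  i = 6
  print *, i/2, ishft(i,-1)   ! both print 3
  i = -7
  ! Fortran division truncates toward zero, while a zero-fill right
  ! shift treats i as an unsigned bit pattern
  print *, i/2, ishft(i,-1)   ! -3 versus, on a two's complement
                              ! machine, a large positive value
end program shift_div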