Copyright (C) 1998 Timothy C. Prince
Freely distributable with acknowledgment
Before profiling, the program should be fixed up to correct any uninitialized data or array bounds transgressions, using special compilation options to test with a selected suite of test cases.
Tuning may require a special set of cases sized to run a convenient length of time, preferably with array sizes representative of normal runs. These cases should be run with a normally reliable set of conservative compiler options, with profiling, to get a comparison set of results.
Typically, a compiler will have a set of options which sticks to IEEE style arithmetic (including comparisons) if that is possible, performs enough unrolling to get most of the available performance, does simple powers in line, observes parentheses, and produces reliable results. This may take some research, as it may not be one of the basic option packages like -O1. Profiling may require static linking, but it should allow use of compiler optimizations.
If the architecture has fused multiply-add instructions which do not round according to IEEE standards, they probably have to be used for optimization, but the test suite should be run with them turned off to verify reasonably close results. These instructions often produce more accurate results than would be obtained with intermediate rounding, but they also have occasional bad effects, typically in the solution of quadratic equations, or where underflow occurs.
SGI changed the single-instruction multiply and add from a fused instruction without intermediate rounding on the R8000 to one with IEEE style intermediate rounding on the R10000.
Compilers often have a choice between left-to-right expression evaluation, observing parentheses, and re-ordering in the hope of getting more use of multiply-add chaining, common sub-expressions, or instruction level parallelism. Sometimes these are tied in with loop renesting, which is unfortunate, particularly when it is not possible to get loop optimizations without allowing disregard of parentheses. When there is a re-ordering option, the default should observe parentheses, in this author's opinion. A diagnostic could be issued suggesting a re-arrangement, without actually doing it, so that the programmer's intent is not violated silently.
This author's preference is to use the re-ordering options only as a rough check to see if there are performance improvements to be found. It's usually possible to get a better combination of performance and accuracy with strict evaluation of properly written code. With industry standard benchmarks, where the source code (but not strict interpretation of it) is sacred, the vendors' reasons for taking liberties are understandable.
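As an illustration of why silent disregard of parentheses matters, here is a minimal sketch, assuming IEEE double precision and a compiler that observes parentheses. The module and function names are made up for this example. The two functions are algebraically identical but group the operations differently.

```fortran
module paren_demo
  implicit none
  integer, parameter :: dp = kind(1d0)
contains
  ! Grouping as written: the small term is added to the large one first,
  ! and may be lost to rounding before the large terms cancel.
  elemental function big_first(big, x) result(r)
    real(dp), intent(in) :: big, x
    real(dp) :: r
    r = (big + x) - big
  end function big_first

  ! Regrouped: the large terms cancel exactly before the small one is added.
  elemental function cancel_first(big, x) result(r)
    real(dp), intent(in) :: big, x
    real(dp) :: r
    r = x + (big - big)
  end function cancel_first
end module paren_demo
```

With big = 2d0**53 and x = 1d0, big_first returns 0 and cancel_first returns 1, so a compiler which regroups one form into the other changes the answer.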
IEEE vs FORTRAN Rounding
Modern architectures allow considerably faster execution of IEEE style rounding than of FORTRAN style for functions such as NINT() and ANINT(). In most situations, the IEEE style is at least as useful. The primary difference is in the rounding of halfway cases such as +-0.5, 2.5, 4.5... where FORTRAN dictates rounding away from 0 and IEEE rounds to the nearest even value. A fast implementation of ANINT() may round X to even for X > 1/EPSILON(1d0). Ideally, a compiler will allow a selection option here which is not tied to other options.
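The difference between the two rounding rules can be seen with a short sketch. It assumes a compiler which supplies the Fortran 2003 IEEE_ARITHMETIC intrinsic module and runs in the default round-to-nearest mode; the wrapper names are hypothetical.

```fortran
module rounding_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_rint
  implicit none
  integer, parameter :: dp = kind(1d0)
contains
  ! Fortran rule: halfway cases round away from zero.
  function round_fortran(x) result(r)
    real(dp), intent(in) :: x
    real(dp) :: r
    r = anint(x)
  end function round_fortran

  ! IEEE rule in the default rounding mode: halfway cases round to even.
  function round_ieee(x) result(r)
    real(dp), intent(in) :: x
    real(dp) :: r
    r = ieee_rint(x)
  end function round_ieee
end module rounding_demo
```

For 2.5d0 the Fortran rule gives 3 and the IEEE rule gives 2; for 3.5d0 both give 4.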
Code Size Optimization
Options which control the expansion of f90 code generation are worth trying. They may generate faster as well as more compact code.
Static Library Functions
On systems which require extra time to call a dynamically linked library function, there usually are ways to select static linking for critical functions, without linking in too many undesired static functions. For example, some systems accept linking options like "f90 *.o -B STATIC -lm -B DYNAMIC" which links the math library statically but returns to dynamic linking for the rest. Another possibility is to specify the static version of a library before allowing default linking e.g.:
f90 *.o /usr/lib/libm.a
Of course the location of the library file will vary. Or you may wish to extract the object modules which are called most frequently and include them in your object module list.
Compilers typically have options to specify the maximum amount of unrolling and some limit on size of unrolled code, either in terms of generated instructions (best, but architecture dependent) or source code operation count. Default settings probably are ones which worked in SPECmarks. One or another limit may be too high or too low for your application. For the sections of code which don't show up in the profile, it's probably good to reduce unrolling to get faster compilation and smaller code.
The top limit of useful unrolling is probably related to the ratio of addition latency to instruction issue rate. For example, the PA8000 architecture takes 3 cycles to complete addition, and can issue 3 integer and 3 floating point instructions during this time, so it may be useful to unroll by 6 if there are no parallelizable operations within one loop iteration. This is for the usual style of unrolling, such as the HPUX or gnu compilers perform. Software pipelining compilers such as SGI's may consider unrolling to be the number of loop iterations between count tests, but perform much additional unrolling.
Unrolling by more than the loop count is useless; that ensures that the "optimized" version of the loop won't be executed. Ideally, the loop count should be divisible by the unrolling factor.
The odd iterations (0 to 5 of them when unrolling by 6) may be performed before ("pre-conditioning") or after ("clean-up") the main unrolled loops. When the unrolling factor is other than a power of 2, the clean-up position is preferred, to avoid the considerable penalty incurred in calculating the remainder by division. If this were written out in Fortran source code, using a DO index increment of 6, the division would be performed anyway.
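For readers who have not seen it written out, the clean-up arrangement looks roughly like the following hand-unrolled sketch. It unrolls by 3 only to keep the listing short, and the routine name is invented for this example. As noted above, writing it in source does not avoid the division, since the DO trip count is still computed from the increment.

```fortran
module unroll_demo
  implicit none
  integer, parameter :: dp = kind(1d0)
contains
  ! y = y + a*x, main loop unrolled by 3, leftovers handled afterwards.
  subroutine axpy_unroll3(a, x, y)
    real(dp), intent(in) :: a, x(:)
    real(dp), intent(inout) :: y(:)
    integer :: i, k, n
    n = min(size(x), size(y))
    do i = 1, n - 2, 3
      y(i)   = y(i)   + a*x(i)
      y(i+1) = y(i+1) + a*x(i+1)
      y(i+2) = y(i+2) + a*x(i+2)
    end do
    ! On exit, i is the first index the unrolled loop did not cover.
    do k = i, n            ! clean-up: 0 to 2 remaining iterations
      y(k) = y(k) + a*x(k)
    end do
  end subroutine axpy_unroll3
end module unroll_demo
```

If n is smaller than 3, the unrolled loop is skipped entirely and only the clean-up loop runs, which is the situation described above for unrolling by more than the loop count.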
Compilers which accept varying amounts of unrolling may switch between pre-conditioning and clean-up orders according to the amount of unrolling. As a result, it may be found that unrolling by 4 is faster or slower than unrolling by either 3 or 5. Other compilers will choose to unroll by the nearest even value. If unrolling is controlled loop by loop, it is likely to be done with directives placed immediately before the DO.
Certain compilers use the unrolling factor as the sum and comparison interleaving factor. Then, if unrolling is done by more than the minimum amount needed for interleaving, the length of the cleanup code is increased. This may be more than offset by increasing opportunities for out-of-order execution.
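Sum interleaving can be sketched by hand as below. This is an illustration with an interleaving factor of 4 and an invented function name, not what any particular compiler generates. Because the additions are grouped differently from a serial sum, the result may differ from it in the last bits.

```fortran
module sum_demo
  implicit none
  integer, parameter :: dp = kind(1d0)
contains
  ! Sum with 4 independent partial sums, so that successive additions
  ! do not have to wait for one another.
  pure function sum_interleaved(x) result(total)
    real(dp), intent(in) :: x(:)
    real(dp) :: total, s1, s2, s3, s4
    integer :: i, k, n
    n = size(x)
    s1 = 0; s2 = 0; s3 = 0; s4 = 0
    do i = 1, n - 3, 4
      s1 = s1 + x(i)
      s2 = s2 + x(i+1)
      s3 = s3 + x(i+2)
      s4 = s4 + x(i+3)
    end do
    total = (s1 + s2) + (s3 + s4)
    do k = i, n            ! clean-up: 0 to 3 remaining elements
      total = total + x(k)
    end do
  end function sum_interleaved
end module sum_demo
```

Raising the interleaving factor lengthens the clean-up loop (up to one less than the factor), which is the cost mentioned above.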
Size of unrolled code is an issue both in terms of disk and instruction cache space, and because the compiler runs out of registers in code generation. As mentioned above, it is better (on a system with register remapping) if a compiler simply reuses registers as much as possible, rather than generating spills.
Where outer unrolling is employed, the product of inner and outer unrolling is a useful parameter. For instance, the PA8000 may work well with manual outer unrolling (by 2) and compiler inner unrolling by 3.
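Manual outer unrolling by 2 might look like this sketch of a matrix-vector product, where the inner loop is left for the compiler to unroll. The routine name is hypothetical. Each pass of the inner loop loads and stores y(i) once while doing two multiply-adds.

```fortran
module outer_unroll_demo
  implicit none
  integer, parameter :: dp = kind(1d0)
contains
  ! y = A*x with the outer (column) loop unrolled by 2 by hand.
  subroutine matvec_outer2(a, x, y)
    real(dp), intent(in) :: a(:,:), x(:)
    real(dp), intent(out) :: y(:)
    integer :: i, j, k, m, n
    m = size(a, 1)
    n = size(a, 2)
    y = 0
    do j = 1, n - 1, 2
      do i = 1, m          ! inner loop: compiler may unroll further
        y(i) = y(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
      end do
    end do
    do k = j, n            ! clean-up for an odd number of columns
      do i = 1, m
        y(i) = y(i) + a(i,k)*x(k)
      end do
    end do
  end subroutine matvec_outer2
end module outer_unroll_demo
```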
Compilers may offer options to turn generation of pre-fetch instructions on or off. A pre-fetch instruction is intended to initiate the movement of a block of data into cache before it is to be used. Typically, a group of pre-fetch instructions will pick up data for 16 loop iterations, in which case the optimum situation is to issue a pre-fetch only once per 16 iterations, best done by a large amount of unrolling. Pre-fetch is most likely to be useful in loops which advance through memory by strides greater than one, but it will increase the demand for cache space, and may drive data out of cache which are about to be needed. The vendors' default options usually work best in most situations.
We suggest avoiding pre-inversions, such as the automatic conversion of x/sqrt(dprod(x,x)+dprod(y,y)) to x * (1/sqrt(...)), as these cost accuracy and seldom show a significant improvement in speed. Compile-time inversion of constants which may be inverted exactly is a useful optimization, which should not be bundled with inversion of divisions in ways which result in inaccurate rounding.
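The accuracy cost comes from rounding twice instead of once. A minimal sketch, assuming IEEE double precision and a compiler which does not itself rewrite the division (function names are invented for the example):

```fortran
module inversion_demo
  implicit none
  integer, parameter :: dp = kind(1d0)
contains
  ! Direct division: a single correctly rounded operation.
  elemental function divide_direct(x, d) result(q)
    real(dp), intent(in) :: x, d
    real(dp) :: q
    q = x / d
  end function divide_direct

  ! Pre-inverted form: the reciprocal is rounded, then the product is.
  elemental function divide_by_reciprocal(x, d) result(q)
    real(dp), intent(in) :: x, d
    real(dp) :: q, r
    r = 1 / d
    q = x * r
  end function divide_by_reciprocal
end module inversion_demo
```

For x = d = 49d0 the direct form returns exactly 1, while the pre-inverted form returns a value one unit in the last place below 1. When d is a power of 2, such as 4d0, the reciprocal is exact and the two forms agree.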
Some compilers have an option to generate simplified in-line code for complex arithmetic, failing to take precautions against over- or under-flow. There shouldn't be any problem with IEEE single precision or on Intel architectures, as the way to go is simply to promote the operation into the next higher precision, which has more than enough range. If the promotion is not used, careful testing is indicated.
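To show what promotion buys, the sketch below applies the same textbook formula for complex division twice: once entirely in single precision, and once with the operands promoted to double. The names are hypothetical, and the single precision version assumes intermediate results really are held in single precision (not true of '387 extended registers).

```fortran
module complex_div_demo
  implicit none
  integer, parameter :: sp = kind(1.0), dp = kind(1d0)
contains
  ! Simplified formula in single precision: c*c + d*d overflows once
  ! |c| or |d| exceeds roughly 1.8e19.
  function cdiv_naive(p, q) result(r)
    complex(sp), intent(in) :: p, q
    complex(sp) :: r
    real(sp) :: a, b, c, d, den
    a = real(p); b = aimag(p)
    c = real(q); d = aimag(q)
    den = c*c + d*d
    r = cmplx((a*c + b*d)/den, (b*c - a*d)/den, kind=sp)
  end function cdiv_naive

  ! Same formula with operands promoted to double, which has ample range
  ! for any single precision inputs.
  function cdiv_promoted(p, q) result(r)
    complex(sp), intent(in) :: p, q
    complex(sp) :: r
    real(dp) :: a, b, c, d, den
    a = real(p, dp); b = real(aimag(p), dp)
    c = real(q, dp); d = real(aimag(q), dp)
    den = c*c + d*d
    r = cmplx((a*c + b*d)/den, (b*c - a*d)/den, kind=sp)
  end function cdiv_promoted
end module complex_div_demo
```

Dividing (1e30, 1e30) by itself, the promoted version returns (1, 0), while the single precision version overflows in the denominator and does not produce a usable result.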
In-lining options should be enabled only where they are known to be beneficial. The most likely such situation is where a simple function with little internal branching is called in a loop. Such a function may be written internally with CONTAINS if it is too complicated for the old-fashioned statement function.
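A small internal function of the kind meant here could be written as follows. This is only a sketch with invented names. The internal function sees the host's lo and hi by host association, and the compiler has the whole function body in view when it compiles the loop.

```fortran
module contains_demo
  implicit none
  integer, parameter :: dp = kind(1d0)
contains
  ! Limit every element of x to the range [lo, hi].
  subroutine clamp_all(x, lo, hi)
    real(dp), intent(inout) :: x(:)
    real(dp), intent(in) :: lo, hi
    integer :: i
    do i = 1, size(x)
      x(i) = clamp(x(i))   ! simple call in a loop: an in-line candidate
    end do
  contains
    ! Internal function: no branching, uses lo and hi from the host.
    pure function clamp(v) result(c)
      real(dp), intent(in) :: v
      real(dp) :: c
      c = min(max(v, lo), hi)
    end function clamp
  end subroutine clamp_all
end module contains_demo
```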
Certain systems use a faster interface to intrinsic functions when an automatic in-line option is invoked.
Among the standard intrinsics, some of the more likely candidates for in-line expansion are:
NINT, simplified ANINT:
ELEMENTAL FUNCTION anint(x)
! accept IEEE rounding, and possible large even results
! on extended precision systems, use instead the corresponding EPSILON
Problem: on Intel '387, the only satisfactory way is to use the internal rounding instructions.
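The function body is not shown above. One common way to get the behavior those comments describe is to add and subtract 1/EPSILON, sketched here under a different name so as not to shadow the intrinsic. This is an assumption about the intended technique, not a reconstruction of the original listing. It relies on IEEE double precision arithmetic with no extended precision intermediates, and on the compiler observing the parentheses.

```fortran
module fast_anint_demo
  implicit none
  integer, parameter :: dp = kind(1d0)
contains
  elemental function fast_anint(x) result(r)
    real(dp), intent(in) :: x
    real(dp) :: r
    ! 2**52 for IEEE double: at this magnitude the spacing of numbers is 1
    real(dp), parameter :: big = 1/epsilon(1d0)
    ! accept IEEE rounding, and possible large even results
    ! The sum is rounded to an integer before big is subtracted again.
    r = (x + sign(big, x)) - sign(big, x)
  end function fast_anint
end module fast_anint_demo
```

Halfway cases follow the IEEE rule, so 2.5d0 gives 2 where the ANINT intrinsic gives 3. For magnitudes above 1/EPSILON the sum has a spacing of 2, which is where the large even results come from.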
TAN() is a reasonable in-line candidate (no branching). Other trig functions, if written in full generality, are not such good candidates for in-line, and should make use of the PURE function treatment even if programmed normally. Intel '387 processors make an entirely different story about in-line intrinsic selection, as discussed below.
Other than the traditional allocation of data in such a way that the compiler can choose proper alignment (don't misuse COMMON), the benefit of special data alignment options is doubtful. In a situation where various types of operation are performed on the data, special data alignments which speed access in one place, by improving mapping to cache, may hurt in another. Performance requirements for alignment may be expected; for example, 64-bit transfers from L2 cache on the Pentium are possible only for data stored at addresses which are multiples of 8 bytes. A good compiler will observe these alignments unless forced by the programmer to do otherwise. A common exception is gnu compilers, which default to 4-byte alignments.