FORTRAN Performance Tuning co-guide

Copyright (C) 1998 Timothy C. Prince

Freely distributable with acknowledgment



Chapter 7
Some Current Architectures; their influence on Fortran programs

Alpha:  a Digital Equipment/Intel architecture noted for high clock speed and long pipelines
Operating systems: VMS, NT, Unix, Linux
Usually uses 3 levels of cache and double precision (no single) floating point registers, and has a hardware implementation of SIGN() (f95 style), so that ABS(x) is the same as SIGN(x,1.).
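The ABS/SIGN equivalence noted above can be checked with a minimal sketch like the following. The module and function names (sign_abs_demo, abs_via_sign) are hypothetical, chosen only for illustration; whether a given compiler actually maps either form onto the Alpha sign-copy hardware is not something this code can show.

```fortran
module sign_abs_demo
  implicit none
contains
  ! ABS written as a sign transfer: magnitude of x, sign taken from +1.
  ! (Hypothetical helper, for illustration only.)
  elemental function abs_via_sign(x) result(y)
    double precision, intent(in) :: x
    double precision :: y
    y = sign(x, 1.0d0)
  end function abs_via_sign
end module sign_abs_demo

program show_sign_abs
  use sign_abs_demo
  implicit none
  double precision :: v(4)
  integer :: i
  v = (/ -2.5d0, 0.0d0, 3.0d0, -1.0d-300 /)
  ! Print each value with both forms of the absolute value side by side.
  do i = 1, 4
    print *, v(i), abs(v(i)), abs_via_sign(v(i))
  end do
end program show_sign_abs
```

The two right-hand columns should agree for every input, which is all the equivalence claims.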

HP-PA (Hewlett-Packard Precision Architecture)
Operating Systems: Unix, NT
Uses a single level, direct mapped cache with separate instruction and data caches, 31 double precision registers accessible as pairs of single precision registers, plus (PA8000) an equally large set of shadow registers, and out-of-order execution. Has fused multiply-add with unrounded multiplication (PA8000). Add and multiply on the PA8000 have 3 cycle latency, and the PA8000 has 2 parallel divide/sqrt units.
PA8000 favors strongly:
    stride 1 inner loops
    extensive loop fusion and 2 level unrolling
    invariant-if code movement
    expression ordering for multiply-add chaining and parallelism
    detailed testing of best unrolling for each loop
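As an illustration of the first two items in the list above, here is a minimal sketch of loop fusion combined with a stride 1 inner loop. Fortran stores arrays in column-major order, so the first subscript must vary fastest for stride 1 access. All names (stride_demo, two_pass_strided, fused_stride1, fused_matches_unfused) are hypothetical, and no timing claim is made for any particular machine.

```fortran
module stride_demo
  implicit none
contains
  ! Two separate passes; the inner loop runs over the second subscript,
  ! so consecutive iterations are n elements apart in memory.
  subroutine two_pass_strided(n, a, b, c, d)
    integer, intent(in) :: n
    double precision, intent(in) :: a(n,n), b(n,n)
    double precision, intent(out) :: c(n,n), d(n,n)
    integer :: i, j
    do i = 1, n
      do j = 1, n
        c(i,j) = a(i,j) + b(i,j)
      end do
    end do
    do i = 1, n
      do j = 1, n
        d(i,j) = a(i,j) * b(i,j)
      end do
    end do
  end subroutine two_pass_strided

  ! Fused into one pass; the inner loop runs over the first subscript,
  ! giving stride 1 access, and a(i,j), b(i,j) are loaded once per element.
  subroutine fused_stride1(n, a, b, c, d)
    integer, intent(in) :: n
    double precision, intent(in) :: a(n,n), b(n,n)
    double precision, intent(out) :: c(n,n), d(n,n)
    integer :: i, j
    do j = 1, n
      do i = 1, n
        c(i,j) = a(i,j) + b(i,j)
        d(i,j) = a(i,j) * b(i,j)
      end do
    end do
  end subroutine fused_stride1

  ! Check that both versions produce identical results on test data.
  function fused_matches_unfused(n) result(ok)
    integer, intent(in) :: n
    logical :: ok
    double precision :: a(n,n), b(n,n), c1(n,n), d1(n,n), c2(n,n), d2(n,n)
    integer :: i, j
    do j = 1, n
      do i = 1, n
        a(i,j) = dble(i + j)
        b(i,j) = dble(i - j)
      end do
    end do
    call two_pass_strided(n, a, b, c1, d1)
    call fused_stride1(n, a, b, c2, d2)
    ok = all(c1 == c2) .and. all(d1 == d2)
  end function fused_matches_unfused
end module stride_demo

program show_stride
  use stride_demo
  implicit none
  print *, 'fused and unfused agree: ', fused_matches_unfused(64)
end program show_stride
```

The two versions do the same arithmetic per element, so the results are identical; only the memory access order and the number of passes differ. Which unrolling depth works best on top of this still has to be measured loop by loop, as the last item in the list says.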
f90 support is a bit weak, and may remain so until a unified Intel/HP architecture comes to market.

Pentium Pro/II (Intel)
Operating systems: all Microsoft, Linux, Unix
Has a small but effective 2 level cache with separate 8K instruction and data caches at level 1, and 64-bit access to cache depending on data alignment. Floating point registers are 80-bit double extended, to protect accuracy of operations such as complex arithmetic, log(), and ASCII to binary conversion. The 8 programmable registers are backed by 32 shadow registers and out-of-order execution.  Latencies are 2 cycles for add and 5 for multiply, with multiply allowed to initiate only every other cycle. Many compiler vendors are active, with excellent f90 support and fast compiling, but lagging in optimization. Some weaknesses are slow dynamic memory allocation and sloppy timing in Microsoft OS, particularly when using GNU ports such as Cygnus.

MIPS/SGI R10000
Operating System: Unix
Large 2 level cache, with separate 32K level 1 instruction and data caches, 31 floating point registers with no extended precision, an equal set of shadow registers, and out-of-order execution.  Fused multiply-add of earlier models is replaced by IEEE-compliant single instruction multiply and add with full rounding.  Latencies are 2 cycles for multiply and for add.  Hardware cache miss and other event counters are used in the profiler, which is available for use with any compiler.  EPC and SGI both market compilers. SGI's compilers use software pipelining rather than the more common group unrolling.  Recent compilers use pseudo-instructions extensively in apparent readiness for future models which may have direct paths between integer and floating point registers.
Weaknesses: SGI compilers tend to generate excessive spills rather than using shadow registers effectively.  Default compiler options are somewhat strange; -OPT:IEEE_comparisons=ON:fold_reassociate=OFF is required to avoid unexpected re-interpretation of source code.
Strengths:  few latency bottlenecks in hardware, some good compilers available along with the bad, reputation for excellent graphics (including ready-to-run free Internet availability) and multi-processor shared memory systems.
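The re-association mentioned under Weaknesses matters because floating point addition is not associative. A minimal sketch follows, with hypothetical names (reassoc_demo, sum_left, sum_right) and input values chosen only to make the effect visible; it says nothing about what any particular SGI compiler release does with a given option setting.

```fortran
module reassoc_demo
  implicit none
contains
  ! Sum grouped as written from the left: (a + b) + c
  function sum_left(a, b, c) result(s)
    double precision, intent(in) :: a, b, c
    double precision :: s
    s = (a + b) + c
  end function sum_left

  ! Same three terms, grouped from the right: a + (b + c)
  function sum_right(a, b, c) result(s)
    double precision, intent(in) :: a, b, c
    double precision :: s
    s = a + (b + c)
  end function sum_right
end module reassoc_demo

program show_reassoc
  use reassoc_demo
  implicit none
  double precision :: a, b, c
  ! Two large terms that cancel, plus one small term.
  a = 1.0d20
  b = -1.0d20
  c = 1.0d0
  print *, '(a + b) + c = ', sum_left(a, b, c)
  print *, 'a + (b + c) = ', sum_right(a, b, c)
end program show_reassoc
```

With IEEE double precision and the parentheses honored, the first grouping keeps the small term and the second loses it, because 1.0 is below the rounding unit of 1.0d20. A compiler that regroups the sum on its own can therefore change the answer, which is why the option setting above is worth checking.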

PowerPC, an architecture developed by IBM, Motorola (and Apple?)
Operating systems: Unix (IBM only), MacOS and Linux (Apple only)
Normally uses 2 levels of cache, separate level 1 data and instruction cache.  Floating point register format is double precision only, with 32 program accessible registers and 4 "rename" (shadow) registers to support out-of-order execution.  When not compiling in backward compatibility mode, there are special instructions to support MAX/MIN (used in g77) and SIGN (not in g77).  The mass produced models lack hardware sqrt() and integer to double precision conversion.  603 style models are slow in double precision and integer multiplication.
There are fused multiply/add instructions without intermediate rounding; typically they are not used by compilers in the common A - B*C situation, possibly because this use would not be compatible with directed rounding. Choice of compilers is limited, and the future of this architecture is uncertain.

SPARC: Sun and Sun-licensed manufacturers
Operating systems:  Unix, Linux
32 double precision registers accessible as pairs of single precision registers.
A variety of compilers is available, with a long-standing GNU connection.