FORTRAN Performance Tuning co-guide

Copyright (C) 1998 Timothy C. Prince

Freely distributable with acknowledgment



Glossary of Terms Used in FORTRAN Tuning Co-Guide

alias: a possible hidden dependency, as when 2 arrays overlap or may do so.
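
A minimal sketch of why a possible overlap matters, assuming a hypothetical offset k that is known only at run time. The loop writes a(i+k) while reading a(i); with k = -1 every element is read before it is overwritten, but with k = +1 each iteration reads what the previous one wrote. An f90 array assignment always reads the whole right-hand side first, so it matches the loop only in the first case.

```fortran
program alias_demo
  implicit none
  integer, parameter :: n = 8
  real :: a(n+1), b(n+1)
  integer :: i, k

  do k = -1, 1, 2                 ! k stands in for a value known only at run time
     a = (/ (10.0*i, i = 1, n+1) /)
     b = a
     do i = 2, n                  ! serial loop: may read elements it wrote earlier
        a(i+k) = a(i) + 1.0
     end do
     b(2+k:n+k) = b(2:n) + 1.0    ! array assignment: all reads happen before any write
     print *, 'k =', k, ' loop and array assignment agree: ', all(a == b)
     if (all(a == b) .neqv. (k < 0)) stop 1
  end do
end program alias_demo
```

A compiler that cannot prove which case applies has to generate the cautious serial code, unless a directive such as IVDEP tells it otherwise.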

alpha: an architecture originated by Digital Equipment as the successor to VAX, production rights sold to Intel in 1997 in package deal settling patent infringement suit against Intel; future development presumably dependent on Compaq.

branch: an instruction which may cause the order of execution of compiled instructions to depart from sequential.

branch not taken: a conditional branch which does not cause the order of execution to depart from sequential; a common condition for forward branches (branches which skip ahead).

C9X: a future C standard, presented for public comment in 1998. Among other things, it aims to make C more competitive with FORTRAN. It includes control of floating point rounding modes, more standard math library functions, and the ability to declare non-aliased data areas. In turn, some of the new features should show up in gnu and f2000.

cache: Most modern computers have a multiple level system of progressively larger and slower data and program storage. Often, there is a Level 1 cache, which may be on the CPU chip, with perhaps 8K bytes each for program instructions and for data, and a Level 2 cache of 256K to 4M bytes. The cache is set up to speed access to data located in main memory immediately following a previous access.

cache line: the block of data transferred into or out of cache.

cache miss: when the data required by the program are not already in the cache, and must be copied from a slower place in the memory system.

clean-up loop: supposing that a loop is unrolled so that it executes a fixed number of loop iterations in each pass, a clean-up loop is needed to take care of the remaining iterations.
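
As an illustration, here is a sketch of a loop unrolled by 4 followed by its clean-up loop. The array names, the unroll factor, and the trip count of 10 are chosen only for the example.

```fortran
program cleanup_demo
  implicit none
  integer, parameter :: n = 10
  real :: x(n), y(n), a
  integer :: i, m

  a = 2.0
  x = 1.0
  y = (/ (real(i), i = 1, n) /)

  m = n - mod(n, 4)               ! largest multiple of 4 not exceeding n
  do i = 1, m, 4                  ! main loop, unrolled by 4
     y(i)   = y(i)   + a*x(i)
     y(i+1) = y(i+1) + a*x(i+1)
     y(i+2) = y(i+2) + a*x(i+2)
     y(i+3) = y(i+3) + a*x(i+3)
  end do
  do i = m + 1, n                 ! clean-up loop: the 0 to 3 left-over iterations
     y(i) = y(i) + a*x(i)
  end do

  if (any(y /= (/ (real(i) + 2.0, i = 1, n) /))) stop 1
  print *, y
end program cleanup_demo
```

With n = 10 the main loop handles elements 1 through 8 and the clean-up loop handles 9 and 10.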

Cray, Seymour: the late founder of a once famous company, now owned by SGI, which developed the vector super-computer market, and much of the FORTRAN compiler optimization and f90 implementation techniques; before that, involved at Control Data Corporation in originating many modern floating point hardware implementation techniques.

directive: a comment line which contains suggestions to the compiler, e.g.
CDIR$ IVDEP (Cray style: ignore possible aliasing dependencies)
C*$* UNROLL(3) (Kuck style: unroll next loop by 3)

direct mapped cache: one where the location of data in cache is determined uniquely by their storage location in main memory; may be fast, but prone to conflicts when several cache lines in use at once map to the same cache location and knock each other out of cache.
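
The mapping can be sketched with integer arithmetic. The sizes below (an 8K byte cache with 32 byte lines) and the two byte addresses are assumptions made for the example, not properties of any particular machine.

```fortran
program direct_map_demo
  implicit none
  integer, parameter :: line_bytes = 32, cache_bytes = 8192   ! assumed sizes
  integer, parameter :: nlines = cache_bytes / line_bytes
  integer :: addr_a, addr_b

  addr_a = 4096                     ! hypothetical byte address of a(1)
  addr_b = addr_a + 2*cache_bytes   ! b(1) lies a multiple of the cache size away
  print *, 'cache line for a(1):', mod(addr_a / line_bytes, nlines)
  print *, 'cache line for b(1):', mod(addr_b / line_bytes, nlines)
  if (mod(addr_a / line_bytes, nlines) /= mod(addr_b / line_bytes, nlines)) stop 1
end program direct_map_demo
```

Both addresses land on the same cache line, so a loop which alternates between a(i) and b(i) would keep evicting one to load the other.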

dynamic: something which changes during the course of execution of a program, according to the data being analyzed or the subroutines currently activated.

egcs (experimental gnu compiler system): a compiler suite including gcc and g77 development versions, with open participation in development.

EPC (Edinburgh Portable Compilers): a company prominent in f90 compilers.

extended precision: a floating point format which carries extra width both in mantissa and exponent, in order to protect against over/under-flow of intermediate results or loss of precision. IEEE double precision serves as extended single, although the minimum requirement is 43 bits, not 64. Intel, and that endangered species, the 68k series, have an 80-bit double extended format.
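
A small sketch of double precision serving as extended single, with values chosen for the example. The squares of x and y exceed the single precision range, so the first result normally overflows; a compiler which keeps intermediates in wider registers may avoid the overflow, so no particular output is promised for that line.

```fortran
program extended_demo
  implicit none
  integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)
  real(sp) :: x, y, r_single, r_extended

  x = 3.0e30_sp
  y = 4.0e30_sp
  r_single   = sqrt(x*x + y*y)      ! x*x is beyond the single precision range
  r_extended = real(sqrt(real(x, dp)**2 + real(y, dp)**2), sp)
  print *, 'single intermediates: ', r_single
  print *, 'double intermediates: ', r_extended
  if (abs(r_extended - 5.0e30_sp) > 1.0e25_sp) stop 1
end program extended_demo
```

The final result, about 5.0e30, fits comfortably in single precision; only the intermediate values needed the extra exponent range.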

f2c: a compiler/translator, derived from original Bell Labs f77, which translates f77 input to C output.

f66: original standard (10 years late) for the FORTRAN programming language.

f77: 1978 ANSI/1980 ISO standard FORTRAN, superseded by f90.
 
f90: short-lived standard, superseded by f95 while still immature.

f95: standard issued at the end of 1997.

fused MAC: a combined floating point multiply and add or subtract instruction, often implemented without intermediate rounding.

gnu: a family of Unix-like software sponsored by Free Software Foundation.

g77: an f77 compiler, originally part of gnu project of Free Software Foundation; also part of egcs development system; the most widely available FORTRAN.

HP-PA: a series of computer architectures developed by HP.

HPUX: Hewlett-Packard version of Unix operating system.

IEEE: Institute of Electrical and Electronics Engineers: among its many functions, a computer hardware standards making organization.

Instruction Level Parallelism: the ability of a processor to be working on more than one thread of execution by overlapping instructions. Typically this combines issuing multiple instructions per clock cycle with letting instructions continue to execute while subsequent instructions are issued; hardware synchronization stalls dependent instructions until "resources" are available. Degree of parallelism may be characterized in terms such as number of independent arithmetic units times latency divided by rate of instruction execution.

Intel: earliest and biggest player in microprocessor design and manufacture; a series of architectures designed in accordance with IEEE extended precision, a so-called Complex Instruction Set with small number of program-accessible registers and large number of specialized operations and immediate constants (contained in instruction).

Kuck, David: founder of a company known for selling optimizers and optimizing pre-processors for FORTRAN and C compilers.

latency: number of clock cycles required for an operation, from when the operands are available, until the results are available for another operation.

linux: a low priced family of somewhat Unix-like operating systems; often refers to the Intel-only version.

Livermore: a USA government-sponsored laboratory (in Livermore CA).

loop fission: splitting a loop into 2 or more loops.

loop fusion: combining 2 or more loops.
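
Fission and fusion are opposite directions of the same transformation, sketched here with made-up arrays. The two separate loops are what fission would produce from the combined one; the combined loop is what fusion would produce from the pair.

```fortran
program fusion_demo
  implicit none
  integer, parameter :: n = 5
  real :: a(n), b(n), c(n), c2(n), s, s2
  integer :: i

  a = (/ (real(i), i = 1, n) /)
  b = 2.0

  ! Two separate loops: each is simple, but a is read from memory twice.
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  s = 0.0
  do i = 1, n
     s = s + a(i)*a(i)
  end do

  ! One fused loop: a(i) is loaded once per iteration and serves both statements.
  s2 = 0.0
  do i = 1, n
     c2(i) = a(i) + b(i)
     s2 = s2 + a(i)*a(i)
  end do

  if (any(c /= c2) .or. s /= s2) stop 1
  print *, c2, s2
end program fusion_demo
```

Which form runs faster depends on the machine: fusion saves memory traffic, while fission may leave each loop with fewer registers in use or make one of them vectorizable.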

MIPS: a computer architecture owned by Silicon Graphics Inc.

NT: a Microsoft operating system, oriented toward networks of single-user computers, somewhat portable but with abortive support on non-Intel architectures.

optimization: result of analysis by a compiler to determine an efficient instruction sequence to implement a program.

out-of-order execution: ability of a processor to move ahead and execute instructions even though some instructions are being stalled until dependencies are resolved.

pipelining: the characteristic of being able to begin the execution of instructions before previous instructions have completed.

PowerPC: an architecture sponsored by IBM, Motorola, and Apple.

predication: the practice of calculating a result speculatively and keeping it in reserve until the program has determined whether to use or discard it.

pre-fetch: the practice of initiating data transfer in anticipation of later requirements -- a very old scheme as applied to reading sequential data files, now used in filling cache.

ratfor: a translator which converts ratfor, a semi-structured sort of cross between Fortran and C, into Fortran.

recursion: a sequence of operations which depends on results of predecessor sequences, i.e. results of one loop iteration are needed in the next.
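
A running sum is perhaps the simplest case. In this sketch (array names are arbitrary) each a(i) cannot be computed until a(i-1) is finished, so the iterations cannot be overlapped freely.

```fortran
program recurrence_demo
  implicit none
  integer, parameter :: n = 6
  real :: a(n), b(n)
  integer :: i

  b = 1.0
  a(1) = b(1)
  do i = 2, n
     a(i) = a(i-1) + b(i)         ! needs a(i-1) from the previous iteration
  end do

  if (any(a /= (/ (real(i), i = 1, n) /))) stop 1
  print *, a
end program recurrence_demo
```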

register: fast temporary data storage location, directly accessible to the corresponding arithmetic processor.

register remapping: ability of a processor to find another register to substitute for one specified by compiled code, to avoid stalling while that register is busy.

set associative cache: a cache which has several possible locations in cache for a given cache line from main memory, to reduce the frequency of cache mapping conflicts.

shadow register: a register which is assigned dynamically to substitute for one specified by a compiled program; a way for hardware to resolve dependencies arising from a program trying to use a register for more than one purpose.

shift: a low-level integer instruction which moves the bits over a specified number of positions; under certain circumstances, equivalent to multiplication or division by a power of 2. In some architectures, there are combined shift and add instructions to assist in low-level expansion of multiplication by a constant.
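
The equivalences can be tried with the ISHFT intrinsic, as in this sketch for a non-negative value. ISHFT is a logical shift, so the division equivalence does not carry over to negative operands.

```fortran
program shift_demo
  implicit none
  integer :: i

  i = 13
  print *, ishft(i, 3), i*8                    ! left shift by 3 multiplies by 8
  print *, ishft(i, -2), i/4                   ! right shift by 2 divides by 4
  print *, ishft(i, 3) + ishft(i, 1), i*10     ! shift and add: 8*i + 2*i = 10*i
  if (ishft(i, 3) /= i*8 .or. ishft(i, -2) /= i/4) stop 1
  if (ishft(i, 3) + ishft(i, 1) /= i*10) stop 1
end program shift_demo
```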

SPARC: a computer architecture sponsored by Sun Microsystems.

speculative execution: executing code before determining whether the result will be used.

spill: when a program wants to use more registers than are available, it has to store some temporarily; particularly burdensome with out-of-order execution and register remapping.

static: something which is a fixed feature of a program, not depending on what data set is executed.

stride: size of increment in array subscripts, measured in storage units; incrementing the second subscript by one gives a stride of the size of the first dimension.
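
The following sketch, with arbitrary dimensions 4 by 3, checks the storage order directly: RESHAPE returns the elements in the order they are stored, and a(i,j) is found (i-1) + (j-1)*m elements from the start.

```fortran
program stride_demo
  implicit none
  integer, parameter :: m = 4, n = 3
  integer :: a(m, n), flat(m*n), i, j

  do j = 1, n
     do i = 1, m                  ! inner loop on the first subscript: stride 1
        a(i, j) = 10*i + j
     end do
  end do

  flat = reshape(a, (/ m*n /))    ! elements in storage order
  do j = 1, n
     do i = 1, m
        if (flat(1 + (i-1) + (j-1)*m) /= a(i, j)) stop 1
     end do
  end do
  print *, 'a(1,1) and a(1,2) are', m, 'elements apart:', flat(1), flat(1 + m)
end program stride_demo
```

Putting the first subscript in the inner loop, as in the initialization above, keeps the stride at 1 and makes the best use of each cache line.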

struct: a translator which takes f66 input and makes ratfor output; part of Unix V7; may be used to restructure obsolescent style Fortran code.

tlb (Translation Look-aside Buffer): largely mysterious to us FORTRAN programmers, a small table in which the processor keeps recently used translations from virtual to physical memory page addresses; a potential source of serious performance problems when a program touches more pages of data in quick succession than the tlb has entries for, as with large strides.

Unix: a family of portable operating systems, written mostly in C, originated at Bell Labs.

unrolling: a technique employed by the programmer or the compiler, where multiple copies of a section of code are written out; in the simplest case, rather than a loop which executes 2 or 3 times, write out straight-line code.
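
The simplest case might look like this sketch, a dot product of two 3-element vectors (the values are arbitrary), written first as a loop and then as straight-line code.

```fortran
program unroll_demo
  implicit none
  real :: u(3), v(3), d_loop, d_unrolled
  integer :: i

  u = (/ 1.0, 2.0, 3.0 /)
  v = (/ 4.0, 5.0, 6.0 /)

  d_loop = 0.0
  do i = 1, 3                     ! loop form: counter and branch on every pass
     d_loop = d_loop + u(i)*v(i)
  end do

  d_unrolled = u(1)*v(1) + u(2)*v(2) + u(3)*v(3)   ! fully unrolled: no loop overhead

  if (d_loop /= d_unrolled) stop 1
  print *, d_loop, d_unrolled
end program unroll_demo
```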

VAX: a computer architecture, now obsolescent, originated by Digital Equipment Corporation.

vectorizable: describes a sequence of identical operations on strings of data, with no dependency of any sequence on its predecessor; terminology from the Cray 1 super-computer era; resembles the criteria for f90 array assignments.
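
A sketch of a vectorizable loop next to the equivalent f90 array assignment; the arrays and constants are made up for the example.

```fortran
program vector_demo
  implicit none
  integer, parameter :: n = 6
  real :: a(n), b(n), c(n), d(n)
  integer :: i

  a = (/ (real(i), i = 1, n) /)
  b = 0.5

  do i = 1, n
     c(i) = a(i) + 2.0*b(i)       ! no iteration uses a result from another
  end do

  d = a + 2.0*b                   ! the same operation as an f90 array assignment

  if (any(c /= d)) stop 1
  print *, d
end program vector_demo
```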