ftune.html

FORTRAN Performance Tuning co-Guide

Freely distributable with acknowledgment

Chapter 6
Math Functions

Trigonometric Identities

Many programmers appear unaware of the advantages of avoiding function calls. We see stuff such as sin(atan(x/y) where x/sqrt(x*x+y*y) would be much faster and more accurate. The careful programmer would use x/sqrt(dprod(x,x)+dprod(y,y)). If you're on an Intel architecture, dprod won't make any difference, but it will help make sure your code runs the same on other architectures.

With a properly implemented library, atan2(x,y) will be as as fast as or faster and more accurate than atan(x/y), depending on the argument values.

We suggest adoption of the scheme which implements sin() and cos() in a function which returns both values. Thus, it is valuable to have a compiler which recognizes sin(x) and cos(x) as a form of common sub-expression. This should be no more difficult than the recognition of common sub-expressions in i/j and mod(i,j). One of the quality criteria for implementation of sin/cos is that the accuracy of sin(x)**2+cos(x)**2 should not deteriorate for large x, and this requires that both functions should use the same range reduction. Thus, sin() and cos() share their range reduction, while the rest is largely parallelizable.

Exponentiation

In view of the identity exp(x) = 1/exp(-x), division should be eliminated when it is possible to do so by changing sign in the argument. The same holds for 1/y**x = y**(-x).

The programmer should take advantage of commonality between expressions such as y**x and y**(x-1), where y**x may be replaced by y*y**(x-1). Likewise, x**1.5 is better replaced with x*sqrt(x).

Programmers often forget that FORTRAN makes no guarantee that x**2. will be implemented as x**2 which should be expanded x*x. This gives rise to portability trouble with non-positive x.

A good implementation of x**.5 will satisfy obvious identities such as (x**2)**.5 == x, but an IEEE compliant sqrt() gives better results in general. If sqrt() is implemented in firmware, it should be faster as well. While some compilers may recognize **.5 as sqrt(), FORTRAN makes no general guarantee of this.

Complex

Complex arithmetic make a good case for extended precision, yet libraries often fail to take advantage of it. It's easy enough to provide in-line eligible functions which promote to the next higher precision rather than writing a bunch of special cases as precautions against out-of-range intermediate values:

cabs(x,y) = sqrt(dprod(x,x)+dprod(y,y))

Implementation (General Comments)

Intrinsic functions are usually implemented in C, or at least with a C compatible interface. Technical reasons for this:

1. facilitates mixed language programming, sharing the C libraries

2. C call by value is a useful short-cut for math functions

3. asm() extensions usually are provided in C, so we can do lower level coding, particularly important on Intel which has firmware implementation of many math functions.

Actually, complex functions in f2c and g77 have FORTRAN compatible calls, but any required special functions such as combined sin/cos or sinh/cosh must also be written to be FORTRAN callable if the usual C is to be replaced with FORTRAN.

Certain FORTRANs have some poorly implemented math functions, particularly in single precision. I have tried to supply a reasonable complete set at members.aol.com/n8tm/.

Specific Intrinsic Functions

This section contains some comments on various intrinsic functions, either hints on how to use them or replace them with source code, or qualities which good implementations should show.

ACOS/ASIN

These functions were available only as vendor-dependent extensions, often named ARCOS/ARSIN, prior to f77. For portability, one had to use something like

ACOS(X) = ATAN2(SQRT(1 - DBLE(X)**2),X)
ASIN(X) = ATAN2(X,SQRT(1 - DBLE(X)**2))

Unfortunately, it is not possible to calculate any of these functions in terms of another, without loss of precision. Extended precision on the Intel '387 allows these formulae to work efficiently in the single and double (but not full extended) precision cases.

A good method is to use a series approximation to asin(x) for |x| <=.5 and the transformations acos(x) = Pi/2 - asin(x) = 2*asin(sqrt((1-x)/2)), asin(x) = - asin(-x), as needed for each sub-range. The series may be found by Chebyshev economization of the approximation to asin(x)/x, taking enough terms so that the leading coefficient is greater than 1 - .125*EPSILON(x).

Among the performance advantages of this scheme is that sqrt() is needed in only half of the sub-ranges, and, as all the terms are odd powers of the result of sqrt(), most of the polynomial may be evaluated in parallel with the evaluation of sqrt(). Of course, it is unusual to see an application where the speed of asin() is a significant issue.

The elefunt test suite is an interesting indirect test of the accuracy of sqrt(), as it is sensitive to the style of rounding used in sqrt().

ANINT

This function was not available in f66; the closest would have been

ANINT(X) = AINT(X+SIGN(.5,X))

AINT(), in turn, can be rather complicated on architectures which do not have a special instruction for this purpose. For example, it might be implemented as

FUNCTION AINT(X)
IF(X < 0)then
AINT = FLOOR(1-EPSILON(X)+X)
else
AINT = FLOOR(X)
endif
RETURN
END

The standards committee decided to define ANINT accordingly rather than agree with the decision of the IEEE P754 committee. In order to take full advantage of IEEE compliant systems, one has to accept the idea of rounding to nearest even, which is contrary to FORTRAN standard.

Most current architectures likewise have a round to integer instruction which differs from NINT in the manner of rounding numbers like +- 0.5,2.5, ...

Intel coprocessors have built-in instructions for implementing ANINT() and NINT(). AINT() and INT() are much slower, as the rounding modes have to be saved, changed, and reset, after flushing pipelines. As on the
PPC architecture, there is no short-cut for

i = nint(x)
y = i

so it is faster to use

y=anint(x)
i=nint(x) ! for Intel
i =int(y) ! for PPC, in case nint() is not implemented directly

ATAN2

As suggested above, ATAN should be treated as a special case of ATAN2, and ATAN2 is preferred where there is division in the argument. A good implementation will show results with worst relative error of EPSILON(x) according to elefunt. There is some trading of accuracy in one range against another.

A single precision implementation doesn't require so many terms, and shows a significant performance advantage on many systems. In some implementations, including g77, atan2 in the single precision case can be made to promote the arguments automatically to double precision (traditional C style), producing correctly rounded results for expressions such as atan2(x*2,1-x**2), without the time penalty incurred by full double precision.

A few exceptional cases have inadequately defined results. atan2(0.,0.) was specified as NaN in IEEE P754, and is the easiest result to obtain, as it naturally falls out as 0./0. Now the traditional result of 0. is considered more correct. FORTRAN says the result is "processor-dependent."

The Intel '387 instruction FPATAN implements atan2() efficiently, but unexpected NaN's may be produced when it is written directly in line. The '287 version required preliminary argument reductions which would be slow and less accurate than the firmware included in later versions, so it is important to link an appropriate library.

CABS

On systems with IEEE extended range and firmware sqrt(), there is a big advantage in making CABS() an in-line function. Where there is no extended range, precautions against over/under-flow should be observed. The traditional way is to divide through the arguments of sqrt() by the larger of the arguments, but improvements may be made by using a scale factor which is an exact square.

CCOS, CSIN, CEXP

Functions which return both sin and cos, and sinh and cosh, with one call, are important for performance here.

A common deficiency in sin() shows up with arguments having real part of 110 or 220. The usual problem is failure to correct for the limited precision of the value of Pi used in range reduction. To fix this, either use full extended precision, or feed an error correction into the polynomials. Attention to this is important for applications which use these functions with arguments having real parts outside the first and last quadrants.

Without extended precision, relative errors up to 2.2*EPSILON may be expected in ccos/csin. With extended precision, they can be held to 1.3*EPSILON.

CLOG

This may be the usual worst implemented function in FORTRAN libraries. Instead of the traditional log(sqrt(a)), .5*log(a) should be used, with a scale factor treatment where needed to protect against over/under-flow. Maximum relative errors of EPSILON(x) should be achieved. As this function is required for the complex ** operation, improvements will carry over there.

Using cabs(), relative errors up to 1.8*EPSILON are observed. With the better implementation, these are held to EPSILON. Proper use of Intel '387 built-in transcendental functions allows full accuracy in double precision without loss of speed.

COS/SIN

As noted above, the most common deficiency is in range reduction for arguments outside the 1st and 4th quadrants. drem() functions are no help here.

As the range reduction should be identical for sin() and cos(), and the octant is not even known until range reduction is complete, it makes sense to combine the two functions. With all the operations required to convert the low order bits of a double precision integer to a plain integer without possible overflow, both the sin() and cos() series may be nearly done by the time it's known which result is to be returned as sin() and which cos(), and with what sign. No significant time is saved on Instruction Level Parallel architectures by reducing precision or using separate sin() and cos(), even when only 1 result is to be used.

Without special consideration by the compiler, we can't take advantage of combined sin/cos while using standard source code, but the combined calls may be used in ccos/csin/cexp.

Intel '387 built-in functions use a 66-bit precision value of pi for range reduction, and they work well as a single instruction compiled in line. Arguments with magnitude > 2**64 are passed through without change (surprise!). Errors up to 1.3 ULP are seen within range on the '387, which is excellent, although not as good as Intel claims.

COSH/SINH

Sinh requires a series approximation to preserve accuracy and improve speed for small argument values. We may as well use a corresponding series for cosh, to take advantage of parallelism. When the argument is large enough to use exp(), the commonality between sinh and cosh is evident. As with COS/SIN, many systems can produce both results as fast as either one.

The often recommended procedure of using a function which returns (exp(x)-1) in place of the sinh() series for small values is not as successful, according to the author's trials.

CPU_TIME

The only standard Fortran way to measure processor time. Functionally identical to the Cray style SECOND call, and may also be used to replace etime() and SECNDS() which were typical f77 vendor library calls, or direct calls to clock() in the C library. Closest call in f90 is SYSTEM_CLOCK(), which has undefined integer wrap-around behavior, and measures elapsed rather than processor time. First g77 implementation was faulty.

CSQRT

As this depends on CABS, that must be fixed for best results. Also, time may be lost by failing to get in-line code for ABS() and SIGN() or the C equivalents.

DATE_AND_TIME

We suggest using the standard form of intrinsic calls, replacing them with source code if necessary on a given platform. This one is a little off-beat in that it includes one result (milliseconds field) which is not derivable from any standard C library call and is omitted from many implementations, even those which have a C library call which gives an excellent result. g77 implementations will not allow the optional argument usage using keywords.

EXP

A significant advantage is gained on many architectures by having a single extended precision version, as opposed to always using double precision.

The precautions required to minimize rounding error are intricate. Some of them are missed in publically available source code. Results of much experimentation in matching results with good proprietary libraries are shown in the versions at members.aol.com/n8tm

Prior to f90, it was not possible to write functions such as exp() and log() with any pretense of portability. The EXPONENT FRACTION TRANSFER and SCALE functions address this situation. Unless they produce in-line code, they are unlikely to offer satisfactory performance.

On Intel '387 processors, range reduction is not included in the firmware. If short cuts are taken, NaN may be produced erroneously in cases of under- or over- flow, and arguments outside the double precision range may produce less than double precision accuracy.

Inverse Hyperbolic

These functions have not been included in FORTRAN standards, but they appear in C9X. Their complexity (or lack of it) is similar to the hyperbolic functions. We suggest at least a minimal allowance for cases which will overflow or lose all accuracy if not treated specifically:

function acosh(x)
if(x > 2/epsilon(x))then
acosh = log(x) + log(2d0)
else
acosh = log(x + sqrt(dprod(x,x)-1))
endif
end

function asinh(x)
if(abs(x) > 2/epsilon(x))then
asinh = sign(log(abs(x))+log(2d0),x)
elseif(x**2 < epsilon(x))then
asinh = x ! use series here to widen speedy range and improve accuracy
else
asinh = sign(log(1d0+abs(x)+dprod(x,x)/(1+sqrt(1+dprod(x,x)))),dble(x))
endif
end

function atanh(x)
if(abs(x) < sqrt(epsilon(x)))then
atanh = x ! series here
else
atanh = .5*log(1 + x*2/(1d0 - x))
endif
end

LOG

Prior to f77, this function was named ALOG / DLOG. Some traditional implementations of log() are still around which don't work for subnormal numbers, or don't preserve accuracy for log10(). log(x) should be able to achieve maximum relative errors of EPSILON(x), and log10(x) should be able to beat 2.1*EPSILON(x). Simply calculating log10(x) as log(x)/log(10d0) gives errors around 3.4*EPSILON(x).

At one time it was common to calculate log() as a series in terms of (x-sqrt(2.))/(x+sqrt(2.)), which fails tests of accuracy for x in the neighborhood of 1. A good implementation yields 99.9% correctly rounded results there, without requiring special case treatment.

Single extended precision should be faster than double, without giving up accuracy. 85% correctly rounded results should be expected. For best results, single precision log() should produce a double precision result accurate to at least 3 extra (11 total) decimal places, so that calculations such as exp(log(x)/p) do not lose accuracy. That, of course, is not a standard FORTRAN function, but it is easily done on a number of systems, including g77 with the standard f2c-compatible library interface.

The Intel '387 intrinsic log() works extremely well when compiled directly in line. Contrary to the Intel documents, log10() does not appear to use a 66-bit precision conversion, and the result shows occasional errors in the 2 low order bits, consistent with the discussion above. Calculations such as 2**(log2(x)*y) invoke 67-bit precision (according to Intel) and produce better accuracy than exp(log(x)*y).

MATMUL

As pointed out above, multi-level loop unrolling is needed for good performance, including the case where one of the arguments is rank 1 or "nearly so" (far from square). It should be possible to come up with source code which will perform well on typical Instruction Level Parallel architectures, as long as cache dependencies don't come into play. No doubt there area number of special cases which need to be considered simply based on whether the size in one of the dimensions is small. In a given application, it may well be feasible to use CONTAINS with a version appropriate for that position.

Source code replacement for MATMUL requires one of the the f90-style interfaces: internal function, included in module, or explicit interface block. These require the specialization to one of the 3 forms which MATMUL is permitted to take: rank 1 times rank 2, rank 2 times rank 2, rank 2 times rank 1. The module scheme should allow implementation of generics, where a separate function is to be provided for each combination of KIND and rank. This would allow the programmer to make a more suitable trade between accuracy and speed than the compiler vendor. Situations occur where one compiler runs correctly with source code replacement but not with the intrinsic MATMUL, and another works only the other way around, with the same code.

As industry-standard benchmarks do not yet use f90-style code, there seems not to have been enough incentive to develop these features to a useful state of maturity.

TAN

Some form of extended precision argument reduction is useful. One of the more successful (and slower) is shown in the netlib/cephes implementation. Prior to f77, it was common to use sin(x)/cos(x), which is slow and less accurate. Accurate results for small x depend on using a form such as tan(x)= x + x**3*p(x)/q(x). Contrary to what some think, there should be no infinities, as the argument cannot come much closer than EPSILON to a pole. As mentioned above, TAN() is simple enough to be expanded in line even on architectures which don't have firmware trig functions, although that is seldom more than academic.

Intel's '387 built-in tan() gives good results when compiled separately, but has not worked correctly for this author when compiled in line. '287 compatible code will be slower, as the original instruction, a "partial" tan(), required external swapping of data according to sub-range.

TANH

While a rational approximation for small arguments seems logical, the series approximation has a better behavior (monotonically decreasing terms), and thus better accuracy. The speed trade-off between the large number of terms in a series and the slowness of division is close on many architectures. For 128-bit precision, the rational approximation is faster.

For arguments of magnitude > 5./8., the form

tanh(x) = 1 - 2/(exp(x+x)+1)

works well.