FORTRAN Performance Tuning co-guide

Copyright (C) 1998 Timothy C. Prince

Freely distributable with acknowledgment


Chapter 4
Looping Constructs

FORTRAN, more than any other programming language, emphasizes the counting DO loop where the number of iterations is determined exactly by the time the loop begins, with the only possible exception being a control transfer out of the loop. Many compilers exert loop optimizations only for counted loops without EXITs.

In many cases, f90 allows exactly equivalent expression by array assignment e.g.

x(1:n-1) = y(2:n) - y(1:n-1)

or by the DO loop of earlier FORTRAN versions. With the better compilers, there is no difference in efficiency of generated code; however, this appears to be a deceptively difficult case for optimization. With less intelligent compilers, the following code may work better on almost any typical scalar register architecture:

temp = y(1)
do i=1,n-1

enddo

or it may even be necessary to use another scalar temporary to make sure that the compiler doesn't protect against the possibility that x(i) may overlap with (and destroy the contents of) y(i+1). Even with the fanciest out-of-order execution hardware, if the compiler does not recognize that each operand is used twice, the generated code will take 50% longer to run than optimum. The advantage of a good compiler here is the possibility of generating optimum code from a concise expression of the algorithm, saving a lot of visual clutter and possible problems. We should point out that writing code to avoid repeated memory access pays off only in unusual situations. What's unusual in this situation is that even the optimized code has 2 memory accesses per floating point operation, while it is usual for current architectures to slow down when there is more than 1 memory access per FLOP.

As a more general type of loop iteration is needed from time to time, most compilers have long provided a DO WHILE...END DO loop. The f90 form DO...ENDDO is exactly equivalent to the uglier but more commonly available form DOWHILE(.TRUE.)...ENDDO. A good compiler (even g77) will allow for a loop to be compiled with a test at the bottom only (so that it always executes at least once) using a template of the form

done = .false.
dowhile(.not. done)

enddo

as the standard optimization converts the general DO WHILE(condition) to effectively:

if(condition)then

endif

This cuts the number of branches nearly in half, counting those which are conditional and not taken.
 
 
 
 
 



Loop Fusion and Fission

Fusion is common terminology for combining 2 or more DO loops into one.  Fission is splitting a loop into simpler ones. There is historical baggage here.

In the early days of vector computers, compilers recognized vectorizable patterns only for loops which performed operations corresponding to the vector instructions of the target architecture. Compilers had to become capable of splitting out vector operations.

Meanwhile, a tradition was developed of writing source code with as many separate vectorizable loops as possible. At that time, even architectures which had multiple floating point registers had no more than 8 of them, and a dominant architecture (VAX) had 4 double width registers, so the advantage which could be taken of multiple data use was limited.

When pipelined architectures began to penetrate the affordable computer market, beginning with Reduced Instruction Set chips such as SPARC and MIPS, it was natural to look for strategies to take advantage of compiler technologies carried over from super computers. As certain architectures were short of registers, loop fission strategies often remained useful.  To emulate a vector chained loop with a chain of latencies amounting to perhaps 16 instruction cycles, and unroll that loop 4 times, to keep pipelines full, could occupy a lot of registers.
 
Array assignment syntax gives renewed emphasis to the need for compilers  to fuse loops. In this author's opinion, compilers "should" evaluate fusion for consecutive loops which are no more difficult to deal with than consecutive array assignments; otherwise, array assignment syntax carries a penalty in both generated code size and execution speed. Situations where the evaluation is more difficult should be dealt with in the source code.

Livermore Loops Kernel 14 is an example of code which was written with separate loops, presumably to facilitate partial vectorization. All modern compilers do better with the loops fused. Compilers which perform fusion automatically cannot be expected to fuse the final loop which is of the form

do k=1,n

enddo

because results may be affected by changing the order of evaluation.  Occasionally, a result of the previous iteration is read back and modified.  When compilers see such a potential dependency, they tend to shut off optimizations, rather than continuing to analyze which transformations could still be performed. The latencies associated with waiting for the previous iteration result can be used effectively to begin subsequent iterations, when the loops are fused. This loop is full of situations where individual index array elements (such as ir(k) ) must be kept in registers for efficient execution. g77 (as of version 0.5.21) does not perform this optimization, which could be a hold-over from the time when benchmarks were written entirely in f66.

In this author's experience, when there are 63 or more floating point registers, plus 31 or more integer/pointer registers, as with single precision on the Sun SPARC and HP-PA architectures, it is advantageous to go all out for loop fusion, even when there are separable loops which do not share data. This raises issues in that the same techniques may not work to optimize both single and double precision code. This observation is supported by the architecture designers' decision to introduce shadow registers and hardware remapping, even for architectures with 31 programmable floating point registers.

The reasons for combining as many operations as possible are cogent, and some are quite traditional.

1. Loop control overhead is shared among a greater number of operations, including the startup time needed to prepare to execute a loop and fill the pipelines. This can be as large with recent RISC architectures as it was with the better vector architectures.

2. Advantage can be taken of sharing to keep data at a higher level in the memory hierarchy, whether in registers or in cache. Fusion usually reduces the number of register reloads and cache misses, provided that it does not cause register spills.

3. Reduced need for unrolling and cache blocking greatly reduces expansion of generated code. Code generation does not need to be tuned to loop count or to optimize local speed against adverse side effects caused by instruction and data cache hogging.

4. Unused instruction slots left open by latencies in one sequence of operations can be used to complete other operations in parallel. Instruction Level Parallelism is highly dependent on this. Where there is a recursive dependency (e.g. sum accumulation), pipelines may be kept full more effectively this way, and it is even possible for fused loops to execute as fast as the slowest of the individual loops would.

5. Opportunities are increased for out-of-order execution to find useful work to do.

Contrary indications, which make split loops more optimal, also may be in effect.

1. Fusion would be done at the expense of less effective cache data management. For instance, it might require operations to use an unfavorable stride (operate on non-consecutive memory locations), or it may place unrelated blocks of data in cache simultaneously.

2. Register spills may be incurred. When there are no common data elements, this will be detrimental. A spill involves multiple memory access, and so is much worse than failure to optimize repeated array element access.  Compilers which are designed for a family of architectures which may still include models without shadow register remapping and out-of-order execution may generate register spills where they are unnecessary.

3. Code with branches may interfere with the pipelining of other operations, which could be done effectively in a separate loop. Systems which have effective predication strategies may pipeline well where others do not, but extreme predication may spend too much time calculating results which end up being discarded, or may incur register spills.

You may notice that the case where the same data are stored in multiple places is not a case for fission, in spite of the way many compilers operate.

Where compilers perform fission, the result may leave some loops too small for effective pipelining. For this reason, some compilers are intended to perform fusion before fission. This process is helped by organizing the source code with fused loops.
 
 
 
 
 



Promoting Loop Fusion

Loop Renesting Interchange

On cached architectures, there is usually an advantage in organizing nested loops so that array access strides are minimized. With FORTRAN, this means looping over the first subscript in the innermost loop, or on the last subscript in the outermost loop. More often than not, this arranges things so that the outer loops of consecutive loop nests may be combined. In such cases, array assignment syntax should be confined to the inner loop, or, where there is clearly a need for 2 level unrolling, to the 2 innermost loops.

2 level unrolling may include a case where one of the loops is so short that it should be unrolled completely:

x(1:n,1:2) = y(1:n,1:2)

Livermore Kernel 21 is written thus (with apology for the failure of HTML processors to preserve new lines and indents):

do k=1,25

enddo

which is a poor order of evaluation for cached architectures. It may have been a good one for speed on vector architectures without cache. If n > 25, this makes the longest loop the inner one, and places the sum accumulation recurrence in the outermost loop. Such cases make it mandatory for compiler optimizers to perform automatic re-nesting. In this case, the best "solution" might appear to be to use the intrinsic function MATMUL(), as the original code makes no attempt anyway to optimize either speed or accuracy. If this loop nest appeared along with another loop of close to the same length over any of the 3 subscripts, it might well be worth while to fuse those loops after re-nesting to make them outermost. We will return to this as an example of 2 level unrolling.

At this point, we might note that matrix multiplication of larger matrices is quite unusual, and that small matrices like (3,3) or (4,4) abound in certain applications, so it is not generally useful to optimize operation on large square matrices (with cache blocking, for instance) at the expense of performance on repeated small matrix operations.

Fusible Array Assignments: Varying Length

Lengthened array assignments appear to be preferable when they match adjacent assignments or otherwise simplify loop starting and ending logic, where extra elements can be corrected afterwards ("compensation code"). For example,

a(:,j)=(/ a(1:k-1,j)-a(1:k-1,k)*c,c,a(k+1:nn,j)-a(k+1:nn,k)*c /)

will not be as efficient as

a(:,j) = a(:,j) - a(:,k) * c
a(k,j) = c

and

rhs(k,1:jm-3)= h(i,2:jm-2,k)
t1(k,1:jm-3)= -bjj
t2(k,1:jm-3)= 1.+2.*bjj
t3(k,1:jm-3)= -bjj
t2(k,1:jm-3:jm-4)= 1.+bjj
t1(k,1)= 0.
t3(k,jm-3)= 0.

is more likely to produce efficient code than other versions. In each case, we set all the array elements according to the general formula, and follow up by over-writing the boundary elements with their special values.

If assignments of varying length are used, it seems advisable to place them in descending order of length:

rhs(k,1:jm-3)=
t3(k,1:jm-4)=
t1(k,2:jm-3)=
t2(k,2:jm-4)=

We hope to encourage generation of code such as:

j=1
do

enddo

although we know from experience with many f77 compilers that often it is best to write the old style counting DO version which has no if() inside the loop, and next best to have an if() block without exit. In the outer loops, of course, there should be no problem with conditional exit.

In principle, compilers ought to recognize loops where the last iteration terminates early, and treat them as well as loops where there is no EXIT, but that is not in the Kuck tradition.

The author has had frustrating experiences with f90 compilers which indicate that extreme conservatism or patient testing are required with such code. In particular, k should be set in an outer loop rather than turning this into matrix assignment syntax when that is not desired from a point of view of efficiency.
 
 
 
 
 



Outer Unrolling

Outer loop unrolling, sometimes called "unroll and jam" may be an effective way to accomplish several iterations of the outer loop in one pass through the inner loop, without spending as much time as separate iterations would require.

In this connection, it may be worth while to replace a loop containing a dot_product with a matmul. Compilers tend not to optimize such code, as the matrix multiply intrinsic fits many such situations, relieving the compiler of some of the need to know how to generate in line optimized code for multiple dot_product's.

The "jam" is a fusion of the unrolled outer loops into a large inner loop, with accompanying optimizations. As with the fusions discussed above, there may be advantage taken of common expressions, or fill in of empty instruction slots left by latencies in the basic inner loop. Loop startup time is spread over more operations. The technique makes the nesting order of the loops less critical, and helps deal with situations where the various criteria indicate different nesting orders.

Outer unrolling leads to large expansion of generated code, particularly when the outer loop count may not be evenly divisible, so it should be confined to time critical code segments. In the case of evenly divisible loop counts known at compile time, outer unrolling would be preferable in all respects over extra inner loop unrolling.

Livermore Kernel 4 might be rewritten

xz(k-1:k+m+m-1:m) = y(5) * (xz(k-1:k+m+m-1:m) - &
& matmul(y(5:n:5),reshape((/(xz(j+1:j+n/5),j=0,m+m,m)/), (/ n/5,3/))))
 

but anyway the desired effect (for non-vector architectures) is that the outer loop of count 3 should be fully unrolled and fused into the inner loop, avoiding use of dot_product.

It seems difficult to get satisfactory results from matmul. In the case above, with a trailing matrix dimension of 3, this is explained partly by the work of copying data into dynamic storage being larger than thework which might be saved by recognizing re-use of y(). Heavy use of dynamic storage is indicated by the appearance of xz() in ways where the compiler does not know whether it is safe to store intermediate results immediately, as well as the use of RESHAPE. However, if MATMUL is implemented simply as a series of DOT_PRODUCT's, none of the desired advantage in efficiency is achieved.

If matmul is not used for Kernel 21, the inner loop may be written

do k= 1,25

enddo

By the use of outer unrolling, one of each 4 array accesses (and promotion to double, if necessary) is eliminated. Also, the latencies of the additions are brought into balance with the rate of memory access, if, as on several current architectures, at most 3 memory accesses may be performed during the latency cycles of double addition or fused multiply and add, without the use of interleaved sums which increase loop startup time.

In James Van Buskirk's Crout matrix inversion, one of the major loop nests has the form

b(j,j+1:n) = b(n,n-j+1)*(b(j,j+1:n) - matmul(b(n-j+1:n-1,n-j+1),b(1:j-1,j+1:n)))

This has not shown satisfactory speed on systems tested until now, although it sometimes beats the posted code. Some third party software houses are claiming big speed improvements for their libraries. Most of the examples here show multiplication of a rank 1 by a rank 2 matrix, with the size of the second dimension of the rank 2 matrix sometimes small. No doubt these need special case treatment. Compiler libraries may be optimized for the 3x3 or 4x4 case. A possible check for this is to replace MATMUL() with an internal function; this fails with certain compilers.

The first loop nest in that subroutine has a form which could be done with matmul() by RESHAPEing a diagonal band section of the matrix, followed by a loop which subtracts the result from the next diagonal.

temp(j:n) = matmul(b(1:j-1,j),reshape( (/ (b(n-i+1:n+j-i-1,n-i+1), i=j,n)/), (/ j-1,n-j+1 /)))
forall( i = j, n) b(n+j-i,n-i+1) = b(n+j-i,n-i+1) - temp(i)
 

The original form is more concise, and involves no extra data movement, but carries fewer optimization hints. On systems tested so far, this matmul() is very slow, or RESHAPE fails. It's difficult to determine whether RESHAPE is trying to copy the data into a more compact dynamically allocated region, and whether that is desirable on a given architecture.

All dot products in that implementation have been arranged to use unity stride, showing to great advantage on certain systems. The fact that they are all rectangular or triangular matrix-vector products may be taken as a warning that optimum speed cannot be expected from dot_product.

Putting an example given above in the context of its outer loop,

forall(j=1:n, j /= k)

endforall

we find that some systems gain significantly with outer unrolling, and others do not. No f90 compiler we have seen will unroll this automatically, even when written in f77 subset syntax, but, in order to leave system dependent optimization to the compiler, this facility is important. There is a potential speed gain of 20% for outer unrolling by a factor 2, according to elimination of duplicate memory reference, and a further gain from reduced looping overhead. When tiling for cache optimization is employed, that also involves outer unrolling on some systems.

If the problem is small enough, or properly tiled, the vector a(:,k) will be read from fast cache every time after the first. The other vector will go into the slower levels of memory hierarchy. Thus, the outer unrolling strategy is more likely to be useful on a system with a large single level cache, where the duplicate memory accesses which are eliminated by unrolling would have cost as much as the ones which remain. Outer unrolling also could reduce the effective capacity of a small cache.

Splitting the outer loop into sections running from 1 to k-1 and from k+1 to n, as in the official version, facilitates outer unrolling, without being detrimental when outer unrolling does not occur.

This program falls to the temptation to use array assignment syntax to swap vectors. For instance,

b(n - maxlok+1:j:j-n+maxlok - 1,i) = b(j:n - maxlok+1:n - j - maxlok+1,i)

performs the first swap. Unless the compiler recognizes this as a swap which can be accomplished using 2 registers, this way of writing it cannot be as fast on a scalar architecture as the old f77 way, even if the allocation of a temporary vector does not cost time.

The final rotational swap of 3 vectors can be accomplished in a single DO loop, using 1 or 2 scalar temporaries, but requires 2 array assignments with an explicit temporary vector and 2 DO loops in the official version. Even on a vector architecture, it appears more feasible to optimize the single DO loop.

Livermore Kernel 18 is adapted to matrix assignment syntax, encouraging the compiler to unroll over both subscripts. After re-grouping the expressions as suggested above to increase local parallelism, the first such assignment becomes:

za(2:n,2:6)= (zr(2:n,2:6)+zr(1:n-1,2:6))/(zm(1:n-1,2:6)+ zm(1:n-1,3:7))*&
&(zp(1:n-1,3:7)- zp(1:n-1,2:6)+(zq(1:n-1,3:7)-zq(1:n-1,2:6)))

With such complicated expressions, unrolling by 2 on each subscriptis likely to be sufficient. Incidentally, the first 2 loop nests are fusibleon j, the inner loop index, but that is not necessarily useful, as interchanging the loops to make j outer loses more than will be gained by fusion.

Outer unrolling on triangular matrices becomes more awkward, but still important for good performance. Kernel 6 may be treated in a similar way to 21, giving something like

do i= 2,n,2

enddo

except that the dot_product expressions need to be expanded and fused so as to take advantage of the common w() array.
 
 
 
 
 



False Recursion

Occasionally, situations arise where an array is being over-written, where normal compiler analysis fails to determine whether the order of execution may affect the results. The result will be inability to recognize common expressions or parallelism available to optimize generated code. For Livermore Kernel 2, we write out the elimination of duplicate memory references:

temp= x(ipnt+1
do k= ipnt+2,ipntp,2

enddo

We spend less time copying x(k+1) than the compiler would spend re-reading it to guard against the false possibility that x(i) has over-written it. The better compilers will eliminate these copy operations from unrolled loops.
 
Array assignment (not accepted by all compilers) reads

x(ipntp+2:ipntp+ii+1) = x(ipnt+2:ipntp:2)-v(ipnt+2:ipntp:2)*x(ipnt+1:ipntp-1:2)-v(ipnt+3:ipntp+1:2)*&
    & x(ipnt+3:ipntp+1:2)

showing more clearly the non-overlap of input and result arrays, in case a compiler were prepared to take advantage and generate code without temporary array allocation.
 



Loop Recursion

While recursion which is written in a loop has much less execution overhead than functional recursion, it seems to have fascinated the authors of the Livermore Kernels, which include many of the commonly occurring situations where each iteration of a loop depends on results of the previous iteration.  Not only did these prevent vectorization in the LK heyday, they require special treatment for good performance on current pipelined architectures.

As mentioned above, dot_product is a mild form of recursion which, in principle, should have its performance problems solved automatically by use of matmul() when that is possible. Since the matmul() option is available, it may be reasonable for dot_product to be implemented on the assumption that the only way to gain parallelism is to split the sum into partial sums which may be evaluated in parallel. This may improve accuracy, but all that is guaranteed is that the numerical results will differ from sequential evaluation, unless extended precision accumulation is used. The performance gain from this interleaving of sums may be disappointing. Supposing that a dot product of vectors of size 25 is evaluated with 2 partial sums, this will take 13 iterations of an unrolled loop, plus 2 iterations winding up by adding partial sums and remainder. So the speed-up over sequential code can be at most  66%, on architectures which show a typical speed-up of 100% on fully pipelined code.

The first recursive Livermore Kernel is a plain dot product, with the accuracy evaluation showing a benefit from extended precision accumulation. The single precision version should be written either f77 style

sum=0
do k= 1,n

enddo
q=sum

or f90 style

q=dot_product(dble(z(1:n),dble(x(1:n))))

Unfortunately, this implies that 2 temporary arrays will be used, an extremely inefficient process.

Several modern architectures are designed to implement the dprod() extended precision efficiently.

Livermore Kernel 5 is, as advertised, a typical situation in the solution of tri-diagonal matrix equations, which arise in cubic spline curve fitting as well as in one-dimensional transient heat conduction and 2-d boundary layer. The recommended implementation:

temp= x(1)
do i= 2,n

enddo

The theme of the extended precision register variable is carried over from the dot product. This situation receives special recognition from some of the better compilers, which promote multiply-add chaining across the loop edge. On some architectures, optimized code performs each multiplication twice, once to produce the array element, and again to chain over to the next iteration. This treatment may be counter-productive in contexts where other operations are available to be performed in parallel.

Outer unrolling to take advantage of the matrix-vector multiplication of Kernels 4 and 6 has been covered above. One has the impression that the run-time builders have not worked out the means of implementation of generally fast and accurate MATMUL(). This function probably needs several internal versions of multi-level unrolling so that it does not get caught with short loops.

Kernel 11, the calculation of partial sums of the element of a vector, carries on the extended precision register variable accumulator theme. Most architectures are designed with a low enough ratio of addition latency to memory access rate that it is not necessary to interleave sums, as would be expected from an implementation of the SUM function.

Kernel 17 has a more complicated situation involving a recursion variable, where it's still worth while to separate storing to and reading back from memory from the critical latency path. This author sees little value in testing the ability of a compiler to recognize a disguised plain DO loop, as there is no need to use unstructured code here.

Kernel 20 has a yet more involved recursive chain of latencies. In accordance with a recommendation made above, the clipping comparisons are altered to pipeline them. The inaccurate scheme used to avoid repeated division in the original is removed.

Kernel 23 is speeded up by arranging for better multiply-add chaining and delaying the need for the recursion result from the previous iterationas long as possible without the reduction in accuracy which would be incurred by abandoning the factored form of the final term. As this kernel is recursive in both subscripts, outer unrolling is not a useful solution. Delay one recursion or the other, according to which you intend to make the inner loop.

Kernel 24 is a MINLOC type of operation which again is speeded up by arranging that the critical latency  path does not involve memory reference.  The original code is stereotypical enough that compilers have been setup to recognize it and substitute a BLAS equivalent of the MINLOC type of call; this author has not been seen as good results from that as from the following straightforward code. The loop code should interleave automatically if that is desirable, but out-of-order execution, branch prediction, and register re-mapping all save the compiler from having to do as much work:

m= 1
temp= x(1)
do k= 2,n

enddo

There is no good reason why the code

minres = minloc(x(1:n))
m = minres(1)

should not run nearly as fast, and it has an advantage of conciseness. Common variations like maxloc(abs(x)) must be written out in loops to avoid the inefficiency of storing temporary arrays.

Another type of loop recursion, not represented in standard benchmarks like Livermore Loops, is one which incorporates a special case of the first iteration by setting up the conditions which are peculiar to the first iteration prior to entering the loop, and the conditions for subsequent iterations at the end of the loop. Certain compilers recognize this by peeling off the first 1 or 2 iterations, leaving the remainder more amenable to "vectorization."  An example might be a periodic spline, where the last interval wraps around to the first. A simple example, taking differences of a circular vector:

im1=n
do i=1,n

enddo

This poses little problem for optimization on modern pipelined architectures; with unrolling, the repeated memory references are easily dealt with. Peeling off the first iteration appears to gain nothing except freeing up one integer or pointer register.

Compilers are less likely to peel off the last iteration, although that may allow the removal of a time-consuming conditional from the loop body, or resolve the anomaly of failing to optimize a loop which exits early on the last pass.

If there are 2 iterations of recursion:

im2=n-1
im1=n
do i=1,n

enddo

it's more important to make the loop boundary recursion of the indexing type, as shown. Peeling would make if blocks of the first 2 iterations of the loop, followed by a loop from 3 to n from which the recursion has been removed. Peeling increases loop startup time, so it shouldn't be applied if that time cannot be recovered in a faster loop body.

A likely alternative way to write the above:

im2=n-1
im1=n
df=f(im1)-f(im2)
do i=1,n

enddo

The suggestion that the subtraction should be repeated may make more sense if you consider that a register move may be simply an add of a zero register, and that the "extra" subtraction here can proceed in parallel with and complete sooner than the other operations. The compiler may decide to do it differently, possibly better.

Either way, there are possible recursive latencies due to copying between registers which may be eliminated either by unrolling and register renaming or hardware register remapping.

Sometimes, this treatment may be used to extend a loop, adding a dummy first iteration whose results are over-written by the second, so that the loop may be fused with another. This would happen if these examples were started with im1=1.

Watch out for over-expansion of the peeled code; if an outer loop is distributed into it, it may be unrolled or invariant-if analysis may be applied, all counter-productive.
 
 
 
 
 



Invariant-if Loop Distribution

Generally, there is a speed advantage in moving loops inside if blocks, where the condition is loop-invariant, producing a separate loop for each case. Only the loops which are to be unrolled should be moved inside. Automatic compiler invariant-if analysis tends to overdo it.