William N. Celmaster

Modern Fortran Revived as the Language of Scientific Parallel Computing

New features of Fortran are changing the way in which scientists are writing and maintaining large analytic codes. Further, a number of these new features make it easier for compilers to generate highly optimized architecture-specific codes. Among the most exciting kinds of architecture-specific optimizations are those having to do with parallelism. This paper describes Fortran 90 and the standardized language extensions for both shared-memory and distributed-memory parallelism. In particular, three case studies are examined, showing how the distributed-memory extensions (High Performance Fortran) are used both for data parallel algorithms and for single-program multiple-data algorithms.

A Brief History of Fortran

The Fortran (FORmula TRANslating) computer language was the result of a project begun by John Backus at IBM in 1954. The goal of this project was to provide a way for programmers to express mathematical formulas through a formalism that computers could translate into machine instructions. Initially there was a great deal of skepticism about the efficacy of such a scheme. "How," the scientists asked, "would anyone be able to tolerate the inefficiencies that would result from compiled code?" But, as it turned out, the first compilers were surprisingly good, and programmers were able, for the first time, to express mathematics in a high-level computer language.

Fortran has evolved continually over the years in response to the needs of users, particularly in the areas of mathematical expressivity, program maintainability, hardware control (such as I/O), and, of course, code optimizations. In the meantime, other languages such as C and C++ have been designed to better meet the nonmathematical aspects of software design, such as graphical interfaces and complex logical layouts. These languages have caught on and have gradually begun to erode the scientific/engineering Fortran code base. By the 1980s, pronouncements of the "death of Fortran" prompted language designers to propose extensions to Fortran that would incorporate the best features of other high-level languages and, in addition, provide new levels of mathematical expressivity popular on supercomputers such as the CYBER 205 and the CRAY systems. This language became standardized as Fortran 90 (ISO/IEC 1539:1991; ANSI X3.198-1992). At the present time, Fortran 95, which includes many of the parallelization features of High Performance Fortran discussed later in this paper, is in the final stages of standardization.

It is not yet clear whether the modernization of Fortran can, of itself, stem the C tide. However, I will demonstrate in this paper that modern Fortran is a viable mainstream language for parallelism. It is true that parallelism is not yet part of the scientific programming mainstream. However, it seems likely that, with the scientists' never-ending thirst for affordable performance, parallelism will become much more common, especially now that appropriate standards have evolved. Just as early Fortran enabled average scientists and engineers to program the computers of the 1960s, modern Fortran may enable average scientists and engineers to program parallel computers of the next decade.
An Introduction to Fortran 90

Fortran 90 introduces some important capabilities in mathematical expressivity through a wealth of natural constructs for manipulating arrays.1 In addition, Fortran 90 incorporates modern control constructs and up-to-date features for data abstraction and data hiding. Some of these constructs, for example, DO WHILE, although not part of FORTRAN 77, are already part of the de facto Fortran standard as provided, for example, with DEC Fortran.

Among the key new features of Fortran 90 are the following:

■ Inclusion of all of FORTRAN 77, so users can compile their FORTRAN 77 codes without modification
■ Permissibility of free-form source code, so programmers can use long (i.e., meaningful) variable names and are not restricted to begin statements in column 7
■ Modern control structures like CASE and DO WHILE, so programmers can take advantage of structured programming constructs
■ Extended control of numeric precision, for architecture independence
■ Array processing extensions, for more easily expressing array operations and also for expressing independence of element operations
■ Dynamic memory allocation and deallocation, for memory management
■ Pointers, for more flexible control of data placement
■ Data structures, for data abstraction
■ User-defined types and operators, for data abstraction
■ Procedures and modules, to help programmers write reusable code
■ Stream character-oriented input/output features
■ New intrinsic functions

With these new features, a modern Fortran programmer can not only successfully compile and execute previous standards-compliant Fortran codes but also design better codes with

■ Dramatically simplified ways of doing dynamic memory management
■ Better modularity and therefore reusability
■ Better readability
■ Easier program maintenance

Additionally, of course, programmers have the assurance of complete portability between platforms and architectures.

The following code fragment illustrates the simplicity of dynamic memory allocation with Fortran 90. It also includes some of the new syntax for declaring variables, some examples of array manipulations, and an example of how to use the new intrinsic matrix multiplication function. In addition, the exclamation mark, which is used to begin comment statements, is a new Fortran 90 feature that was widely used in the past as an extension to FORTRAN 77.

      REAL, DIMENSION(:,:,:), &        ! NEW DECLARATION SYNTAX
     &   ALLOCATABLE :: GRID           ! DYNAMIC STORAGE
      REAL*8 A(4,4),B(4,4),C(4,4)      ! OLD DECLARATION SYNTAX
      READ *, N                        ! READ IN THE DIMENSION
      ALLOCATE(GRID(N+2,N+2,2))        ! ALLOCATE THE STORAGE
      GRID(:,:,1) = 1.0                ! ASSIGN PART OF ARRAY
      GRID(:,:,2) = 2.0                ! ASSIGN REST OF ARRAY
      A = GRID(1:4,1:4,1)              ! ASSIGNMENT
      B = GRID(2:5,1:4,2)              ! ASSIGNMENT
      C = MATMUL(A,B)                  ! MATRIX MULTIPLICATION

Some of the new features of Fortran 90 were introduced not only for simplified programming but also to permit better hardware-specific optimizations. For example, in Fortran 90, one can write the array assignment

      A = B + C

which in FORTRAN 77 would be written as

      DO 100 J = 1,N
         DO 200 I = 1,M
            A(I,J) = B(I,J) + C(I,J)
 200     END DO
 100  END DO

The Fortran 90 array assignment not only is more elegant but also permits the compiler to easily recognize that the individual element assignments are independent of one another. If the compiler were targeting a vector or parallel computer, it could generate code that exploits the architecture by taking advantage of this independence between iterations.

Of course, the particular DO loop shown above is simple enough that many compilers would recognize the independence of iterations and could therefore perform the architecture-specific optimizations without the aid of Fortran 90's new array constructs. But in general, many of the new features of Fortran 90 help compilers to perform architecture-specific optimizations. More important, these features help programmers express basic numerical algorithms in ways inherently more amenable to optimizations that take advantage of multiple arithmetic units.
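As a further illustration of the data abstraction features listed above, the short module below is a sketch that is not taken from the article; the module, type, and procedure names (GRID_MODULE, FIELD, NEW_FIELD, SCALE_FIELD) are invented for this example. It combines derived types, pointer-valued array components (the Fortran 90 mechanism for dynamic storage inside a derived type), module procedures, and whole-array assignment.

      MODULE GRID_MODULE
         IMPLICIT NONE
         TYPE FIELD                                  ! USER-DEFINED TYPE (DATA ABSTRACTION)
            REAL, DIMENSION(:,:), POINTER :: VALUES  ! DYNAMIC STORAGE VIA A POINTER COMPONENT
         END TYPE FIELD
      CONTAINS
         SUBROUTINE NEW_FIELD(F, N, M)               ! ALLOCATE AN N-BY-M FIELD
            TYPE(FIELD), INTENT(OUT) :: F
            INTEGER, INTENT(IN) :: N, M
            ALLOCATE(F%VALUES(N,M))
            F%VALUES = 0.0                           ! WHOLE-ARRAY ASSIGNMENT
         END SUBROUTINE NEW_FIELD
         SUBROUTINE SCALE_FIELD(F, FACTOR)           ! REUSABLE MODULE PROCEDURE
            TYPE(FIELD), INTENT(INOUT) :: F
            REAL, INTENT(IN) :: FACTOR
            F%VALUES = FACTOR * F%VALUES             ! ARRAY OPERATION, NO EXPLICIT DO LOOPS
         END SUBROUTINE SCALE_FIELD
      END MODULE GRID_MODULE

A caller would simply USE GRID_MODULE, declare a TYPE(FIELD) variable, and call these routines; the representation of the field stays hidden behind the type.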
A Brief History of Parallel Fortran: PCF and HPF

During the past ten years, two significant efforts have been undertaken to standardize parallel extensions to Fortran. The first of these was under the auspices of the Parallel Computing Forum (PCF) and targeted global-shared-memory architectures. The PCF effort was directed to control parallelism, with little attention to language features for managing data locality. The 1991 PCF standard established an approach to shared-memory extensions of Fortran and also established an interim syntax. These extensions were later somewhat modified and incorporated in the standard extensions now known as ANSI X3H5.

At about the time the ANSI X3H5 standard was adopted, another standardization committee began work on extending Fortran 90 for distributed-memory architectures, with the goal of providing a language suitable for scalable computing. This committee became known as the High Performance Fortran Forum and produced in 1993 the High Performance Fortran (HPF) language specification.2 The HPF programming-model target was data parallelism, and many data placement directives are provided for the programmer to optimize data locality. In addition, HPF includes ways to specify a more general style of single-program multiple-data (SPMD) execution in which separate processors can independently work on different parts of the code. This SPMD specification is formalized in such a way as to make the resulting code far more maintainable than previous message-passing-library ways of specifying SPMD distributed parallelism.

Can HPF and PCF extensions be used together in the same Fortran 90 code? Sure. But the PCF specification has lots of "user-beware" warnings about the correct usage of the PARALLEL REGION construct, and the HPF specification has lots of warnings about the correct usage of the EXTRINSIC(HPF_LOCAL) construct. So as you can see, there are times when a programmer had better be very knowledgeable if she or he wants to write a mixed HPF/PCF code. Digital's products support both the PCF and HPF extensions. The HPF extensions are supported as part of the DEC Fortran 90 compiler, and the PCF extensions are supported through Digital's KAP Fortran optimizer.3,4

Shared Memory Fortran Parallelism

The traditional discussions of parallel computing focus rather heavily on what is known as control parallelism. Namely, the application is analyzed in terms of the opportunities for parallel execution of various threads of control. The canonical example is a DO loop in which the individual iterations operate on independent data. Each iteration could, in principle, be executed simultaneously (provided of course that the hardware allows simultaneous access to instructions and data). Technology has evolved to the point at which compilers are often able to detect these kinds of parallelization opportunities and automatically decompose codes. Even when the compiler is not able to make this analysis, the programmer often is able to do so, perhaps after performing a few algorithmic modifications. It is then relatively easy to provide language constructs that the user can add to the program as parallelization hints to the compiler.

This kind of analysis is all well and good, provided that data can be accessed democratically and quickly by all processors. With modern hardware clocked at about 300 megahertz, this amounts to saying that memory latencies are lower than 100 nanoseconds, and memory bandwidths are greater than 100 megabytes per second. This characterizes today's single and symmetric multiprocessing (SMP) computers such as Digital's AlphaServer 8400 system, which comes with twelve 600-megaflop processors on a backplane with a bandwidth of close to 2 gigabytes per second.

In summary, the beauty of shared-memory parallelism is that the programmer does not need to worry too much about where the data is and can concentrate instead on the easier problem of control parallelism. In the simplest cases, the compiler can automatically decompose the problem without requiring any code modifications. For example, automatic decomposition for SMP systems of a code called, for example, cfd.f, can be done trivially with Digital's KAP optimizer by using the command line

      kf90 -fkapargs='-conc' cfd.f -o cfd.exe

As an example of guided automatic decomposition, the following shows how a KAP parallelization assertion can be included in the code. (Actually, the code segment below is so simple that the compiler can automatically detect the parallelism without the help of the assertion.)

C*$* ASSERT DO (CONCURRENT)
      DO 100 I = 4,N
         A(I) = B(I) + C(I)
 100  END DO
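The case in which the programmer knows more than the compiler can prove arises, for example, with indirect addressing. The sketch below is not from the article: the index array IDX is assumed (by the programmer) to contain no repeated values, so the writes to A can never collide, and the same KAP assertion can be used to convey that knowledge. Without the assertion, the possible aliasing through IDX would typically force the preprocessor to generate serial code.

C     ILLUSTRATIVE ONLY: IDX IS ASSUMED TO BE A PERMUTATION OF 1..N,
C     A FACT KNOWN TO THE PROGRAMMER BUT INVISIBLE TO THE COMPILER.
C*$* ASSERT DO (CONCURRENT)
      DO I = 1,N
         A(IDX(I)) = B(I) + C(I)
      END DO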
For explicit control of the parallelism, PCF directives can be used. In the example that follows, the KAP preprocessor form of the PCF directives is used to parallelize a loop.

C*KAP*PARALLEL REGION
C*KAP*&SHARED(A,B,C) LOCAL(I)
C*KAP*PARALLEL DO
      DO 10 I = 1,N
         A(I) = B(I) + C(I)
 10   CONTINUE
C*KAP*END PARALLEL REGION

Cluster Fortran Parallelism

High Performance Fortran V1.1 is currently the only language standard for distributed-memory parallel computing. The most significant way in which HPF extends Fortran 90 is through a rich family of data placement directives. There are also library routines and some extensions for control parallelism. HPF is the simplest way of parallelizing data-parallel applications on clusters (also known as "farms") of workstations and servers. Other methods of cluster parallelism, such as message passing, require more bookkeeping and are therefore less easy to express and less easy to maintain. In addition, during the past year, HPF has become widely available and is supported on the platforms of all major vendors.

HPF is often considered to be a data parallel language. That is, it facilitates parallelization of array-based algorithms in which the instruction stream can be described as a sequence of array manipulations, each of which is inherently parallel. What is less well known is that HPF also provides a powerful way of expressing the more general SPMD parallelism mentioned earlier. This kind of parallelism, often expressed with message-passing libraries such as MPI,5 is one in which individual processors can operate simultaneously on independent instruction streams and generally exchange data either by explicitly sharing memory or by exchanging messages. Three case studies follow which illustrate the data parallel and the SPMD styles of programming.

A One-dimensional Finite-difference Algorithm

Consider a simple one-dimensional grid problem (the most mind-bogglingly simple illustration of HPF in action) in which each grid value is updated as a linear combination of its (previous) nearest neighbors. For each interior grid index i, the update algorithm is

      Y(i) = X(i-1) + X(i+1) - 2*X(i)

In Fortran 90, the resulting DO loop can be expressed as a single array assignment. How would this be parallelized? The simplest way to imagine parallelization would be to partition the X and Y arrays into equal-size chunks, with one chunk on each processor. Each iteration could proceed simultaneously, and at the chunk boundaries, some communication would occur between processors. The HPF implementation of this idea is simply to add to the Fortran 90 code two data placement statements. One of these declares that the X array should be distributed into chunks, or blocks. The other declares that the Y array should be distributed such that the elements align to the same processors as the corresponding elements of the X array. The resultant code for arrays with 1,000 elements is as follows:

!HPF$ DISTRIBUTE X(BLOCK)
!HPF$ ALIGN Y WITH X
      REAL*8 X(1000), Y(1000)

      Y(2:999) = X(1:998) + X(3:1000) - 2 * X(2:999)

      END

The HPF compiler is responsible for generating all of the boundary-element communication code. The compiler is also responsible for determining the most even distribution of arrays. (If, for example, there were 13 processors, some chunks would be bigger than others.)

This simple example is useful not only as an illustration of the power of HPF but also as a way of pointing to one of the hazards of parallel algorithm development. Each of the element-updates involves three floating-point operations: an addition, a subtraction, and a multiplication. So, as an example, on a four-processor system, each processor would operate on 250 elements with 750 floating-point operations. In addition, each processor would be required to communicate one word of data for each of the two chunk boundaries. The time that each of these communications takes is known as the communications latency. Typical transmission control protocol/internet protocol (TCP/IP) network latencies are twenty thousand times (or more) longer than the time it typically takes a high-performance system to perform a floating-point operation. Thus even 750 floating-point operations are negligible compared with the time taken to communicate. In the above example, network parallelism would be a net loss, since the total execution time would be totally swamped by the network latency.

Of course, some communication mechanisms are of lower latency than TCP/IP networks. As an example, Digital's implementation of MEMORY CHANNEL cluster interconnect reduces the latency to less than 1000 floating-point operations (relative to the performance of, say, Digital's AlphaStation 600 5/300 system). For SMP, the latency is even smaller. In both cases, there may be a benefit to parallelism.
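For readers who want to see the fragment above as a complete program, here is a minimal sketch. It is not from the article; the program name FD1D, the linear-ramp initial data, and the printed check are assumptions added for illustration. Because the second difference of a linear ramp vanishes, the printed maximum should be zero no matter how many processors the program is compiled for.

      PROGRAM FD1D                        ! ILLUSTRATIVE NAME
      REAL*8 X(1000), Y(1000)
!HPF$ DISTRIBUTE X(BLOCK)
!HPF$ ALIGN Y WITH X
      INTEGER I
      X = (/ (DBLE(I), I = 1, 1000) /)    ! ASSUMED INITIAL DATA: A LINEAR RAMP
      Y = 0.0D0
      Y(2:999) = X(1:998) + X(3:1000) - 2 * X(2:999)
      PRINT *, 'MAX INTERIOR VALUE =', MAXVAL(ABS(Y(2:999)))
      END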
A Three-dimensional Red-Black Poisson Equation Solver

The example of a one-dimensional algorithm in the previous section can be easily generalized to a more realistic three-dimensional algorithm for solving the Poisson equation using a relaxation technique commonly known as the red-black method. The grid is partitioned into two colors, following a two-dimensional checkerboard arrangement. Each red grid element is updated based on the values of neighboring black elements. A similar array assignment can be written as in the previous example or, as shown in the partial code segment below, alternatively can use the HPF FORALL construct to express the assignments in a style similar to that for serial DO loops.

!HPF$ DISTRIBUTE(*,BLOCK,BLOCK) :: U,V
      FORALL (I=2:NX-1, J=2:NY-1:2, K=2:NZ-1:2)
         U(I,J,K) = FACTOR*(HSQ*F(I,J,K) +  &
                    U(I-1,J,K) + U(I+1,J,K) + &

The distribution directive lays out the array so that the first dimension is completely contained within a processor, with the other two dimensions block-distributed across processors in rectangular chunks. The red-black checkerboarding is performed along the second and third dimensions. Note also the Fortran 90 free-form syntax employed here, in which the ampersand is used as an end-of-line continuation statement.
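The article shows only the beginning of the update. Purely as an illustration, a possible completion is sketched below; the remaining four neighbor terms and the closing END FORALL are assumptions based on the usual six-point Poisson stencil, and the scalars FACTOR and HSQ, as in the fragment above, are assumed to have been defined elsewhere.

!     ILLUSTRATIVE COMPLETION; THE LAST FOUR NEIGHBOR TERMS ARE ASSUMED
!HPF$ DISTRIBUTE(*,BLOCK,BLOCK) :: U,V
      FORALL (I=2:NX-1, J=2:NY-1:2, K=2:NZ-1:2)
         U(I,J,K) = FACTOR*(HSQ*F(I,J,K) +     &
                    U(I-1,J,K) + U(I+1,J,K) +  &
                    U(I,J-1,K) + U(I,J+1,K) +  &
                    U(I,J,K-1) + U(I,J,K+1))
      END FORALL

A second, analogous FORALL over the complementary points would then update the black elements.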
In this example, the parallelism is similar to that of the one-dimensional finite-difference example. However, communication now occurs along the two-dimensional boundaries between blocks. The HPF compiler is responsible for these communications. Digital's Fortran 90 compiler performs several optimizations of those communications. First, it packages up all of the data that must be communicated into long vectors so that the start-up latency is effectively hidden. Second, the compiler creates so-called shadow edges (processor-local copies of nonlocal boundary edges) for the local arrays so as to minimize the effect of buffering of neighbor values. These kinds of optimizations can be extremely tedious to message-passing programmers, and one of the virtues of a high-level language like HPF is that the compiler can take care of the bookkeeping. Also, since the compiler can reliably do buffer-management bookkeeping (for example, ensuring that communication buffers do not overflow), the communications runtime library can be optimized to a far greater extent than one would normally expect from a user-safe message library. Indeed, Digital's HPF communications are performed using a proprietary optimized communications library, Digital's Parallel Software Environment.6

Communications and SPMD Programming with HPF

Since HPF can be used to place data, it stands to reason that communication can be forced between processors. The beauty of HPF is that all of this can be done in the context of mathematics rather than in the context of distributed parallel programming. The code fragment in Figure 1 illustrates how this is done. On two processors, the two columns of the U and V arrays are each on different processors; thus the array assignment causes one of those columns to be moved to the other processor. This kind of an operation begins to provide programmers with explicit ways to control data communication and therefore to more explicitly manage the association of data and operations to processors. Notice that the programmer need not be explicit about the parallelism. In fact, scientists and engineers rarely want to express parallelism. In typical message-passing programs, the messages often express communication of vector and array information.

!HPF$ DISTRIBUTE(*,BLOCK) :: U
!HPF$ ALIGN V WITH U
      REAL*8 U(N,2),V(N,2)

      V(:,1) = U(:,2)    ! MOVE A VECTOR BETWEEN PROCESSORS

Figure 1
Code Example Showing Control of Data Communication without Expression of Parallelism

However, despite the fervent hopes of programmers, there are times when a parallel algorithm can be expressed most simply as a collection of individual instruction streams operating on local data. This SPMD style of programming can be expressed in HPF with the EXTRINSIC(HPF_LOCAL) declaration, as illustrated by continuing the above code segment as shown in Figure 2.

      CALL CFD(V)    ! DO LOCAL WORK ON THE LOCAL PART OF V

      EXTRINSIC(HPF_LOCAL) SUBROUTINE CFD(VLOCAL)
      REAL*8, DIMENSION(:,:) :: VLOCAL
!HPF$ DISTRIBUTE *(*,BLOCK) :: VLOCAL
      END

Figure 2
Code Example of Parallel Algorithm Expressed as Collection of Instruction Streams

Because the subroutine CFD is declared to be EXTRINSIC(HPF_LOCAL), the HPF compiler executes that routine independently on each processor (or more generally, the execution is done once per peer process), operating on routine-local data. As for the array argument, V, which is passed to the CFD routine, each processor operates only on its local slice of that array. In the specific example above on two processors, the first one operates on the first column of V and the second one operates on the second column of V.

It is important to mention here that, although HPF permits, and even encourages, SPMD programming, the more popular method at this time is the message-passing technique embodied in, for example, the PVM7 and MPI5 libraries. These libraries can be invoked from Fortran, and can also be used in conjunction with EXTRINSIC(HPF_LOCAL) subroutines.
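As a purely illustrative continuation of Figure 2 (the body below is not in the article; the local column count and local checksum are assumptions chosen to emphasize that only routine-local data is touched), each peer might inspect its own slice as follows. In the two-processor example above, each peer would report exactly one column.

      EXTRINSIC(HPF_LOCAL) SUBROUTINE CFD(VLOCAL)
      REAL*8, DIMENSION(:,:) :: VLOCAL
!HPF$ DISTRIBUTE *(*,BLOCK) :: VLOCAL
      INTEGER NLOC
      NLOC = SIZE(VLOCAL,2)                ! NUMBER OF COLUMNS HELD BY THIS PEER
      PRINT *, 'LOCAL COLUMNS =', NLOC, ' LOCAL SUM =', SUM(VLOCAL)
      END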
Clusters of SMP Systems

During these last few years of the second millennium, we are witnessing the emergence of systems that consist of clusters of shared-memory SMP computers. This exciting development is the logical result of the exponential increase in performance of mid-priced ($100K to $1000K) systems.

There are two natural ways of writing parallel Fortran programs for clusters of SMP systems. The easiest way is to use HPF and to target the total number of processors. So, for example, if there were two SMP systems, each with four processors, one would compile the HPF program for eight processors (more generally, for eight peers). If the program contained, for instance, block-distribution directives, the affected arrays would be split up into eight chunks of contiguous array sections.

The second way of writing parallel Fortran programs for clustered SMP systems is to use HPF to target the total number of SMP machines and then to use PCF (or more generally, shared-memory extensions) to achieve parallelism locally on each of the SMP machines. For example, one might write

!HPF$ DISTRIBUTE (*,BLOCK) :: V

      EXTRINSIC(HPF_LOCAL) SUBROUTINE CFD(V)

C*KAP*PARALLEL REGION

If the target system consisted of two SMP systems, each with four processors, and the above program was compiled for two peers, then the V array would be distributed into two chunks of columns, one chunk per SMP system. Then the routine, CFD, would be executed once per SMP system; and the PCF directives would, on each system, cause parallelism on four threads of execution.
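To make that combination concrete, the following expanded sketch is offered. It is not from the article; the array sizes, the loop body, and the directive clause lists are assumptions (the KAP clause syntax, in particular, may differ in detail). HPF distributes the columns of V across the peers (one peer per SMP system), and the KAP/PCF directives then spread the work on the local columns across the processors within each system.

      PROGRAM HYBRID                        ! ILLUSTRATIVE NAME
      INTEGER, PARAMETER :: N = 1000, NCOLS = 8
      REAL*8 V(N,NCOLS)
!HPF$ DISTRIBUTE (*,BLOCK) :: V             ! ONE BLOCK OF COLUMNS PER PEER
      V = 1.0D0
      CALL CFD(V)                           ! RUNS ONCE PER PEER, ON ITS LOCAL BLOCK
      PRINT *, 'GLOBAL SUM =', SUM(V)
      END

      EXTRINSIC(HPF_LOCAL) SUBROUTINE CFD(VLOCAL)
      REAL*8, DIMENSION(:,:) :: VLOCAL
      INTEGER I, J
C*KAP*PARALLEL REGION
C*KAP*&SHARED(VLOCAL) LOCAL(I,J)
C*KAP*PARALLEL DO
      DO J = 1, SIZE(VLOCAL,2)              ! LOCAL COLUMNS, SHARED AMONG THE SMP THREADS
         DO I = 1, SIZE(VLOCAL,1)
            VLOCAL(I,J) = 2.0D0*VLOCAL(I,J) ! ASSUMED LOCAL WORK
         END DO
      END DO
C*KAP*END PARALLEL REGION
      END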
It is unclear at this time whether there would ever be a practical reason for using a mix of HPF and PCF extensions. It might be tempting to think that there would be performance advantages associated with the local use of shared-memory parallelism. However, experience has shown that program performance tends to be restricted by the weakest link in the performance chain (an observation that has been enshrined as "Amdahl's Law"). In the case of clustered SMP systems, the weak link would be the inter-SMP communication and not the intra-SMP (shared-memory) communication. This casts some doubt on the worth of local communications optimizations. Experimentation will be necessary.

Whatever else one might say about parallelism, one thing is certain: The future will not be boring.

Summary

Fortran was developed and has continued to evolve as a computer language that is particularly suited to expressing mathematical formulas. Among the recent extensions to Fortran are a variety of constructs for the high-level manipulation of arrays. These constructs are especially amenable to parallel optimization. In addition, there are extensions (PCF) for explicit shared-memory parallelization and also data-parallel extensions (HPF) for cluster parallelism. The Digital Fortran compiler performs many interesting optimizations of codes written using HPF. These HPF codes are able to hide, without sacrificing performance, much of the tedium that otherwise accompanies cluster programming. Today, the most exciting frontier for Fortran is that of SMP clusters and other nonuniform-memory-access (NUMA) systems.

References

1. J. Adams et al., Fortran 90 Handbook (New York: McGraw-Hill, Inc., 1992).

2. "High Performance Fortran Language Specification," Scientific Programming, vol. 2: 1-170 (New York: John Wiley and Sons, Inc., 1993), and C. Koelbel et al., The High Performance Fortran Handbook (Boston: MIT Press, 1994).

3. J. Harris et al., "Compiling High Performance Fortran for Distributed-memory Systems," Digital Technical Journal, vol. 7, no. 3 (1995): 5-23.

4. R. Kuhn, B. Leasure, and S. Shah, "The KAP Parallelizer for DEC Fortran and DEC C Programs," Digital Technical Journal, vol. 6, no. 3 (1994): 57-70.

5. For example, see Proceedings of Supercomputing '93 (IEEE, November 1993): 878-883, and W. Gropp, E. Lusk, and A. Skjellum, Using MPI (Boston: MIT Press, 1994).

6. E. Benson et al., "Design of Digital's Parallel Software Environment," Digital Technical Journal, vol. 7, no. 3 (1995): 24-38.

7. For example, see A. Geist et al., PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Network Parallel Computing (Boston: MIT Press, 1994).

Biography

William N. Celmaster
Bill Celmaster has long been involved with high-performance computing, both as a scientist and as a computing consultant. Joining Digital from BBN in 1991, Bill managed the porting of major scientific and engineering applications to the DECmpp 12000 system. Now a member of Digital's High Performance Computing Expertise Center, he is responsible for parallel software demonstrations and performance characterization of Digital's high-performance systems. He has published numerous papers on parallel computing methods, as well as on topics in the field of physics. Bill received a bachelor of science degree in mathematics and physics from the University of British Columbia and a Ph.D. in physics from Harvard University.