M578 - Alexiades
          Steps to methodically parallelize your serial code
                    Master - Workers style
              read ALL of this page VERY carefully !

Basic ingredients:
  • serial code to be parallelized
  • Makefile for running commands with 'make', see below
  • machine with some version of MPI installed
  • for running on a cluster with a PBS scheduler you will also need a   PBSscript   file
  • Sooner or later you may want/need to consult some of these about MPI:
    tutorials, Open MPI, MPICH. The official standard is at the MPI Forum.

    Basic steps:   ...more details below...
  • organize, clean up, and debug your serial code, remove interactivity
  • parallelize your serial code... the hard part...
  • compile using a Makefile:   make compile
  • execute on your machine using nPROC processes:
        mpiexec -n  nPROC  ./code.x   < dat > o.out
  •   or   on a cluster (running PBS scheduler) via a script:   qsub PBSscript
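For orientation, a PBSscript is an ordinary shell script with #PBS directives at the top. The sketch below is illustrative only: the job name, resource lines, and the names code.x and dat are placeholders, and the exact nodes/ppn/walltime syntax varies by cluster, so check your cluster's documentation.

```shell
#!/bin/bash
#PBS -N mpijob             # job name (your choice)
#PBS -l nodes=1:ppn=5      # 1 node, 5 processes: 1 MR + 4 WRs
#PBS -l walltime=00:10:00  # wall-clock limit
#PBS -j oe                 # merge stdout and stderr into one file

cd $PBS_O_WORKDIR          # run from the directory you submitted from
mpiexec -n 5 ./code.x < dat > o.out
```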

    Some words of advice and caution:
  • ALWAYS keep backups: make a copy into a "try" dir and then modify.
  • Parallelize VERY carefully, cautiously, a little bit at a time. Use lots of print statements to see what's happening.
  • Think carefully... debugging is VERY hard, try to minimize it...
  • Once a routine is debugged, save it, back it up, and make a copy to modify further
      (Using a version-control tool, like 'svn', is recommended).
  • Insert plenty of comments indicating what is (supposed to be) happening... you'll be sorry if you don't...
  • Clarity and efficiency are paramount. Avoid coding tricks. Document what you are doing.
  • Parallelization will be only in the z-direction, so
       always choose Mz even and divisible by nWRs.

    Specific steps for parallelization : On your 1D code first !
    [Note on file names: I find it convenient to have all files of a code start with a common 
     prefix, like: z.main.f z.io.f ...., and all output files like  o.out o.prof o.hist ....]
    1. 1Dserial code:
      a. Place a copy of your serial 1D code in a directory, say, 1Dserial/ .
      b. Organize your code:  Split your serial code into separate files: z.main.f  z.io.f  z.setup.f  z.update.f
         with obvious contents: input/output in z.io.f, MESH/INIT in z.setup.f, FLUX/PDE in z.update.f
      c. Download the file  Makefile_serial.    Save As "Makefile" (must be PLAIN text file!). 
         Look inside it to see what a makefile looks like. It contains macros and directives for 'make'.
         Note that directives (like 'run:') are followed by a line starting with TAB  (not spaces!!!).
         Customize it (set names, compiler, ...).
      d. Try it:  make compile   (executable will be $(PROG).x)
                  make run       (it should run)
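For orientation, a stripped-down serial Makefile might look like the sketch below (the program name, compiler, and file list are placeholders; use the actual Makefile_serial from the course page). Note the TAB at the start of every recipe line:

```make
# Minimal sketch of a serial Makefile -- names are placeholders.
PROG = code
F77  = gfortran
OBJS = z.main.o z.io.o z.setup.o z.update.o

compile: $(OBJS)
	$(F77) -o $(PROG).x $(OBJS)

%.o: %.f
	$(F77) -c $<

run: compile
	./$(PROG).x < dat > o.out
```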
    2. 1Dparallel code:
      a. Copy (all files of) your 1Dserial code into a new dir, say, 1Dpar  or  lab6
      b. Comment out all subroutine/function calls. 
      c. Compile each file with:  COMPILER -c  z.*.f   and fix the worst problems.
      d. Make two copies of main.f:  mainMR.f, mainWR.f 
      e. In mainMR.f :  comment out what will NOT be done by MR. Compile.
         In mainWR.f :  comment out what will NOT be done by WR. Compile.
    3. Create a new  main.f  which: starts up MPI, calls MASTER() or WORKER() and shuts down MPI
       Here is a sample:
    -------------------------- main.f sample -------------------------------
          program main
          include 'mpif.h'		!!(include 'mpi.h' for C)
          ... (declare variables) ...
          call MPI_INIT( ierr )
          !....... nPROC is specified at mpirun or mpiexec, see Makefile....
          call MPI_COMM_SIZE( MPI_COMM_WORLD, nPROC, ierr )  !..returns nPROC
           mster = 0		! master gets rank=0
           nWRs  = nPROC - 1	! =number of workers
           !----------------- start 0, ... ,nWRs tasks ---------------!
           call MPI_COMM_RANK( MPI_COMM_WORLD, myID, ierr )  !..assigns myID
           IF( myID .EQ. mster ) THEN
                    tt0 = MPI_Wtime()       !...start wall-clock timer on MR
                call MASTER( nWRs, mster, ... )
                    tt1 = MPI_Wtime()       !...end timer
               print*,'>>main>> MR timing= ',tt1-tt0,' sec on ',nWRs,' WRs'
           ELSE
                call WORKER( nWRs, myID, ... )  !... now MPI is running ...!
                print*, 'Bye from WR:',myID,': ierr= ', ierr
              if( ierr .NE. 0 ) print*, '>>>> worker:',myID,' ended with ierr= ',ierr
           ENDIF
    !...termination: the only clean way to exit is this:
           call MPI_FINALIZE(ierr)
           end
    -------------------------------------------------------------------------
    4. Download   Makefile_parallel.   Save As  'Makefile'.
       Customize it for 1Dpar code (set names, compiler, ...).
    5. Parallelization strategy: Domain Decomposition along one direction only (z-direction)
    6. Start inserting MPI calls in  mainMR.f and corresponding in mainWR.f
       See the sample coding in "outlineMPI" and look up the syntax of MPI functions.
       Try your hardest to do it correctly the first time!!!
         (in C coding, remove the 'ierr' item from the arguments).
       Insert one MPI call, test it, fix it, then another,...
       Use a very coarse mesh (MM=4 or 8), and lots of print statements 
       to see what's happening (then comment out).
    7. Test/correct till it runs with nPROC=2, i.e. nWRs=1, on local machine, 
       (even though nothing is computed yet till step 10 below). 
       This is still essentially serial.
    8. Test/correct till it runs with nPROC=3, i.e. nWRs=2, on local machine. 
       This is now parallel !  Most bugs will have been removed by this stage...
    9. Test/correct till it runs with nPROC=5, i.e. nWRs=4.
       Tougher bugs will have been removed at this stage...
    10. Once the basic MPI operations are working, start adding one 
        routine/function call at a time, test as in 6, and periodically as in 7, 8, 9,
        with lots of print statements.
        Make sure you keep backup copies of each version that works 
        so that you can get back to a working version if worse comes to worst...
    11. Exchanging "boundary" values between neighbors:
        All the message passing between neighbors has to be done before FLUX
        routine is called. 
        Each PROCess (except Me=1) must send its bottom row (j=1) of U to its NodeDN 
        neighbor and receive that neighbor's top row as its boundary values (j=0).
         Also, each PROCess (except Me=nWRs) must send its top row (j=Mz) to NodeUP and
         receive that neighbor's bottom row as its top boundary values (j=Mz+1).
        Be careful with the indices and the logic!
        Do this crucial message passing on paper first !
                     GOOD LUCK and have fun !!!