M578 - Alexiades
          Steps to methodically parallelize your serial code
                    Master - Workers style
              read ALL of this page VERY carefully !

Basic ingredients:
  • serial code to be parallelized
  • Makefile for running commands with 'make', see below
  • machine with some version of MPI installed
  • for running on a cluster with a PBS scheduler you will also need a   PBSscript   file
  • Sooner or later you may want/need to consult some of these about MPI:
    tutorials, Open MPI, MPICH. The official standard is at the MPI Forum.

    Basic steps:   ...more details below...
  • organize, clean up, and debug your serial code, remove interactivity
  • parallelize your serial code... the hard part...
  • compile using a Makefile:   make compile
  • execute on your machine using nPROC processes:
        mpiexec -n  nPROC  ./code.x   < dat > o.out
  •   or   on a cluster (running PBS scheduler) via a script:   qsub PBSscript
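For orientation, a PBSscript is an ordinary shell script with #PBS directives at the top. The sketch below is illustrative only: the job name, resource lines, and the names code.x and dat are placeholders, and the exact nodes/ppn/walltime syntax varies by cluster, so check your cluster's documentation.

```shell
#!/bin/bash
#PBS -N mpijob             # job name (your choice)
#PBS -l nodes=1:ppn=5      # 1 node, 5 processes: 1 MR + 4 WRs
#PBS -l walltime=00:10:00  # wall-clock limit
#PBS -j oe                 # merge stdout and stderr into one file

cd $PBS_O_WORKDIR          # run from the directory you submitted from
mpiexec -n 5 ./code.x < dat > o.out
```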

    Some words of advice and caution:
  • ALWAYS keep backups: make a copy into a "try" dir and then modify.
  • Parallelize VERY carefully, cautiously, a little bit at a time. Use lots of print statements to see what's happening.
  • Think carefully... debugging is VERY hard, try to minimize it...
  • Once a routine is debugged, save it, back it up, and make a copy to modify further
      (Using a version-control tool, like 'svn', is recommended).
  • Insert plenty of comments indicating what is (supposed to be) happening... you'll be sorry if you don't...
  • Clarity and efficiency are paramount. Avoid coding tricks. Document what you are doing.
  • Parallelization will be only in the z-direction, so
       always choose Mz even and divisible by nWRs.

    Specific steps for parallelization : On your 1D code first !
    [Note on file names: I find it convenient to have all files of a code start with a common 
     prefix, like: z.main.f z.io.f ...., and all output files like  o.out o.prof o.hist ....]
    1. 1Dserial code:
      a. Place a copy of your serial 1D code in a directory, say, 1Dserial/ .
      b. Organize your code:  Split your serial code into separate files: z.main.f  z.io.f  z.setup.f  z.update.f
         with obvious contents: input/output in z.io.f, MESH/INIT in z.setup.f, FLUX/PDE in z.update.f
      c. Download the file  Makefile_serial.    Save As "Makefile" (must be PLAIN text file!). 
         Look inside it to see what a makefile looks like. It contains macros and directives for 'make'.
         Note that directives (like 'run:') are followed by a line starting with TAB  (not spaces!!!).
         Customize it (set names, compiler, ...).
      d. Try it:  make compile   (executable will be $(PROG).x)
                  make run       (it should run)
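For orientation, a stripped-down serial Makefile might look like the sketch below (the program name, compiler, and file list are placeholders; use the actual Makefile_serial from the course page). Note the TAB at the start of every recipe line:

```make
# Minimal sketch of a serial Makefile -- names are placeholders.
PROG = code
F77  = gfortran
OBJS = z.main.o z.io.o z.setup.o z.update.o

compile: $(OBJS)
	$(F77) -o $(PROG).x $(OBJS)

%.o: %.f
	$(F77) -c $<

run: compile
	./$(PROG).x < dat > o.out
```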
    2. 1Dparallel code:
      a. Copy (all files of) your 1Dserial code into a new dir, say, 1Dpar  or  lab6
      b. Comment out all subroutine/function calls. 
      c. Compile each file with:  COMPILER -c  z.*.f   and fix the worst problems.
      d. Make two copies of main.f:  mainMR.f, mainWR.f 
      e. In mainMR.f :  comment out what will NOT be done by MR. Compile.
         In mainWR.f :  comment out what will NOT be done by WR. Compile.
    3. Create a new  main.f  which: starts up MPI, calls MASTER() or WORKER() and shuts down MPI
       Here is a sample:
    -------------------------- main.f sample -------------------------------
          program main
          include 'mpif.h'		!!(include 'mpi.h' for C)
          ... (declare variables) ...
          call MPI_INIT( ierr )
          !....... nPROC is specified at mpirun or mpiexec, see Makefile....
          call MPI_COMM_SIZE( MPI_COMM_WORLD, nPROC, ierr )  !..returns nPROC
           mster = 0		! master gets rank=0
           nWRs  = nPROC - 1	! =number of workers
           !----------------- start 0, ... ,nWRs tasks ---------------!
           call MPI_COMM_RANK( MPI_COMM_WORLD, myID, ierr )  !..assigns myID
           IF( myID .EQ. mster ) THEN
                    tt0 = MPI_Wtime()       !...start wall-clock timer on MR
                call MASTER( nWRs, mster, ... )
                    tt1 = MPI_Wtime()       !...end timer
               print*,'>>main>> MR timing= ',tt1-tt0,' sec on ',nWRs,' WRs'
           ELSE
                call WORKER( nWRs, myID, ... )  !... now MPI is running ...!
                print*, 'Bye from WR:',myID,': ierr= ', ierr
              if( ierr .NE. 0 ) print*, '>>>> worker:',myID,' ended with ierr= ',ierr
           ENDIF
    !...termination: the only clean way to exit is this:
           call MPI_FINALIZE(ierr)
           end
    -------------------------------------------------------------------------
    4. Download   Makefile_parallel.   Save As  'Makefile'.
       Customize it for 1Dpar code (set names, compiler, ...).
    5. Parallelization strategy: Domain Decomposition along one direction only (z-direction)
    6. Start inserting MPI calls in  mainMR.f and corresponding in mainWR.f
       See the sample coding in "outlineMPI" and look up the syntax of MPI functions.
       Try your hardest to do it correctly the first time!!!
         (in C coding, remove the 'ierr' item from the arguments).
       Insert one MPI call, test it, fix it, then another,...
       Use a very coarse mesh (MM=4 or 8), and lots of print statements 
       to see what's happening (then comment out).
    7. Test/correct till it runs with nPROC=2, i.e. nWRs=1, on local machine, 
       (even though nothing is computed yet till step 10 below). 
       This is still essentially serial.
    8. Test/correct till it runs with nPROC=3, i.e. nWRs=2, on local machine. 
       This is now parallel !  Most bugs will have been removed by this stage...
    9. Test/correct till it runs with nPROC=5, i.e. nWRs=4.
       Tougher bugs will have been removed at this stage...
    10. Once the basic MPI operations are working, start adding one 
        routine/function call at a time, test as in 6, and periodically as in 7, 8, 9,
        with lots of print statements.
        Make sure you keep backup copies of each version that works 
        so that you can get back to a working version if worse comes to worst...
    11. Exchanging "boundary" values between neighbors:
        All the message passing between neighbors has to be done before FLUX
        routine is called. 
        Each PROCess (except Me=1) must send its bottom row (j=1) of U to its NodeDN 
        neighbor and receive that neighbor's top row as its boundary values (j=0).
         Also, each PROCess (except Me=nWRs) must send its top row (j=Mz) to NodeUP and
         receive that neighbor's bottom row as its top boundary values (j=Mz+1).
        Be careful with the indices and the logic!
        Do this crucial message passing on paper first !
                     GOOD LUCK and have fun !!!