Math578 - Alexiades
                    ACF info
  • ACF is UTK's Advanced Computing Facility (new, started 2017)

                  Running MPI code on ACF
    Clusters and HPC systems, like ACF, provide environments for running BATCH jobs.
    Resources are loaded by "module", which drastically simplifies the Makefile (barely need one!).
    Running code involves several steps you need to be aware of:
    Login, transfer files, put them in Lustre, compile, submit batch job (via 'qsub'), and wait for it to run...
    ACF consists of several clusters (Beacon, Monster, Rho, Sigma, ...), each with several nodes (Beacon has 48 compute nodes),
    plus several login nodes. Each node has 16 "cores" in 2 "sockets".

  • File systems: Home directories are mounted on login (service) nodes via NFS, but NOT mounted on the compute nodes.
    You MUST run jobs from the Lustre file system, which provides "scratch" space, mounted on all compute nodes.
      The envar $SCRATCHDIR points to your scratch space.
      Create a link to it in your home dir:   ln -s $SCRATCHDIR Scratch   so you can do:  cd Scratch
    Note: Lustre files are NOT backed up and are deleted after 30 days, so copy important files to your $HOME often.
  • Compilers:   There are 2 suites:   Intel: mpiicc, mpiicpc, mpiifort , and Gnu: gcc, g++, gfortran.
      They are loaded by module (see below). See "module commands". Intel compilers are usually faster than Gnu.
  • Scheduler and PBS: Jobs are submitted (to Torque manager and Moab scheduler) via a "PBSscript".
      Copy the following to a (plain text) file named PBSscript or download PBSscript
      For each run, you will need to set: nodes=?:ppn=? , walltime , jobname , -n ?? , code.x
    ############ PBSscript for ACF ##########
    #PBS -A ACF-UTK0011             #(this is our account number)
    #PBS -l nodes=1:ppn=11		#(requests 11 cores of 1 node)
    #PBS -l walltime=01:30:00	#( hh:mm:ss )
    #PBS -N name_for_your_job       #(short single_string e.g. J256on11)
    #PBS -j oe
    #PBS -k oe
    cd $PBS_O_WORKDIR		#(points to dir job is submitted from)
    ####------ ACF mpich ------:
    mpirun -n 11 ./code.x  	# < ./dat > ./OUT  (to redirect I/O) 
    ############ end of PBSscript ##########
    Submit with: qsub PBSscript
      This will schedule job "J256on11" to be run on 11 cores of one node (when resources become available...).
    Important: The batch system will allocate an entire node exclusively to you, even if you only use 1 core!
               The more nodes (and cores) you request, the longer it will take for your job to start running...
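
    A minimal sketch of generating the PBSscript from the per-run settings listed above, so that ppn and -n
    cannot get out of sync. The helper name make_pbs is hypothetical (not an ACF tool), and setting -n to
    nodes*ppn is an assumption (one MPI process per requested core).

```shell
# make_pbs NODES PPN WALLTIME JOBNAME EXE
# Print a PBSscript on stdout. Hypothetical helper, not part of ACF.
# Assumption: one MPI process per requested core, so -n = NODES * PPN.
make_pbs() {
    nodes=$1; ppn=$2; walltime=$3; jobname=$4; exe=$5
    nprocs=$((nodes * ppn))
    cat <<EOF
#PBS -A ACF-UTK0011
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -l walltime=${walltime}
#PBS -N ${jobname}
#PBS -j oe
#PBS -k oe
cd \$PBS_O_WORKDIR
mpirun -n ${nprocs} ./${exe}
EOF
}

# Example: the job from the script above (1 node, 11 cores, 1.5 hours):
#   make_pbs 1 11 01:30:00 J256on11 code.x > PBSscript
#   qsub PBSscript
```

    Printing to stdout makes it easy to look the script over before submitting it.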

    Job monitoring: qstat -a (qu script, see below), showq -r , checkjob, qdel , ...
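
    A small sketch of pulling your JobIDs out of the 'qstat -a' listing, for use with checkjob or qdel.
    The function name jobids_for is hypothetical, and it assumes that JobID and username are the first
    two columns of each listed job.

```shell
# jobids_for USER
# Read 'qstat -a' style lines on stdin and print the JobID (column 1)
# of every line whose username (column 2) is exactly USER.
# Hypothetical helper; the column layout is an assumption.
jobids_for() {
    awk -v user="$1" '$2 == user { print $1 }'
}

# Example on ACF:
#   qstat -a | jobids_for "$USER"
```

    Matching on the username column avoids the false hits that a plain grep can give when the
    username also appears inside a job name.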

                Steps for compiling and running code (see Running Jobs )
  • Login to ACF (with your smartphone at hand for Duo...):   ssh -X
    [ To get another terminal without going thru Duo:   on ACF type:  nohup xterm -bg black -fg cyan -fn 8x13bold -ls &
       other colors: aquamarine , khaki , peachpuff , seagreen , ... and can reverse -bg with -fg.
       Can create an alias in your ~/.bashrc : alias xt="nohup xterm ..... " , then:   . ~/.bashrc , then: xt
       Or download this fancier xtloc script into a file "xtloc", and make it executable: chmod u+x xtloc ]
  • On another window on your PC, zip (the dir with your) code into a zip file and scp it to ACF:
        scp -p
  • It will go to your $HOME . Copy it to your $SCRATCHDIR:  cp -ip Scratch
  • cd Scratch ;   unzip
  • cd CODE   Make sure you copy "PBSscript" into CODE/.
  • Check what's loaded: module list   7 are loaded by default (including Intel compilers).
      (To use Gnu compilers:   module swap PE-intel  PE-gnu)
  • Compile your code. Basically   mpiifort code.f90 -o code.x ...   or mpiicpc code.cpp -o code.x ...
      In Makefile you can insert: COMP = mpiifort or COMP = mpiicpc and then: make compile
    Note: To compile with '-fast' optimization, put these in Makefile, and do: make mpifast
    ##............on acf -fast  needs 2-steps:
            mpiifort $(code_f) -c -fPIE -fast
            mpiifort -pie $(code_o)  -o $(code).x 
  • Edit your PBSscript to customize it for this specific run. Then
  • submit the job:   qsub PBSscript   (or:  make pbs )
  • Check status:   qstat -a | grep $USER
      Better yet, use this   qu script   (put in a file "qu" and make it executable:   chmod u+x qu ).
      The first item displayed is JobID, needed for 'checkjob', 'qdel', ...
  • If a job is running and you want to kill it:   qdel JobID
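
    Since Lustre files are deleted after 30 days, one more step after a job finishes could be to copy the
    run directory back to your home dir. A sketch only: the helper save_run and the destination
    $HOME/acf_runs are hypothetical names.

```shell
# save_run RUN_DIR [DEST_ROOT]
# Copy a finished run directory into DEST_ROOT (default: $HOME/acf_runs),
# preserving timestamps and permissions. Hypothetical helper.
save_run() {
    run_dir=$1
    dest_root=${2:-$HOME/acf_runs}
    if [ ! -d "$run_dir" ]; then
        echo "save_run: no such directory: $run_dir" >&2
        return 1
    fi
    mkdir -p "$dest_root" || return 1
    cp -Rp "$run_dir" "$dest_root/" || return 1
    echo "saved $(basename "$run_dir") to $dest_root"
}

# Example, after the job has finished:
#   save_run "$SCRATCHDIR/CODE"
```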
                  Compile and run your SERIAL code on ACF
  • On your PC, put your lab3 code (and relevant files) into a dir "SERIAL".
  • Clean it up! Comment out any diagnostics, remove any and all interactive features.
      Only a data file should be read in (simplest way: a.out < dat ). Only the OUTPUT routine (and main, at the end
      of the run) should print, and only the essentials.
      [For C++ programmers, strongly recommend printing via printf(...) and not via "<<", it's much cleaner...]
  • Copy the 'PBSscript' and 'qu' scripts into SERIAL .
  • zip -oy SERIAL/*  
  • transfer to ACF: scp
  • Login on ACF.
  • Copy to your Scratch dir, and: unzip

    To compile and run with Intel compiler:
  • mv SERIAL SERIAL-intel  (rename it)
  • cd SERIAL-intel ;   module list , should see PE-intel
  • Compile it. Name the executable 'serial-intel.x'
  • Edit PBSscript and set: nodes=1:ppn=1 , jobname: Jintel ,   -n 1
  • qsub PBSscript
  • ./qu   to see if it's running and the JobID, something like:
      63157   username   Jintel   --   R   00:00:13
  • Hopefully it will run and give you what you expect! Record the timing.
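
    One way to record the timing, as a sketch: wrap the run in a small timer function. timed_run is a
    hypothetical helper; it would have to be defined inside PBSscript (above the mpirun line), and this
    assumes the job script is run by a Bourne-type shell (sh or bash).

```shell
# timed_run CMD [ARGS...]
# Run a command, then report wall-clock seconds and exit status on stderr.
# Hypothetical helper; resolution is 1 second because it uses 'date +%s'.
timed_run() {
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    echo "elapsed: $((end - start)) s (exit status $rc)" >&2
    return $rc
}

# In PBSscript the run line could then become:
#   timed_run mpirun -n 1 ./serial-intel.x
```

    With '#PBS -j oe' the timing line ends up in the same output file as the program's own output.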

    To compile and run with Gnu compiler:
  • cd .. (to parent dir) ;   unzip ;   mv SERIAL SERIAL-gnu ; cd SERIAL-gnu
  • module swap PE-intel PE-gnu
  • module list   should show PE-gnu
  • Repeat the above, replacing "intel" with "gnu"
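
    The renaming can be sketched with sed, assuming the Intel PBSscript uses the names suggested above
    (Jintel, serial-intel.x). The helper name to_gnu is hypothetical.

```shell
# to_gnu FILE
# Print FILE with every "intel" replaced by "gnu",
# e.g. Jintel -> Jgnu and serial-intel.x -> serial-gnu.x.
to_gnu() {
    sed 's/intel/gnu/g' "$1"
}

# Example, from inside SERIAL-gnu:
#   to_gnu ../SERIAL-intel/PBSscript > PBSscript
```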
          Good luck!

  • Then do the above with your parallelized par1D code !
          Good luck!
                ...if I forgot anything let me know...       last updated on 1oct17