Math578 - Alexiades
ACF is UTK's Advanced Computing Facility (new, started 2017)
Running MPI code on ACF
Clusters and HPC systems, like ACF, provide environments for running parallel (MPI) codes.
Resources are loaded via the "module" system,
which drastically simplifies the Makefile (you barely need one!).
Running code involves several steps you need to be aware of:
Login, transfer files, put them in Lustre, compile,
submit batch job (via 'qsub'), and wait for it to run...
ACF consists of several clusters: Beacon, Monster, Rho, Sigma, ...
each with several nodes (Beacon has 48 compute nodes),
plus several login nodes.
Each node has 16 "cores" in 2 "sockets".
Home directories are mounted on login (service) nodes via NFS,
but NOT mounted on the compute nodes.
You MUST run jobs from the Lustre file system,
which provides "scratch" space, mounted on all compute nodes.
The envar $SCRATCHDIR points to your scratch space.
Create a link to it in your home dir:
ln -s $SCRATCHDIR Scratch
so you can do: cd Scratch
Note: Lustre files are NOT backed up, and are deleted after
30 days, so copy important files to your $HOME often.
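The link-and-backup habit above can be sketched as follows. Since $SCRATCHDIR only exists on ACF, this sketch fakes $HOME and $SCRATCHDIR with temp dirs (an assumption for illustration only) so the commands can be tried anywhere:

```shell
# Sketch of the Scratch link + copy-back habit.
# On ACF, $HOME and $SCRATCHDIR are set for you; here temp dirs stand in
# for them so the sketch runs anywhere.
H=$(mktemp -d)              # stands in for $HOME
S=$(mktemp -d)              # stands in for $SCRATCHDIR
ln -sfn "$S" "$H/Scratch"   # the link:  ln -s $SCRATCHDIR Scratch
echo "results" > "$H/Scratch/OUT"   # pretend a run produced OUT in scratch
cp -p "$H/Scratch/OUT" "$H/"        # copy important files back to home
ls "$H"                             # shows OUT and Scratch
```

On ACF itself the real commands are just the ln -s and cp -ip lines, with the real $SCRATCHDIR and $HOME.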
There are 2 compiler suites: Intel: mpiicc, mpiicpc, mpiifort ,
and Gnu: gcc, g++, gfortran.
They are loaded by module (see below).
Intel compilers are usually faster than Gnu.
Scheduler and PBS:
Jobs are submitted (to Torque manager and Moab scheduler) via a "PBSscript".
Copy the following to a (plain text) file named PBSscript
For each run, you will need to set:
nodes=?:ppn=? , walltime , jobname , -n ?? , code.x
############ PBSscript for ACF ##########
#PBS -A ACF-UTK0011 #(this is our account number)
#PBS -l nodes=1:ppn=11 #(requests 11 cores of 1 node)
#PBS -l walltime=01:30:00 #( hh:mm:ss )
#PBS -N name_for_your_job #(short single_string e.g. J256on11)
#PBS -j oe #(join stderr into stdout)
#PBS -k oe #(keep the output files in your home dir)
cd $PBS_O_WORKDIR #(points to dir job is submitted from)
####------ ACF mpich ------:
mpirun -n 11 ./code.x # < ./dat > ./OUT (to redirect I/O)
############ end of PBSscript ##########
Submit with: qsub PBSscript
This will schedule job "J256on11" to be run on 11 cores of one node
(when resources become available...).
The batch system will allocate the entire node exclusively to you,
even if you only use 1 core!
The more nodes (and cores) you request, the longer it will take
for your job to start running...
Job monitoring: qstat -a (or the 'qu' script, see below), showq -r , checkjob , qdel , ...
Steps for compiling and running code
(see Running Jobs )
Login to ACF (with your smartphone at hand for Duo...):
ssh -X NetID@duo.acf.tennessee.edu
[ To get another terminal without going thru Duo:
on ACF type:
nohup xterm -bg black -fg cyan -fn 8x13bold -ls &
other colors: aquamarine , khaki , peachpuff , seagreen , ...
and you can swap the -bg and -fg colors.
You can create an alias in your ~/.bashrc :
alias xt="nohup xterm ..... " , then:
. ~/.bashrc , then: xt
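For instance, a concrete alias built from the xterm line above (the colors and font are just one possible choice):

```shell
# a concrete 'xt' alias using the xterm options shown above (one possible choice)
alias xt='nohup xterm -bg black -fg cyan -fn 8x13bold -ls &'
alias xt    # print the definition to check it
```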
Or download this fancier xtloc script into a file "xtloc", and make it executable: chmod u+x xtloc ]
On another window on your PC, zip (the dir with) your code into
CODE.zip and scp it to ACF:
scp -p CODE.zip NetID@acf-login2.nics.utk.edu:CODE.zip
It will go to your $HOME . Copy it to your $SCRATCHDIR:
cp -ip CODE.zip Scratch
cd Scratch ; unzip CODE.zip
cd CODE . Make sure you copy "PBSscript" into CODE/ .
Check what's loaded: module list
7 are loaded by default (including Intel compilers).
(To use Gnu compilers: module swap PE-intel PE-gnu)
Compile your code. Basically
mpiifort code.f90 -o code.x ...
or mpiicpc code.cpp -o code.x ...
In Makefile you can insert: COMP = mpiifort
or COMP = mpiicpc
and then: make compile
Note: To compile with '-fast' optimization, put these in Makefile, and do: make mpifast
##............on acf -fast needs 2-steps:
mpiifort $(code_f) -c -fPIE -fast
mpiifort -pie $(code_o) -o $(code).x
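Put together, a minimal Makefile along these lines would work (a sketch: the variable names COMP/code and the .f90 file name are assumptions, not ACF-mandated; recipe lines must start with a TAB):

```make
# minimal Makefile sketch (variable/target names are assumptions)
COMP = mpiifort          # or: mpiicpc
code = code              # assumes source code.f90 -> executable code.x

compile:
	$(COMP) $(code).f90 -o $(code).x

##............on acf -fast needs 2-steps:
mpifast:
	$(COMP) $(code).f90 -c -fPIE -fast
	$(COMP) -pie $(code).o -o $(code).x

pbs:
	qsub PBSscript
```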
Edit your PBSscript to customize it for this specific run. Then
submit the job: qsub PBSscript
(or: make pbs )
Check status: qstat -a | grep $USER
Better yet, use a little "qu" helper script
(put it in a file "qu" and make it executable: chmod u+x qu ).
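The actual 'qu' script is linked from the course page; a minimal sketch of such a helper might be (assumes Torque's qstat; the fallback message is only so it can be tried off-cluster):

```shell
#!/bin/bash
# qu -- show only my jobs (sketch; assumes Torque's 'qstat' is on the PATH)
qstat -a 2>/dev/null | grep -- "$USER" \
  || echo "no jobs listed (or qstat not available)"
```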
The first item displayed is JobID, needed for 'checkjob', 'qdel', ...
If a job is running and you want to kill it: qdel JobID
Compile and run your SERIAL code on ACF
On your PC, put your lab3 code (and relevant files)
into a dir "SERIAL".
Clean it up! Comment out any diagnostics, remove any and all interactive features.
Only a data file should be read in (simplest way: a.out < dat ).
Only the OUTPUT routine (and main, at the end of the run)
should print, and only the essentials.
[For C++ programmers, strongly recommend printing via
printf(...) and not via "<<", it's much cleaner...]
Copy the 'PBSscript' and 'qu' scripts into SERIAL .
zip -oy SERIAL.zip SERIAL/*
transfer to ACF: scp SERIAL.zip NetID@acf-login2.nics.utk.edu:SERIAL.zip
Login on ACF.
Copy SERIAL.zip to your Scratch dir, and: unzip SERIAL.zip
To compile and run with Intel compiler:
mv SERIAL SERIAL-intel (rename it)
cd SERIAL-intel ; module list (should show PE-intel)
Compile it. Name the executable 'serial-intel.x'
Edit PBSscript and set: nodes=1:ppn=1 ,
jobname: Jintel , -n 1 ; then submit: qsub PBSscript
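That is, for this serial run the changed PBSscript lines would look like this (a sketch; the walltime value is just an example, the other lines stay as before):

```
#PBS -l nodes=1:ppn=1        #(1 core of 1 node)
#PBS -l walltime=00:10:00    #(example value; pick a generous bound)
#PBS -N Jintel
...
mpirun -n 1 ./serial-intel.x
```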
./qu to see if it's running and the JobID, something like:
63157 username Jintel -- R 00:00:13
Hopefully it will run and give you what you expect!
Record the timing.
To compile and run with Gnu compiler:
cd .. (to parent dir) ;
unzip SERIAL.zip ; mv SERIAL SERIAL-gnu ;
module swap PE-intel PE-gnu
module list should show PE-gnu
Repeat the above, replacing "intel" with "gnu"
Then do the above with your parallelized par1D code !
...if I forgot anything let me know...
last updated on 1oct17