From Documentation
Jump to: navigation, search
(Can I find my job submission history?)
(How many jobs can I submit in one cluster?)
 
(30 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{LegacyPage}}
 
 
<!--This page is transcluded to the main FAQ page.  If you make changes here, make sure the changes show up on the main FAQ page.  You may have to make an edit to the main FAQ page to force a refresh. -->
 
<!--This page is transcluded to the main FAQ page.  If you make changes here, make sure the changes show up on the main FAQ page.  You may have to make an edit to the main FAQ page to force a refresh. -->
 
== Compiling and Running Programs ==
 
== Compiling and Running Programs ==
Line 7: Line 6:
 
For information about how to compile on older SHARCNET systems, see [[Legacy Systems]].
 
For information about how to compile on older SHARCNET systems, see [[Legacy Systems]].
  
 +
===How do I run a program interactively?===
  
=== Relocation overflow and/or truncated to fit errors ===
+
For running interactive jobs on '''graham''' and other national systems, see [https://docs.computecanada.ca/wiki/Running_jobs/en#Interactive_jobs Running jobs] page on Compute Canada wiki.
  
If you get "relocation overflow" and/or "relocation truncated to fit" errors when you compile
+
If trying interactive jobs on legacy systems, see [[Legacy Systems]].
big fortran 77 codes using pathf90 and/or ifort, then you should try the following:
+
 
+
(A) If the static data structures in your fortran 77 program are greater than 2GB
+
you should try specifying the option -mcmodel=medium in your pathf90 or ifort command.
+
 
+
(B) Try running the code on a different system which has more memory:
+
 
+
    Other clusters that you can try are: requin or hound
+
 
+
You would probably benefit from looking at the listing of all of the clusters:
+
 
+
https://www.sharcnet.ca/my/systems
+
 
+
and this page has a table showing how busy each one is:
+
 
+
https://www.sharcnet.ca/my/perf_cluster/cur_perf
+
 
+
=== How do I run a program? ===
+
 
+
In general, users are expected to run their jobs in "batch mode".  That is, one submits a job -- the application problem -- to a queue through a batch queue command, the scheduler schedules the job to run at a later time and sends the results back once the program is finished.
+
 
+
In particular, one will use [[sqsub|'''sqsub''']] command (see [[FAQ#What_is_the_batch_job_scheduling environment_SQ.3F|What is the batch job scheduling environment SQ?]] below) to launch a serial job foo
+
 
+
sqsub -o foo.log -r 5h ./foo
+
 
+
This means to submit the command <tt>foo</tt> as a job with a 5 hour runtime limit and put its standard output into a file <tt>foo.log</tt> (note that it is important to not put too tight of a runtime limit on your job as it may sometimes run slower than expected due to interference from other jobs).
+
 
+
If your program takes command line arguments, place the arguments after your program name just as when you run the program interactively
+
 
+
sqsub -o foo.log -r 5h ./foo arg1 arg2...
+
 
+
For example, suppose your program takes command line options <tt>-i input</tt> and <tt>-o output</tt> for input and output files respectively, they will be treated as the arguments of your program, not the options of <tt>sqsub</tt>, as long as they appear after your program in your sqsub command
+
 
+
sqsub -o foo.log -r 5h ./foo -i input.dat -o output.dat
+
 
+
If you have more than one sponsor and your non-primary sponsor has the username smith, you can attribute your usage to the smith project/accounting group like this:
+
 
+
sqsub -o foo.log -r 5h -p smith ./foo
+
 
+
To launch a parallel job <tt>foo_p</tt>
+
 
+
sqsub -q mpi -n num_cpus -o foo_p.log -r 5h ./foo_p
+
 
+
The basic queues on SHARCNET are:
+
 
+
{| class="wikitable" style="text-align:left" border="1"
+
! queue    !! usage
+
|-
+
| serial  || for serial jobs
+
|-
+
| mpi      || for parallel jobs using the MPI library
+
|-
+
| threaded || for threaded jobs using OpenMP or POSIX threads
+
|}
+
 
+
To see the status of submitted jobs, use command <tt>sqjobs</tt>.
+
 
+
=== How do I run a program interactively? ===
+
 
+
Several of the clusters now provide a collection of development nodes that can be used for this purpose.  An interactively session can also be started by submitting a <code>screen -D -m bash</code> command as a job.  If your job is a serial job, the submission line should be
+
 
+
sqsub -q serial -r <RUNTIME> -o /dev/null screen -D -fn -m bash
+
 
+
Once the job begins running, figure out what compute node it has launched on
+
 
+
sqjobs
+
 
+
and then ssh to this node and attach to the running screen session
+
 
+
ssh -t <NODE> screen -r
+
 
+
You can access screens options via the ''ctrl+a'' key stroke. Some examples are ''ctrl+a ?'' to bring up help and ''ctrl+a a'' to send a ''ctrl+a''. See the screen man page (man screen) for more information. The message ''Suddenly the Dungeon collapses!! - You die...'' is screen's way of telling you it is being killed by the scheduler (most likely because the time you specified for the job has elapsed). The <code>exit</code> command will terminate the session.
+
 
+
If your jobs is a MPI job, the submission line should be
+
 
+
sqsub -q mpi --nompirun -n <NODES> -r <TIME> -o /dev/null screen -D -fn -m bash
+
 
+
Once the job starts, the screen sessions will be launch screen on the rank zero node. This may not be the lowest number node allocated, so you have to run
+
 
+
qstat -f -l <JOBID> | egrep exec_host
+
 
+
to find out what node it is (the first one listed). You can then proceed as in the non-mpi case. The command <code>pbsdsh -o <COMMAND></code> can be used to run commands on all the allocated nodes (see the man pbsdsh), and the command <code>mpirun <COMMAND></code> can be used to start MPI programs on the nodes.
+
 
+
=== What about running a program compiled on one cluster on another? ===
+
 
+
In general, if your program starts executing on a system other than the one it was compiled on, then there are likely no issues.  However, you may want to compare results of test jobs just to make sure.  The specific list of things to watch out for are
+
 
+
# using a particular compiler and/or optimizations
+
# using a particular library (most frequently a specific MPI implementation)
+
 
+
In general, as long as very specific architecture optimizations are not being used, you should be able to compile a program on one SHARCNET system and run it on others as most systems are binary compatible and the compiler runtime libraries are installed everywhere.  In particular, this is true for our larger core systems and should be true for our other specialized systems as well (the big exception are executables compiled to use the GPU - these will only run on clusters containing GPU nodes).  It is worth noting that some compilers produce faster code on particular processors, and some compiler optimizations may not not work on all systems, so you may want to recompile in order to get the best performance.  We actually have different default compilers on different systems.  It is probably worth doing some comparisons on your own code because our tests show no clear winners.
+
 
+
With regard to MPI, and other libraries, you have to be a little more careful.  Most of the core systems have most of the same libraries and use OpenMPI by default though the default version will vary between clusters.  Programs which work on one system  should be able to run on another system without any modification as long as the OpenMPI version matches (at the end of the day as long as the runtime libraries and the necessary dependencies are installed you shouldn't have any problems).  If the default version of OpenMPI on a system is not the one needed, a different version can be selected via "module switch" command.
+
 
+
For example, if you compiled your program on one system using the default modules there, you should be able to run the same executable on another system with different defaults as long as you switch to those modules.  For example, switching from the default intel and openmpi modules can be accomplished with:
+
 
+
<pre>
+
module switch intel/11.0.083
+
module switch openmpi/intel/1.4.2
+
</pre>
+
 
+
To make sure the right modules are loaded, execute:
+
 
+
<pre>
+
module list
+
</pre>
+
 
+
Another thing to watch out for is using ''/home'' because it is global to legacy systems (not including Graham).  Because ''/home'' is global, it is slow and is not intended to be used as a working directory for running jobs.  If your program on legacy systems writes to the local /work and /scratch filesystems on the compute clusters, and you submit the job from /work or /scratch (so that the stdout gets written there), then running the executable from /home should be fine.  However, if it is run from and/or writes to ''/home'', then it will suffer a severe performance penalty.  It's probably easiest to set up your working directory in /work and then just symlink to your binary in /home.
+
  
 
=== My application runs on Windows, can I run it on SHARCNET? ===
 
=== My application runs on Windows, can I run it on SHARCNET? ===
Line 140: Line 32:
 
     dependency=afterok:<jobid>
 
     dependency=afterok:<jobid>
  
Other strategies for [https://docs.computecanada.ca/wiki/Running_jobs#Resubmitting_jobs_for_long_running_computations resubmitting jobs for long running computations] on the Slurm scheduled systems are described on the Compute Canada Wiki.
+
Other strategies for [https://docs.computecanada.ca/wiki/Running_jobs#Resubmitting_jobs_for_long_running_computations resubmitting jobs for long running computations] on the Slurm scheduled systems are described on the Compute Canada Wiki.
 
+
Job dependencies can be handled similarly on the legacy MOAB scheduled systems via the ''-w'' flag to the ''sqsub'' command. Once you have ensured that your job can automatically resume from a checkpoint the best way conduct long simulations is to submit a chain of jobs, such that each subsequent job depends on the jobid before it.  This will minimize the time your subsequent jobs will wait to run.
+
 
+
This can be done with the sqsub ''-w'' flag, eg.
+
 
+
    -w|--waitfor=jobid[,jobid...]]
+
                    wait for a list of jobs to complete
+
 
+
For example, consider the following instance where we want job #2 to start after job #1.  We first submit job #1:
+
 
+
[snuser@bul131 ~]$ sqsub -r 10m -o chain.test hostname
+
WARNING: no memory requirement defined; assuming 1GB
+
submitted as jobid 5648719
+
 
+
Now when we submit job #2 we specify the jobid from the first job:
+
 
+
[snuser@bul131 ~]$ sqsub -r 10m -w 5648719 -o chain.test hostname
+
WARNING: no memory requirement defined; assuming 1GB
+
submitted as jobid 5648720
+
 
+
Now you can see that two jobs are queued, and one is in state "*Q" - meaning that it has conditions:
+
 
+
[snuser@bul131 ~]$ sqjobs
+
  jobid  queue state ncpus nodes time command
+
------- ------ ----- ----- ----- ---- -------
+
5648719 serial    Q    1    -  15s hostname
+
5648720 serial    *Q    1    -  2s hostname
+
2232 CPUs total, 1607 busy; 1559 jobs running; 1 suspended, 6762 queued.
+
403 nodes allocated; 154 drain/offline, 558 total.
+
 
+
Looking at the second job in detail we see that it will not start until the first job has completed with an "afterok" status:
+
 
+
[snuser@bul131 ~]$ qstat -f 5648720 | grep -i depend
+
    depend = afterok:5648719.krasched@krasched
+
    -N hostname -l pvmem=1024m -m n -W depend=afterok:5648719 -l walltime=
+
 
+
In this fashion it is possible to string many jobs together.  The second job (5648720) should continue to accrue priority in the queue while the first job is running, so once the first job completes the second job should start much more quickly than if it were submitted after the first job completed.
+
  
 
=== How do I checkpoint/restart my program? ===
 
=== How do I checkpoint/restart my program? ===
  
 
Checkpointing is a valuable strategy that minimizes the loss of valuable compute time should a long running job be unexpectedly killed by a power outage, node failure, or hitting its runtime limit. On the national systems checkpointing can be accomplished manually by creating and loading your own custom checkpoint files or by using the Distributed MultiThreaded CheckPointing (DMTCP) software without having to recompile your  program. For further documentation of the checkpointing and DMTCP software see the [https://docs.computecanada.ca/wiki/Points_de_contr%C3%B4le/en Checkpoints] page at the Compute Canada Wiki site.
 
Checkpointing is a valuable strategy that minimizes the loss of valuable compute time should a long running job be unexpectedly killed by a power outage, node failure, or hitting its runtime limit. On the national systems checkpointing can be accomplished manually by creating and loading your own custom checkpoint files or by using the Distributed MultiThreaded CheckPointing (DMTCP) software without having to recompile your  program. For further documentation of the checkpointing and DMTCP software see the [https://docs.computecanada.ca/wiki/Points_de_contr%C3%B4le/en Checkpoints] page at the Compute Canada Wiki site.
 
Assuming that the code is serial or multi-threaded (*not* MPI), you can use Berkeley Labs Checkpoint Restart software (BLCR) on legacy SHARCNET systems.  Documentation and usage instructions can be found on SHARCNET's [http://www.sharcnet.ca/my/software/show/74 BLCR software] page.  Note that BLCR requires your program to use shared libraries (not be statically compiled).
 
  
 
If your program is MPI based (or any other type of program requiring a specialized job starter to get it running), it will have to be coded specifically to save state and restart from that state on its own.  Please check the documentation that accompanies any software you are using to see what support it has for checkpointing.  If the code has been written from scratch, you will need to build checkpointing functionality into it yourself---output all relevant parameters and state such that the program can be subsequently restarted, reading in those saved values and picking up where it left off.
 
If your program is MPI based (or any other type of program requiring a specialized job starter to get it running), it will have to be coded specifically to save state and restart from that state on its own.  Please check the documentation that accompanies any software you are using to see what support it has for checkpointing.  If the code has been written from scratch, you will need to build checkpointing functionality into it yourself---output all relevant parameters and state such that the program can be subsequently restarted, reading in those saved values and picking up where it left off.
Line 196: Line 49:
  
 
It is important to note that the estimated start time listed in the START_TIME column (if available) can change substantially over time. This start time estimate is based on the current state of the compute nodes and list of jobs in the queue. Because the state of the compute nodes and list of jobs in the queue are constantly changing the start time estimates for pending jobs can change for several reasons (running jobs end sooner than expected, higher priority jobs enter the queue, etc). For more information regarding the variables that affect wait times in the queue see the [https://docs.computecanada.ca/wiki/Job_scheduling_policies job scheduling policy] page at the Compute Canada Wiki site.
 
It is important to note that the estimated start time listed in the START_TIME column (if available) can change substantially over time. This start time estimate is based on the current state of the compute nodes and list of jobs in the queue. Because the state of the compute nodes and list of jobs in the queue are constantly changing the start time estimates for pending jobs can change for several reasons (running jobs end sooner than expected, higher priority jobs enter the queue, etc). For more information regarding the variables that affect wait times in the queue see the [https://docs.computecanada.ca/wiki/Job_scheduling_policies job scheduling policy] page at the Compute Canada Wiki site.
 
=== How do I run a program remotely? ===
 
 
It is also possible to specify a command to run on the end of a ssh command.  A command like <tt>ssh narwhal.sharcnet.ca sqjobs</tt>, however, will not work because ssh does not setup a full environment by default.  In order to get the same environment you get as when you login, it is necessary to run the command under bash in login mode.
 
 
myhost$ ssh username@graham.computecanada.ca bash -l -c squeue
 
 
If you wish to specify a command longer than a single word, it is necessary to quote it as the bash <tt>-c</tt> only takes a single argument.  In order to pass these quotes through to ssh, however, it is necessary to escape them.  Otherwise the local shell will interpret them and strip them off.  An example is
 
 
myhost$ ssh username@graham.computecanada.ca bash -l -c \'squeue -u username'\
 
 
Most problems with these commands are related to the local shell interpreting things that you wish to pass through to the remote side (e.g., stripping out any unescaped quotes).  Use <tt>-v</tt> with ssh and <tt>set -x</tt> with bash to see what command(s) ssh and bash are executing respectively.
 
 
myhost$ ssh -v username@graham.computecanada.ca bash -l -c \'squeue -u username\'
 
myhost$ ssh username@graham.computecanada.ca bash -l -c \' set -x\;squeue -u username\'
 
  
 
=== Is package X preinstalled on system Y, and, if so, how do I run it? ===
 
=== Is package X preinstalled on system Y, and, if so, how do I run it? ===
Line 217: Line 55:
  
 
For legacy SHARCNET systems the list of preinstalled packages (with running instructions) can be found on the [https://www.sharcnet.ca/my/software SHARCNET software page].
 
For legacy SHARCNET systems the list of preinstalled packages (with running instructions) can be found on the [https://www.sharcnet.ca/my/software SHARCNET software page].
 
=== A package is available sometimes as the default, sometimes as a module. What is the difference?===
 
 
We have implemented the Modules system for all supported software packages on our clusters - each version of each software package that we have installed can be dynamically loaded or unloaded in your user environment with the ''module'' command. 
 
 
See the [https://docs.computecanada.ca/wiki/Utiliser_des_modules/en using modules] page on the Compute Canada Wiki site for further information, including examples.
 
 
=== What is the batch job scheduling environment SQ? ===
 
 
SQ is a unified frontend for submitting jobs on SHARCNET, intended to hide unnecessary differences in how the clusters are configured.  On clusters which are based on RMS, LSF+RMS, or Torque+Maui, SQ is just a thin shell of scripting over the native commands.  On Wobbie, the native queuing system is called SQ.
 
 
To submit a job, you use [[sqsub|<tt>sqsub</tt>]]:
 
 
sqsub -n 16 -q mpi -r 5h ./foo
 
 
This submits <tt>foo</tt> as an MPI command on 16 processors with a 5 hour runtime limit (make sure to be somewhat conservative with the runtime limit as a job may run for longer than expected due to interference from other jobs).  You can control input, output and error output using these flags:
 
 
sqsub -o outfile -i infile -e errfile -r 5h ./foo
 
 
this will run <tt>foo</tt> with its input coming from a file named <tt>infile</tt>, its standard output going to a file named <tt>outfile</tt>, and its error output going to a file named <tt>errfile</tt>.  Note that using these flags is preferred over shell redirection, since the flags permit your program to do IO directly to the file, rather than having the IO transported over sockets, then to a file.
 
 
For threaded applications (which use Pthreads, OpenMP, or fork-based parallelism), do this:
 
 
sqsub -q threaded -n 2 -r 5h -o outfile ./foo
 
 
For serial jobs
 
 
sqsub -r 5h -o outfile ./foo
 
 
=== How do I check running jobs and control jobs under SQ? ===
 
 
To show your jobs, use "sqjobs". by default, it will show you only your own jobs. with "-a" or "-u all", it will show all users. similarly, "-u someuser" will show jobs only for this particular user.
 
 
the "state" listed for a job is one of the following:
 
* Q - queued
 
* R - running
 
* Z - suspended (sleeping)
 
* D - done (shown briefly on some systems)
 
* ? - unknown (something is wrong, such as a node crashing)
 
 
times shown are the amount of time since being submitted (for queued jobs) or starting (for all others).
 
 
To kill, suspend or resume your jobs, use sqkill/suspend/resume with the job ID as shown by sqjobs.
 
  
 
=== Command 'top' gives me two different memory size (virt, res). What is the difference between 'virtual' and 'real' memory? ===
 
=== Command 'top' gives me two different memory size (virt, res). What is the difference between 'virtual' and 'real' memory? ===
Line 276: Line 71:
 
Yes. For instance, suppose you have a number of source files <tt>main.f, sub1.f, sub2.f, ..., subN.f</tt>, to compile these source code to generate an executable myprog, it's likely that you will type the following command
 
Yes. For instance, suppose you have a number of source files <tt>main.f, sub1.f, sub2.f, ..., subN.f</tt>, to compile these source code to generate an executable myprog, it's likely that you will type the following command
  
  f77 main.f sub1.f sub2.f ... sub N.f -llapack -o myprog  
+
  ifort main.f sub1.f sub2.f ... sub N.f -llapack -o myprog  
  
 
Here, the <tt>-o</tt> option specifies the executable name myprog rather than the default <tt>a.out</tt> and the option <tt>-llapack</tt> at the end tells the compiler to link your program against the LAPACK library, if LAPACK routines are called in your program. If you have long list of files, typing the above command every time can be really annoying. You can instead put the command in a file, say, mycomp, then make mycomp executable by typing the following command
 
Here, the <tt>-o</tt> option specifies the executable name myprog rather than the default <tt>a.out</tt> and the option <tt>-llapack</tt> at the end tells the compiler to link your program against the LAPACK library, if LAPACK routines are called in your program. If you have long list of files, typing the above command every time can be really annoying. You can instead put the command in a file, say, mycomp, then make mycomp executable by typing the following command
Line 289: Line 84:
  
 
This is a simple way to minimize typing, but it may wind up recompiling code which has not changed.  A widely used improvement, especially for larger/many source files, is to use [http://mrbook.org/blog/tutorials/make/ make].  <tt>make</tt> permits recompilation of only those source files which have changed since last compilation, minimizing the time spent waiting for the compiler.  On the other hand, compilers will often produce faster code if they're given all the sources at once (as above).
 
This is a simple way to minimize typing, but it may wind up recompiling code which has not changed.  A widely used improvement, especially for larger/many source files, is to use [http://mrbook.org/blog/tutorials/make/ make].  <tt>make</tt> permits recompilation of only those source files which have changed since last compilation, minimizing the time spent waiting for the compiler.  On the other hand, compilers will often produce faster code if they're given all the sources at once (as above).
 
=== I get errors trying to redirect input into my program when submitted to the queues, but it runs fine if run interactively ===
 
 
The standard method to attach a file as the input to a program when submitting to SHARCNET queues is to use the <tt>-i</tt> flag to <tt>sqsub</tt>, e.g.:
 
 
sqsub -q serial -i inputfile.txt ...
 
 
Occasionally you will encounter a situation where this approach appears not to work, and your program fails to run successfully (reasons for which can be very subtle).  Here is an example of one such message that was being generated by a FORTRAN program:
 
 
lib-4001 : UNRECOVERABLE library error
 
    A READ operation tried to read past the end-of-file.
 
 
Encountered during a list-directed READ from unit 5
 
Fortran unit 5 is connected to a sequential formatted text file
 
    (standard input).
 
/opt/sharcnet/sharcnet-lsf/bin/sn_job_starter.sh: line 75: 25730 Aborted (core dumped) "$@"
 
 
yet if run on the command line, using standard shell redirection, it works fine, e.g.:
 
 
program < inputfile.txt
 
 
Rather than struggle with this issue, there is an easy workaround: instead of submitting the program directly, submit a script that takes the name of the file for input redirection as an argument, and have that script launch your program making use of shell redirection.  This circumvents whatever issue the scheduler is having by not having to do the redirection of the input via the submission command.  The following shell script will do this (you can copy this directly into a text file and save it to disk; the name of the file is arbitrary but we'll assume it to be <tt>exe_wrapper.sh</tt>).
 
 
{| border="0" cellpadding="5" cellspacing="0" align="center"
 
! style="background:#8AA8E5;" | ''Bash Shell script'': <tt>exe_wrapper.sh</tt>
 
|-
 
|<source lang="bash">
 
#!/bin/bash
 
 
EXENAME=replace_with_name_of_real_executable_program
 
 
if (( $# != 1 )); then
 
        echo "ERROR: incorrect invocation of script"
 
        echo "usage: ./exe_wrapper.sh <input_file>"
 
        exit 1
 
fi
 
 
./${EXENAME} < ${1}
 
</source>
 
|}
 
 
Note that you must edit the <tt>EXENAME</tt> variable to reference the name of the actual executable, and can be easily modified to take or provide additional arguments to the program being executed as desired.  Ensure the script is executable by running <tt>chmod +x exe_wrapper.sh</tt>.  You can now submit the job by submitting the *script*, with a single argument being the file to be used as input, i.e:
 
 
sqsub -q serial -r 5h -o outputfile.log ./exe_wrapper.sh intputfile.txt
 
 
This will result in the job being run on a compute node as if you had typed:
 
 
./program < inputfile.txt
 
 
NOTE: this workaround, as provided, will only work for serial programs, but can be modified to work with MPI jobs by further leveraging the <tt>--nompirun</tt> option to the scheduler, and launching the parallel job within the script using <tt>mpirun</tt> directly.  This is explained [[Knowledge_Base#How do I submit an MPI job such that it doesn't automatically execute mpirun? | below]].
 
 
=== How do I submit an MPI job such that it doesn't automatically execute mpirun? ===
 
 
This can be done by using the <tt>--nompirun</tt> flag when submitting your job with <tt>sqsub</tt>.  By default, MPI jobs submitted via <tt>sqsub -q mpi</tt> are expected to be MPI programs, and the system automatically launches your program with mpirun.  While this is convenient in most cases, some users may want to implement pre or post processing for their jobs, in which case they may want to encapsulate their MPI job in a shell script.
 
 
Using <tt>--nompirun</tt> means that you have to take responsibility for providing the correct MPI launch mechanism, which depends on the scheduler as well as the MPI library in use.  You can actually see what the system default is by running <tt>sqsub -vd ...</tt>.
 
 
{| class="wikitable" style="text-align:left" border="1"
 
! system !! mpi launch prefix
 
|-
 
| most  || /opt/sharcnet/openmpi/VERSION/COMPILER/bin/mpirun
 
|}
 
'''NOTE''': VERSION is the version number, COMPILER is the compiler used to compile the library, eg. ''/opt/sharcnet/openmpi/1.6.2/intel/bin/mpirun''
 
 
Our legacy systems (eg. Goblin, Monk and others) are based on Centos, Torque/Maui/Moab and OpenMPI.
 
 
The basic idea is that you'd write a shell script (eg. named mpi_job_wrapper.x) to do some actions surrounding your actual MPI job (using requin as an example here):
 
 
#!/bin/bash
 
echo "hello this could be any pre-processing commands"
 
/opt/hpmpi/bin/mpirun -srun ./mpi_job.x
 
echo "hello this could be any post-processing commands"
 
 
You would then make this script executable with:
 
 
chmod +x mpi_job_wrapper.x
 
 
and submit this to run on 4 cpus for 7 days with job output sent to <i>wrapper_job.out</i>:
 
 
sqsub -r 7d -q mpi -n 4 --nompirun -o wrapper_job.out ./mpi_job_wrapper.x
 
 
now you would see the following output in ./wrapper_job.out:
 
 
hello this could be any pre-processing commands
 
<any output from the MPI job>
 
hello this could be any post-processing commands
 
 
On newer clusters, due to the extreme spread of memory and cores across sockets/dies, getting good performance requires binding your processes to cores so they don't wander away for the local resource they start using.  The  <tt>mpirun</tt> flags <tt>--bind-to-core</tt> and <tt>--cpus-per-proc</tt> are for this.  If <tt>sqsub -vd ...</tt> shows these flags, make sure to duplicate them in your own scripts.  If it does not show them, do not use them.  They require special scheduler support, and without this, your process will windup bound to cores other jobs are using
 
 
There are a number of reasons NOT to use your own scripts as well: with --nompirun, your job will have allocated a number of cpus, but the non-MPI portions of your script will run serially.  This wastes cycles on all but one of the processors - a serious concern for long serial sections and/or jobs with many cpus.  "sqsub --waitfor" provides a potentially more efficient mechanism for chaining jobs together, since it permits a hypothetical serial post-processing step to allocate only a single CPU.
 
 
But this also brings up another use-case: your --nompirun script might also consist of multiple MPI sub-jobs.  For instance, you may have chosen to break up your workflow into two separate MPI programs, and want to run them successively.  You can do this with such a script, including possible adjustments, perhaps to output files, between the two MPI programs.  Some of our users have done iterative MPI jobs this way, were an MPI program is run, then its outputs massaged or adjusted, and the MPI program run again.  Strictly speaking, you can do whatever you want with the resources you allocate as part of a job - multiple MPI subjobs, serial sections, etc.
 
 
Some cases need to know the allocated node names and the number of cpus on the node in order to construct its own hostfile or so. This is possible by using '$LSB_MCPU_HOSTS' environment variable. You may insert lines below into your bash script
 
 
echo $LSB_MCPU_HOSTS
 
arr=($LSB_MCPU_HOSTS)
 
echo "Hostname= ${arr[0]}"
 
echo "# of cpus= ${arr[1]}"
 
 
Then, you may see
 
 
bru2 4
 
Hostname= bru2
 
# of cpus= 4
 
 
in your output file. Utilizing this, you can construct your own hostfile whenever you submit your job.
 
 
The following example shows a job wrapper script (eg. ./mpi_job_wrapper.x ) that translates an LSF job layout to an OpenMPI hostfile, and launches the job on the nodes in a round robin fashion:
 
 
  #!/bin/bash
 
  echo 'hosts:' $LSB_MCPU_HOSTS
 
  arr=($LSB_MCPU_HOSTS)
 
  if [ -e ./hostfile.$$ ]
 
  then
 
                  rm -f ./hostfile.$$
 
  fi
 
  for (( i = 0 ; i < ${#arr[@]}-1 ; i=i+2 ))
 
  do
 
                  echo ${arr[$i]} slots=${arr[$i+1]} >> ./hostfile.$$
 
  done
 
  /opt/sharcnet/openmpi/current/intel/bin/mpirun -np 2 -hostfile ./hostfile.$$ -bynode ./a.out
 
 
Note that one would still have to set the desired number of process in the final line (in this case it's only set to 2).  This could serve as a framework for developing more complicated job wrapper scripts for OpenMPI on the XC systems.
 
 
If you are having issues with using <tt>--nompirun</tt> we recommend that you submit a [https://www.sharcnet.ca/my/problems/submit problem ticket] so that staff can help you figure out how it should be utilized on the particular system you are using.
 
 
=== How do I submit a large number of jobs with a script? ===
 
 
There are two methods: you can pack a large number of runs into a single submitted job, or you can use a script to submit a large number of jobs to the scheduler.
 
 
With the first method, you would write a shell script (let us call it start.sh) similar to the one found above. On requin with the older HP-MPI it would be something like this:
 
 
#!/bin/csh
 
/opt/hpmpi/bin/mpirun -srun ./mpiRun1 inputFile1
 
/opt/hpmpi/bin/mpirun -srun ./mpiRun2 inputFile2
 
/opt/hpmpi/bin/mpirun -srun ./mpiRun3 inputFile3
 
echo Job finishes at `date`.
 
exit
 
 
On orca with OpenMPI the script would be (note that the number of processors should match whatever you specify with sqsub):
 
 
#!/bin/bash
 
/opt/sharcnet/openmpi/1.6.2/intel/bin/mpirun -np 4 --machinefile $PBS_NODEFILE ./mpiRun1
 
/opt/sharcnet/openmpi/1.6.2/intel/bin/mpirun -np 4 --machinefile $PBS_NODEFILE ./mpiRun2
 
/opt/sharcnet/openmpi/1.6.2/intel/bin/mpirun -np 4 --machinefile $PBS_NODEFILE ./mpiRun3
 
 
Then you can submit it with:
 
 
sqsub -r 7d -q mpi -n 4 --nompirun -o outputFile ./start.sh
 
 
Your mpi runs (mpiRun1, mpiRun2, mpiRun3) will run one at a time, using all available processors within the job's allocation, i.e. whatever you specify with the -n option in sqsub. Please be aware of the total execution time for all runs, as with a large number of jobs it can easily exceed the maximum allowed 7 days, in which case the remaining runs will never start.
 
 
With the second method, your script would contain sqsub inside it. This approach is described in [[Serial / parallel farming (or throughput computing)]].
 
  
 
=== I have a program that runs on my workstation, how can I have it run in parallel? ===
 
=== I have a program that runs on my workstation, how can I have it run in parallel? ===
Line 450: Line 91:
  
 
Also, the preceding answer pertains only to the idea of running a ''single'' program faster using parallelism. Often, you might want to run many different configurations of your program, differing only in a set of input parameters. This is common when doing Monte Carlo simulation, for instance. It's usually best to start out doing this as a series of independent serial jobs. It ''is'' possible to implement this kind of loosely-coupled parallelism using MPI, but often less efficient and more difficult.
 
Also, the preceding answer pertains only to the idea of running a ''single'' program faster using parallelism. Often, you might want to run many different configurations of your program, differing only in a set of input parameters. This is common when doing Monte Carlo simulation, for instance. It's usually best to start out doing this as a series of independent serial jobs. It ''is'' possible to implement this kind of loosely-coupled parallelism using MPI, but often less efficient and more difficult.
 
=== How can I have a quick test run of my program? ===
 
Debugging and development often require the ability to quickly test your program repeatedly.  At SHARCNET we facilitate this work by providing a pre-emptive testing queue on some of our clusters, and a set of interactive development nodes on the larger clusters. 
 
 
The test queue is highly recommended for most test cases as it is convenient and prepares one for eventually working in the production environment.  Unfortunately the test queue is only available on Requin, Goblin and Kraken.  Development nodes allow users to work interactively with their program outside of the job scheduling system and production environment, but we only set aside a limited number of them on the larger clusters.  The rest of this section will only address the test queue, for more information on development nodes see the [[Kraken]], [[Orca]] or [[Saw]] cluster pages.
 
 
The test queue allows one to quickly test their program in the job environment to ensure that the job will start properly, and can be useful for debugging.  It also has the benefit that it will allow you to debug any size of job.  Do not abuse the test queue as it will have an impact on your fairshare job scheduling priority and it has to interrupt other user's production jobs temporarily, slowing down other users of the system.
 
 
Note that the flag for submitting to the test queue is provided '''in addition''' to the regular queue selection flag.  If you are submitting a MPI job to the test queue, both <tt>-q mpi</tt> and <tt>-t</tt> should be provided.  If you omit the <tt>-q</tt> flag, you may get odd errors about libraries not being found, as without knowing the type of job, the system simply doesn't know how to start your program correctly. 
 
 
To perform a test run, use sqsub option <tt>--test</tt> or <tt>-t</tt>. For example, if you have an MPI program mytest that uses 8 processors, you may use the following command
 
 
sqsub --test -q mpi -n 8 -o mytest.log ./mytest
 
 
The only difference here is the addition of the "<tt>--test</tt>" flag (note <tt>-q</tt> appears as would be normal for the job). The scheduler will normally start such test jobs within a few seconds.
 
 
The main purpose of the test queue is quickly verify the startup of a changed job - just to test that for a real, production run, it won't hit a bug shortly after starting due to, for instance, missing parameters.
 
 
The "test queue" only allows a job to run for a short period of time (currently 1 hour), therefore you must make sure that your test run will not take longer than this to finish.  Only one test job may be run at a time.  In addition, the system monitors the user submissions and decreases the priority of submitted jobs over time within an internally defined time window. Hence if you keep submitting jobs as test runs, the waiting time before those jobs get started will be getting longer, or you will not be able to submit test jobs any more.  Test jobs are treated as "costing" four times as much as normal jobs.
 
 
=== Which system should I choose? ===
 
There are many clusters, many of them specialized in some way.  We provide an [http://www.sharcnet.ca/my/systems/clustermap interactive map] of SHARCNET systems on the web portal which visually presents a variety of criteria as a decision making aid.  In brief however, depending on the nature of your jobs, there may be a clear preference for which cluster is most appropriate:
 
;is your job serial?
 
:Kraken is probably the right choice, since it has a very large number of processors, and consequently has high throughput. Your job will probably run soonest if you submit it here.
 
;do you use a lot of memory?
 
:Orca or Hound is probably the right choice.
 
;does your MPI program utilize a lot of communication?
 
:Orca, Saw, Requin, Hound, Angel, Monk and Brown have the fastest networks, but it's worth trying Kraken if you aren't familiar with the specific differences between Quadrics, Myrinet and Infiniband.
 
;does your job (or set of jobs) do a lot of disk IO?
 
:you probably want to stick to one of the major clusters (Orca/Requin/Saw) which have bigger and much faster (parallel) filesystems.
 
  
 
=== Where can I find available resources? ===
 
=== Where can I find available resources? ===
Line 485: Line 96:
 
* systems we maintain are listed on the SHARCNET web site's [https://www.sharcnet.ca/my/systems systems page] and on the [https://www.sharcnet.ca/my/perf_cluster/cur_perf cluster performance page], as well as,
 
* systems we maintain are listed on the SHARCNET web site's [https://www.sharcnet.ca/my/systems systems page] and on the [https://www.sharcnet.ca/my/perf_cluster/cur_perf cluster performance page], as well as,
 
* national Compute Canada systems/clouds via [http://status.computecanada.ca/ national systems status page] (click on the system/cloud links to get more information about a system).
 
* national Compute Canada systems/clouds via [http://status.computecanada.ca/ national systems status page] (click on the system/cloud links to get more information about a system).
 
<blockquote>
 
<B>SHARCNET Legacy Systems</B>
 
 
The change of status of each system, such as down time, power outage, etc is announced through the following three different channels:
 
 
* '''Web links under [https://www.sharcnet.ca/my/perf_cluster/cur_perf systems]'''. You need to check the web site from time to time in order to catch such public announcements.
 
 
* '''System notice mailing list'''. This is the passive way of being informed. You receive the notices in e-mail as soon as they are announced. But some people might feel it is annoying to be informed. Also, such notices may be buried in dozens or hundreds of other e-mail messages in your mail box, hence are easily ignored.
 
 
* '''SHARCNET [https://www.sharcnet.ca/my/news/rss RSS broadcasting]'''. A good analogy of RSS is like traffic information on the radio. When you are on a road trip and you want to know what the traffic conditions are ahead, you turn on the car radio, tune-in to a traffic news station and listen to updates periodically. Similarly, if you want to know the status of SHARCNET systems or the latest SHARCNET news, events and workshops, you can turn to RSS feeds on your desktop computer.
 
 
The following feeds SHARCNET RSS feeds are available :
 
*[https://www.sharcnet.ca/my/news/system_update_rss system update]     
 
*[https://www.sharcnet.ca/my/news/news_events_rss news and events] 
 
*[https://www.sharcnet.ca/my/news/in_the_news_rss in the news]
 
 
The term RSS may stand for Really Simple Syndication, RDF Site Summary, or Rich Site Summary depending on the version. Written in the format of XML, RSS feeds are  used by websites to syndicate their content. RSS feeds allow you to read through the news you want, at your own convenience. The messages will show up on you desktop, e.g. using [http://www.mozilla.org/en-US/thunderbird/ Mozilla Thunderbird], an integrated mail client software, as soon as there is an update.  If you have a Gmail, a convenient RSS access option may be [http://reader.google.com  Google Reader]
 
</blockquote>
 
  
 
=== Can I find my job submission history? ===
 
=== Can I find my job submission history? ===
Line 516: Line 108:
 
=== How many jobs can I submit in one cluster? ===
 
=== How many jobs can I submit in one cluster? ===
  
max_user_queable=1000
+
Currently Graham has a limit of 1000 submitted jobs per user.
 
+
This means 1000 jobs max, either running or queued. Once jobs finish running, they can submit more upto the max again.
+
  
 
=== How are jobs scheduled? ===
 
=== How are jobs scheduled? ===
Line 526: Line 116:
 
==== How long will it take for my queued job to start? ====
 
==== How long will it take for my queued job to start? ====
  
In practice, if your potential job does not cause you to exceed your user certification per-user process limit and there are enough free resources to satisfy the processor and memory layout you've requested for your job, and no one else has any jobs queued, then you should expect your jobs to start immediately. Once there are more jobs queued than available resources, the scheduler will attempt to arbitrate between the resource (CPU, memory, walltime) demands of all queued jobs. This arbitration happens in the following order: Dedicated Resource jobs first, then "test" jobs (which may also preempt normal jobs), and finally normal jobs. Within the set of pending normal jobs, the scheduler will prefer jobs belonging to groups which have high Fairshare priority (see below).
+
On national Compute Canada systems and systems running with Slurm, you can see the estimated time your queued jobs will start by running:
  
For information on expected queue wait times, users can check the [https://www.sharcnet.ca/my/perf_cluster/cur_perf Recent Cluster Statistics] table in the web portal.  This is historical data and may not correspond to the current job load on the cluster, but it is useful for identifying longer-term trends.  The idea is that if you are waiting unduly long on a particular cluster for your jobs to start, you may be able to find another similar cluster where the waittime is shorter.
+
  squeue --start -u USER
  
Although it is not possible to predict the start time of queued job with much accuracy there are some tools that can be used while logged into the systems that can help estimate a relevant wait time range for your specific jobs.
+
and replace USER with the name of the account that submitted the job.
 
+
First of all it is important to gather information about the current state of the scheduling queue. By exploring the currently running and queued jobs in the queue you can get a general picture of how busy the system is. With these tools it is also possible to get a more specific picture of queue times for jobs that are similar to your jobs in terms of resource requests. Because the resource requests of a job play a major role in dictating its wait time it is important to base queue time estimates on jobs that have similar requests.
+
 
+
The program showq can be used to view the jobs that are currently running and queued on many system:
+
 
+
$ showq
+
+
active jobs------------------------
+
JOBID              USERNAME      STATE PROCS  REMAINING            STARTTIME
+
...
+
 
+
For more detailed information about the queued jobs use
+
 
+
$ showq -i
+
+
eligible jobs----------------------
+
JOBID                PRIORITY  XFACTOR  Q  USERNAME    GROUP  PROCS    WCLIMIT    CLASS      SYSTEMQUEUETIME
+
...
+
 
+
A more general listing of queue information can also be obtained using qstat as follows:
+
 
+
$ qstat
+
Job id                    Name            User            Time Use S Queue
+
------------------------- ---------------- --------------- -------- - -----
+
...
+
 
+
 
+
Once that the queue has been explored, further details about specific jobs can be obtained to provide more information in the task of estimating queue time. In many instances it is useful to filter the jobs displayed to only show jobs with specific characteristics that relate to a job type of interest. For instance, all of the queued mpi jobs can be listed by calling:
+
 
+
$ sqjobs -aaq --queue mpi
+
  jobid    user queue state ncpus  time command
+
------- -------- ----- ----- ----- ------ -------
+
...
+
 
+
Note that the --queue option to sqjobs , beyond filtering to the standard serial, threaded, mpi and gpu queues, can also filter the output for jobs in specific NRAP queues. This can be particularly important information in the task of managing use within resource allocation projects.
+
 
+
Once that specific jobs have been identified in the queue that share resource requests with the type of job that you would like to get queue time estimates for (e.g. 32 process mpi job) you can obtain more details about specific jobs by calling:
+
 
+
$ sqjobs -l [jobid]
+
key                value
+
------------------ -----
+
jobid:            ...
+
queue:            ...
+
ncpus:            ...
+
nodes:            ...
+
command:          ...
+
working directory: ...
+
out file:          ...
+
state:            ...
+
submitted:        ...
+
started:          ...
+
should end:        ...
+
elapsed:          ...
+
cpu time:          ...
+
virtual memory:    ...
+
real/virt mem:    ...
+
 
+
+
  jobid    user queue state ncpus  time command
+
------- -------- ----- ----- ----- ------ -------
+
...
+
 
+
... or further calling:
+
 
+
$ qstat -f [jobid]
+
Job Id: ...
+
    Job_Name = ...
+
    Job_Owner = ...
+
    resources_used.cput = ...
+
    resources_used.mem = ...
+
    resources_used.vmem = ...
+
    resources_used.walltime = ...
+
    job_state = ...
+
    queue = ...
+
    server = ...
+
    Account_Name = ...
+
    Checkpoint = ..
+
    ctime = ...
+
    Error_Path = ...
+
    exec_host = ...
+
    Hold_Types = ...
+
    Join_Path = ...
+
    Keep_Files = ...
+
    Mail_Points = ...
+
    mtime = ...
+
    Output_Path = ...
+
    Priority = ...
+
    qtime = ...
+
    Rerunable = ...
+
    Resource_List.cput = ...
+
    Resource_List.procs = ...
+
    Resource_List.pvmem = ...
+
    Resource_List.walltime = ...
+
    session_id = ...
+
    Shell_Path_List = ...
+
    etime = ...
+
    submit_args = ...
+
    start_time = ...
+
    Walltime.Remaining = ...
+
    start_count = ...
+
    fault_tolerant = ...
+
    submit_host = ...
+
    init_work_dir = ...
+
 
+
 
+
Even though there is rich information to use from the scheduling queue to use towards building estimates of future job wait times there is no way estimate queue wait times with certainty as the scheduling queue if a very dynamic process in which influential properties change on every scheduling cycle. Further there are many parameters to consider not only of the jobs currently queued and running but also on thr priority ranking of the submitting user and group.
+
 
+
Another way to minimize your queue waittime is to submit smaller jobs.  Typically it is harder for the scheduler to free up resources for larger jobs (in terms of number of cpus, number of nodes, and memory per process), and as such smaller jobs do not wait as long in the queue.  The best approach is to measure the scaling efficiency of your code to find the ''sweet spot'' where your job finishes in a reasonable amount of time but waits for the least amount of time in the queue.  Please see this [[Measuring_Parallel_Scaling_Performance|tutorial]] for more information on parallel scaling performance and how to measure it effectively.
+
  
 
==== What determines my job priority relative to other groups?  ====
 
==== What determines my job priority relative to other groups?  ====
  
The priority of different jobs on the systems is ranked according to the usage by the entire group, across SHARCNET. This system is called <i>Fairshare</i>.
+
The priority of different jobs on the systems is ranked according to the usage by the entire group. This system is called <i>Fairshare</i>. More detail is available [https://docs.computecanada.ca/wiki/Job_scheduling_policies here].
 
+
Fairshare is based on a measure of recent (currently, past 2 months) resource usage. All user groups are ranked into 5 priority levels, with the heaviest users given lowest priority. You can examine your group's recent usage and priority here: [https://www.sharcnet.ca/my/profile/mysn Research Group's Usage and Priority].
+
 
+
This system exists to allow for new and/or light users to get their jobs running without having to wait in the queue while more resource consuming groups monopolize the systems.
+
  
 
==== Why did my job get suspended? ====
 
==== Why did my job get suspended? ====
Line 658: Line 136:
 
==== My job cannot allocate memory ====
 
==== My job cannot allocate memory ====
  
The default memory is usually 2G on most clusters. If your job requires more memory and is failing with a message "Cannot allocate memory", you should try adding the ""--mpp=4g" flag to your sqsub command, with the value (in this case 4g - 4 gigabytes) set large enough to accommodate your job.
+
If you did not specify the amount of memory your job needs when you submitted the job, resubmit the job specifying the amount of memory it needs.
 
+
Memory is a limited resource, so jobs requesting more memory will likely wait longer in the queue before running.  Hence, it is to the user's advantage to provide an accurate estimate of the memory needed.
+
 
+
Let us say your matlab program is called main.exe, and that you'd like to log your output in main.out ; to submit this job for 5 hours you'd use sqsub like:
+
 
+
sqsub -o main.out -r 5h ./main.exe
+
 
+
By default it will be attributed an amount of memory dependent on which system you are using (1GB on orca). To increase the amount of memory to 2GB, for example, add "--mpp=2G":
+
 
+
sqsub --mpp=2G -o main.out -r 5h ./main.exe
+
 
+
If that still doesn't work you can try increasing it further.
+
 
+
Furthermore, you can change the requested memory for a '''queued''' job with the command ''qalter'' (in this example to 5 GB):
+
 
+
qalter -l pvmem=5160m jobID
+
  
where jobID would be replaced by the actual ID of a job.
+
If you specifyed the amount of memory your job needed when it was submitted, then the memory requested was completely consumed. Resubmit your job with a larger memory request. (If this exceeds the available memory desired, then you will have to make your job use less memory.)
  
 
==== Some specific scheduling idiosyncrasies:====
 
==== Some specific scheduling idiosyncrasies:====
Line 682: Line 144:
 
One problem with cluster scheduling is that for a typical mix of job types (serial, threaded, various-sized MPI), the scheduler will rarely accumulate enough free CPUs at once to start any larger job. When an job completes, it frees N cpus. If there's an N-cpu job queued (and of appropriate priority), it'll be run. Frequently, jobs smaller than N will start instead. This may still give 100% utilization, but each of those jobs will complete, probably at different times, effectively fragmenting the N into several smaller sets. Only a period of idleness (lack of queued smaller jobs) will allow enough cpus to collect to let larger jobs run.
 
One problem with cluster scheduling is that for a typical mix of job types (serial, threaded, various-sized MPI), the scheduler will rarely accumulate enough free CPUs at once to start any larger job. When an job completes, it frees N cpus. If there's an N-cpu job queued (and of appropriate priority), it'll be run. Frequently, jobs smaller than N will start instead. This may still give 100% utilization, but each of those jobs will complete, probably at different times, effectively fragmenting the N into several smaller sets. Only a period of idleness (lack of queued smaller jobs) will allow enough cpus to collect to let larger jobs run.
  
Requin is intended to enable "capability", or very large jobs. Rather than eliminating the ability to run more modest job sizes, Requin is configured with a weekly cycle: every Monday at noon, all previously running jobs will have finished and large queued jobs can start. One implication of this is that no job over 1 week can be run (and a 1-week job will only have one chance per week to start). Shorter jobs can be started at any time, but only a 1-day job can be started on Sunday, for instance.
+
Note that clusters enforce runtime limits - if the job is still running at the end of the stated limit, it will be terminated.  Note also that when a job is suspended (preempted), this runtime clock stops: suspended time doesn't count, so it really is a limit on "time spent running", not elapsed/wallclock time.
 
+
Note that all clusters now enforce runtime limits - if the job is still running at the end of the stated limit, it will be terminated.  Note also that when a job is suspended (preempted), this runtime clock stops: suspended time doesn't count, so it really is a limit on "time spent running", not elapsed/wallclock time.
+
 
+
Finally, when running DDT or OPT (debugger and profiler), it's normal to use the test queue. If you need to run such jobs longer than 1 hour, and find the wait times too high when using the normal queues, let us know (open a ticket). It may be that we need to provide a special queue for these uses - possibly preemptive like the test queue.
+
  
 
 
 
=== I can't run jobs because I'm overquota? ===
 
 
If you exceed your <b>/work</b> disk quota on our systems you will be placed into a special "overquota" group and will be unable to run jobs.  SHARCNET's disk monitoring system runs periodically (typically O(day)) so if you have just cleaned up your files you may have to wait until it runs again to update your quota status.  One can see their current quota status from the system's point of view by running:
 
 
 
  quota $USER
 
 
If you can't submit jobs even after the system has updated your status it is likely because you are logged into an old shell which still shows you in the overquota unix group.  Log out and back in again and then you should be able to submit jobs.
 
 
If you're cleaning up and not sure how much space you are using on a particular filesystem, then you will want to use the ''du'' command, eg.
 
 
  du -h --max-depth=1 /work/$USER
 
 
This will count space used by each directory in ''/work/$USER'' and the total space, and present it in a human-readable format.
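To see at a glance which subdirectories are using the most space, you can sort that output by size (this assumes GNU ''sort'', whose <tt>-h</tt> flag understands the human-readable sizes printed by ''du''):

  du -h --max-depth=1 /work/$USER | sort -h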
 
 
''For more detailed information please see the [[Using Storage]] article.''
 
 
=== I can't run 'java' on a SHARCNET cluster? ===
 
 
Due to the way memory limits are implemented on the clusters, you will need to specify the maximum memory allocation pool for the Java JVM at the time you invoke it.
 
 
You do this with the -XmxNNNN command-line argument, where NNNN is the desired size of the allocation pool. Note that this number should always be within any memory limits being imposed by the scheduler (on orca compute nodes, that default limit would be 1GB per process).
 
 
The login nodes are explicitly limited to 1GB of allocation for any process, so you will need to run java or javac specifying a maximum memory pool smaller than 1GB. For example:
 
 
Running it normally produces an error:
 
 
orc-login2:~% java
 
Error occurred during initialization of VM
 
Could not reserve enough space for object heap
 
Could not create the Java virtual machine.
 
 
Specifying a small maximum memory allocation instead:
 
 
orc-login2:~% java -Xmx512m
 
Usage: java [-options] class [args...]
 
                      (to execute a class)
 
      or java [-options] -jar jarfile [args...]
 
                      (to execute a jar file)
 
 
where options include:
 
        -d32 use a 32-bit data model if available
 
        ...
 
 
As you can see, explicitly limiting the memory allocation pool to 512MB here has it running as expected.
 

Latest revision as of 15:15, 8 February 2019

Compiling and Running Programs

For information about compiling your programs on orca, graham and other national Compute Canada systems, please see the Installing software in your home directory page on Compute Canada wiki.

For information about how to compile on older SHARCNET systems, see Legacy Systems.

How do I run a program interactively?

For running interactive jobs on graham and other national systems, see Running jobs page on Compute Canada wiki.

If trying interactive jobs on legacy systems, see Legacy Systems.

My application runs on Windows, can I run it on SHARCNET?

It depends. If your application is written in a high-level language such as C, C++ or Fortran and is system independent (meaning it does not depend on third-party libraries that are available only for Windows), then you should be able to recompile and run it on SHARCNET systems. However, if your application depends on Windows-specific software, it will not run on the Linux compute nodes. In general it is impossible to convert code at the binary level between Windows and any UNIX platform. For options relating to running Windows in virtual machines, see the Creating a Windows VM page at the Compute Canada Wiki.

My application runs on Windows HPC clusters, can I run it on SHARCNET clusters?

If your application does not use any Windows-specific APIs, you should be able to recompile and run it on SHARCNET's UNIX/Linux based clusters.

My program needs to run for more than seven (7) days; what can I do?

The seven-day run-time limit on legacy systems cannot be exceeded. This is done primarily to encourage the practice of checkpointing, but it also prevents users from monopolizing large amounts of resources outside of dedicated allocations with long-running jobs, ensures that jobs free up nodes often enough for the scheduler to start large jobs in a modest amount of time, and allows us to drain all systems for maintenance within a reasonable time-frame.

In order to run a program that requires more than this amount of wall-clock time, you will have to make use of a checkpoint/restart mechanism so that the program can periodically save its state and be resubmitted to the queues, picking up from where it left off. It is crucial to store checkpoints so that one can avoid lengthy delays in obtaining results in the event of a failure. Investing time in testing and ensuring that one's checkpoint/resume works properly is inconvenient but ensures that valuable time and electricity are not wasted unduly in the long run. Redoing a long calculation is expensive.

Although it is encouraged to always use checkpointing for long-running workloads, there are a small number of nodes available for 28-day run times on the national general purpose systems Graham and Cedar.
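If your workflow qualifies for those nodes, the longer walltime is requested in the usual way in your job script; whether such a request is accepted depends on the limits of the partition your job lands in, for example:

 #SBATCH --time=28-00:00     # 28 days, in Slurm's days-hours:minutes format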

Handling long jobs with chained job submission

On systems that use the Slurm scheduler (e.g. Orca and Graham), job dependencies can be implemented so that the start of one job is contingent on the successful completion of another. This is expressed via the optional dependency argument to sbatch, which can be given as a directive in the job submission script:

    #SBATCH --dependency=afterok:<jobid>
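One way to chain two jobs from the command line is to capture the job ID of the first submission and pass it to the second; the script names job_step1.sh and job_step2.sh here are only placeholders:

 # Submit the first job and capture its numeric job ID.
 jobid=$(sbatch --parsable job_step1.sh)
 # Submit the second job so that it starts only if the first one completes successfully.
 sbatch --dependency=afterok:$jobid job_step2.sh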

Other strategies for resubmitting jobs for long running computations on the Slurm scheduled systems are described on the Compute Canada Wiki.

How do I checkpoint/restart my program?

Checkpointing is a valuable strategy that minimizes the loss of valuable compute time should a long running job be unexpectedly killed by a power outage, node failure, or hitting its runtime limit. On the national systems checkpointing can be accomplished manually by creating and loading your own custom checkpoint files or by using the Distributed MultiThreaded CheckPointing (DMTCP) software without having to recompile your program. For further documentation of the checkpointing and DMTCP software see the Checkpoints page at the Compute Canada Wiki site.

If your program is MPI based (or any other type of program requiring a specialized job starter to get it running), it will have to be coded specifically to save state and restart from that state on its own. Please check the documentation that accompanies any software you are using to see what support it has for checkpointing. If the code has been written from scratch, you will need to build checkpointing functionality into it yourself---output all relevant parameters and state such that the program can be subsequently restarted, reading in those saved values and picking up where it left off.
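As a minimal sketch of the manual approach (the program name ./myprog, its --restart option and the checkpoint file state.chk are hypothetical and stand in for whatever your own code uses), a job script can decide at start-up whether to resume from an earlier checkpoint:

 # Resume from the most recent checkpoint if one exists, otherwise start from scratch.
 if [ -f state.chk ]; then
     ./myprog --restart state.chk
 else
     ./myprog
 fi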

How can I know when my job will start?

The Slurm scheduler can report expected start times for queued jobs as output from the squeue command. For example, the following command returns the current jobs for user 'username', with columns for job ID, job name, state, start time (N/A if there is no estimate), and node list (or the reason the job is still waiting):

$ squeue -u username -o "%.10i%.24j%.12T%.24S%.24R"
    JOBID                    NAME       STATE              START_TIME        NODELIST(REASON)
 12345678                  mpi.sh     PENDING                     N/A              (Priority)

It is important to note that the estimated start time listed in the START_TIME column (if available) can change substantially over time. The estimate is based on the current state of the compute nodes and the list of jobs in the queue, both of which are constantly changing, so the estimate for a pending job can move for several reasons (running jobs end sooner than expected, higher-priority jobs enter the queue, etc.). For more information regarding the variables that affect wait times in the queue, see the job scheduling policy page at the Compute Canada Wiki site.

Is package X preinstalled on system Y, and, if so, how do I run it?

The software packages that are installed and maintained on the national systems are listed on the available software page of the Compute Canada Wiki site. Some packages have specific documentation for running on the national systems; for those, follow the link in the 'Documentation' column of the globally installed modules table.

For legacy SHARCNET systems the list of preinstalled packages (with running instructions) can be found on the SHARCNET software page.
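You can also query the module system directly from a login node; for example (the package name gromacs is just an illustration):

 module spider gromacs    # search for all available versions of a package
 module load gromacs      # load the default version into your environment
 module list              # show which modules are currently loaded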

Command 'top' gives me two different memory sizes (virt, res). What is the difference between 'virtual' and 'real' memory?

'virt' refers to the total virtual address space of the process, including virtual space that has been allocated but never actually instantiated, memory which was instantiated but has since been swapped out, and memory which may be shared. 'res' is memory which is actually resident - that is, instantiated with real RAM pages. Resident memory is normally the more meaningful value, since it can be judged relative to the memory available on the node (recognizing, of course, that the memory on a node must be divided among the resident pages of all the processes, so an individual process must always strive to keep its working set a little smaller than the node's total memory divided by the number of processors).

There are two cases where the virtual address space size is significant. One is when the process is thrashing - that is, its working set is bigger than available memory. Such a process will spend a lot of time in the 'D' state, since it is waiting for pages to be swapped in or out. A node on which this is happening will show a substantial paging rate in the 'si' column of output from vmstat (the 'so' column is normally less significant, since si/so do not necessarily balance).

The second condition where virtual size matters is that the kernel does not implement RLIMIT_RSS, but does enforce RLIMIT_AS (virtual size). We intend to enforce a sanity-check RLIMIT_AS, and in some cases already do. The goal is to avoid a node becoming unusable or crashing when a job uses too much memory. Current settings are very conservative, though - 150% of physical memory.

As an example from the legacy Silky system: a huge virtual size relative to resident size there was almost certainly due to the way Silky implemented MPI using shared memory. Such memory is counted as part of every process involved, but it obviously does not mean that N * 26.2 GB of RAM is in use.

In that example, the real memory footprint of each MPI rank was 1.2 GB - running the same code on another cluster without numalink shared memory, both the resident and virtual sizes would be about that much. Since most of our clusters have at least 2 GB per core, such code can run comfortably on other clusters.
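To check the two values for your own processes from the command line, something like the following works on most Linux systems (rss and vsz are reported in kilobytes):

 ps -u $USER -o pid,rss,vsz,comm --sort=-rss | head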

Can I use a script to compile and run programs?

Yes. For instance, suppose you have a number of source files main.f, sub1.f, sub2.f, ..., subN.f. To compile these source files into an executable named myprog, you would typically type the following command

ifort main.f sub1.f sub2.f ... subN.f -llapack -o myprog 

Here, the -o option specifies the executable name myprog rather than the default a.out, and the option -llapack at the end tells the compiler to link your program against the LAPACK library (assuming LAPACK routines are called in your program). If you have a long list of files, typing the above command every time can be tedious. You can instead put the command in a file, say, mycomp, then make mycomp executable by typing the following command

chmod +x mycomp

Then you can just type

./mycomp

at the command line to compile your program.

This is a simple way to minimize typing, but it may wind up recompiling code which has not changed. A widely used improvement, especially for projects with large or many source files, is to use make, which recompiles only those source files that have changed since the last compilation, minimizing the time spent waiting for the compiler. On the other hand, compilers will often produce faster code if they are given all of the sources at once (as above).

I have a program that runs on my workstation, how can I have it run in parallel?

If the program was written without parallelism in mind, then there is very little that you can do to run it automatically in parallel. Some compilers are able to translate certain serial portions of a program, such as loops, into equivalent parallel code, which allows you to exploit the parallel potential of symmetric multiprocessing (SMP) systems. Also, some libraries are able to use parallelism internally, without any change to the user's program. For this to work, your program needs to spend most of its time in the library, of course - the parallel library does not speed up the rest of your program. Examples of this include threaded linear algebra and FFT libraries.

However, to gain true parallelism and scalability, you will need to either rewrite the code using the message passing interface (MPI) library or annotate your program with OpenMP directives. We will be happy to help you parallelize your code if you wish. (Note that OpenMP is inherently limited to a single node or SMP machine - most SHARCNET systems are distributed-memory clusters, so scaling beyond a single node requires MPI.)

Also, the preceding answer pertains only to the idea of running a single program faster using parallelism. Often, you might want to run many different configurations of your program, differing only in a set of input parameters. This is common when doing Monte Carlo simulation, for instance. It is usually best to start out doing this as a series of independent serial jobs; implementing this kind of loosely-coupled parallelism with MPI is possible, but often less efficient and more difficult.
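On Slurm systems, a convenient way to run many such independent configurations is a job array; the script name sweep.sh and the input file naming scheme below are only an illustration:

 # Submit 100 independent tasks; each one sees a different SLURM_ARRAY_TASK_ID.
 sbatch --array=1-100 sweep.sh
 # Inside sweep.sh, pick the input for this task, e.g.:
 #   ./myprog -i input_${SLURM_ARRAY_TASK_ID}.dat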

Where can I find available resources?

Information about available computational resources is available to the public.

Can I find my job submission history?

Yes. For SHARCNET-maintained legacy systems, you can review your submission history by logging in to your SHARCNET web account.

For national Compute Canada systems and systems running Slurm, you can see your job submission history from a specific date YYYY-MM-DD by running the following command:

 sacct --starttime YYYY-MM-DD --format=User,JobID%15,Jobname%25,partition%25,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist

where YYYY-MM-DD is replaced with the appropriate date.

How many jobs can I submit on one cluster?

Currently Graham has a limit of 1000 submitted jobs per user.

How are jobs scheduled?

Job scheduling is the mechanism which selects waiting jobs ("queued") to be started ("dispatched") on nodes in the cluster. On all of the major SHARCNET production clusters, resources are "exclusively" scheduled, so that a job has complete access to the CPUs, GPUs or memory on which it is running (it may be pre-empted during the course of its execution, as noted below). Details as to how jobs are scheduled follow below.

How long will it take for my queued job to start?

On national Compute Canada systems and other systems running Slurm, you can see the estimated time at which your queued jobs will start by running:

 squeue --start -u USER

and replace USER with the name of the account that submitted the job.

What determines my job priority relative to other groups?

The priority of different jobs on the systems is ranked according to the usage by the entire group. This system is called Fairshare. More detail is available here.
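On Slurm systems you can get a rough picture of your group's current standing with Slurm's sshare utility (where available), for example:

 sshare -l     # show your accounts, their recent usage and fair-share factors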

Why did my job get suspended?

Sometimes your job may appear to be in a running state, yet nothing is happening and it isn't producing the expected output. In this case the job has probably been suspended to allow another job to run in its place briefly.

Jobs are sometimes preempted (put into a suspended state) if another higher-priority job must be started. Normally, preemption is triggered only by "test" jobs, which are fairly short (always less than 1 hour). After being preempted, a job will be automatically resumed (and the intervening period is not counted as usage).

On contributed systems, the PI who contributed equipment and their group have high-priority access and their jobs will preempt non-contributor jobs if there are no free processors.

My job cannot allocate memory

If you did not specify the amount of memory your job needs when you submitted the job, resubmit the job specifying the amount of memory it needs.

If you specified the amount of memory your job needed when it was submitted, then the memory requested was completely consumed. Resubmit your job with a larger memory request. (If the amount needed exceeds the memory available on the nodes, then you will have to make your job use less memory.)
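On Slurm systems the memory request is made at submission time; the 4G and 2G values and the script name job.sh below are only illustrations:

 sbatch --mem=4G job.sh            # request 4 GB per node for the job
 sbatch --mem-per-cpu=2G job.sh    # or request memory per allocated core instead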

Some specific scheduling idiosyncrasies:

One problem with cluster scheduling is that for a typical mix of job types (serial, threaded, various-sized MPI), the scheduler will rarely accumulate enough free CPUs at once to start any larger job. When a job completes, it frees N CPUs. If there is an N-CPU job queued (and of appropriate priority), it will be run. Frequently, however, jobs smaller than N will start instead. This may still give 100% utilization, but each of those smaller jobs will complete, probably at different times, effectively fragmenting the N CPUs into several smaller sets. Only a period of idleness (a lack of queued smaller jobs) will allow enough CPUs to accumulate to let larger jobs run.

Note that clusters enforce runtime limits - if the job is still running at the end of the stated limit, it will be terminated. Note also that when a job is suspended (preempted), this runtime clock stops: suspended time doesn't count, so it really is a limit on "time spent running", not elapsed/wallclock time.

How do I run the same command on multiple clusters simultaneously?

If you're using bash and can log in with SSH authentication agent connection forwarding enabled (the -A flag; i.e. you've set up ssh keys; see Choosing_A_Password#Use_SSH_Keys_Instead.21 for a starting point), add the following environment variable and function to your ~/.bashrc shell configuration file:

~/.bashrc configuration: multiple cluster command
export SYSTEMS_I_NEED="graham.computecanada.ca orca.computecanada.ca"

function clusterExec {
  for clus in $SYSTEMS_I_NEED; do
    # Check that the login node answers a ping before trying to ssh to it.
    if ping -q -w 1 "$clus" &> /dev/null; then
      echo ">>> $clus:"; echo ""
      ssh "$clus" ". ~/.bashrc; $1"
    else
      echo ">>> $clus down"; echo ""
    fi
  done
}

You can select the relevant systems in the SYSTEMS_I_NEED environment variable.

To use this function, reset your shell environment (ie. log out and back in again), then run:

clusterExec uptime

You will see the uptime on the cluster login nodes, otherwise the cluster will appear down.

If you have stale host keys you'll have to clean out the affected entries in your ~/.ssh/known_hosts file; they will be repopulated with the new keys the next time you connect. If you suspect a problem, contact an administrator for key validation or email help@sharcnet.ca. For more information see Knowledge_Base#SSH_tells_me_SOMEONE_IS_DOING_SOMETHING_NASTY.21.3F.
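A single stale entry can be removed without editing the file by hand; the hostname here is only an example:

 ssh-keygen -R graham.computecanada.ca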

How do I load different modules on different clusters?

SHARCNET-maintained systems provide the following environment variables:

  • $CLUSTER, which is the system's hostname (without sharcnet.ca or computecanada.ca), and
  • $CLU, which resolves to a three-character identifier that is unique for each system (typically the first three letters of the cluster's name).

You can use these in your ~/.bashrc to load certain software on a particular system, but not others. For example, you can create a case statement in your ~/.bashrc shell configuration file based on the value of $CLU:

~/.bashrc configuration: loading different modules on different systems
case $CLU in
  orc)
    # load 2014.6 Intel compiler...
    module unload intel
    module load intel/2014.6
  ;;
  gra)
    # load 2018.3 Intel compiler...
    module load intel/2018.3
  ;;
  *)
    # This runs if nothing else matched.
  ;;
esac
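Instead of a case statement, you can also load or unload software conditionally with one-line tests on $CLU in your ~/.bashrc; the module names here are only illustrations:

 # Load gromacs only on graham:
 if [ "$CLU" == 'gra' ]; then module load gromacs; fi
 # Load octave on any system except graham:
 if [ "$CLU" != 'gra' ]; then module load octave; fi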