Revision as of 13:16, 1 February 2019 by Syam (Talk | contribs)



In this section we describe our suite of scripts designed to fully automate throughput computing (serial/parallel farming) in SHARCNET. The same set of scripts can be used with little or no modification to organize almost any type of farming workflow, including

  • either "one case per job" mode or "many cases per job" mode, with dynamic workload balancing;
  • capturing the exit status of all individual jobs;
  • the ability to automatically resubmit all the jobs which failed or never ran;
  • the support for serial, multi-threaded (OpenMP), MPI, CUDA etc. farming.

All serial farming jobs to be computed have to be described as separate lines (one line per job) in the file table.dat.

Small number of cases


Let's call a single execution of the code in a serial/parallel farm a “case”. When the total number of cases, N_cases, is fairly small (say, <500) it is convenient to dedicate a separate job to each case.

We created a set of BASH scripts utilizing both the “one case – one job” approach (described in this section; works well when the number of jobs is <500 or so) and the “many cases per job” approach (described in #Large number of cases; best for larger numbers of cases). The scripts can be found on Graham in the following directory:


The three essential scripts are “”, “”, and “”.

“” script

“” has one obligatory command line argument: the number of jobs to submit, which in the “one case per job” mode should be “-1”, e.g.

   [user@gra-login2 ~]$ ./ -1

The value of "-1" means "submit as many jobs as there are lines in table.dat".

“” script

The other principal script, “”, is one of the two scripts (the other being “”) which might need customization. Its task is to read the corresponding line from table.dat, parse it, and use these data to launch your code for this particular case. The version of the file provided literally executes one full line from the case table (meaning that the line should start with the path to your code, or with the code binary name if the binary is on your $PATH environment variable) in a separate subdirectory, RUNyyy (yyy being the case number).


# ++++++++++++  This part can be customized:  ++++++++++++++++
#  $ID contains the case id from the original table
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
mkdir RUN$ID
cd RUN$ID
# Executing the command:
$COMM &> out.log
# Exit status of the code:
STATUS=$?
cd ..
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Your table.dat can look like this:

 /home/user/bin/code1  1.0  10  2.1
 ./code2 < IC.2
 sleep 10m

In other words, any executable statement which can be written on one line can go there. Note: “” will modify your cases table once (it will add the line number at the beginning of each line, if you didn't do it yourself). The file table.dat can be used either way (with or without the first case_ID column) - this is handled automatically.
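The effect of the automatic numbering just described can be illustrated with the following sketch (the real script's mechanics may differ; awk is used here purely to show the before/after state of the table):

```shell
# Build a toy cases table, then prepend a case ID to each line, as the script does:
printf '/home/user/bin/code1 1.0 10 2.1\nsleep 10m\n' > table.dat
awk '{print FNR, $0}' table.dat > table.tmp && mv table.tmp table.dat
cat table.dat
```

After this, each line starts with its case ID, and the rest of the line is the executable statement for that case.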

Handling code exit status

What is “$STATUS” for in “”? It is a shell variable which should be set to “0” if your case was computed correctly, and to a positive value otherwise. This is very important: it is used by “” to figure out which cases failed, so they can be re-computed. In the provided version of “”, $STATUS only reflects the exit code of your program. This likely won't cover all potential problems. (Some codes do not exit with a non-zero status even if something went wrong.) You can always change or augment the $STATUS derivation in “”. E.g., if your code is supposed to create a new non-empty file (say, “out.dat”) at the very end of each case run, the existence of such a non-empty file can be used to judge whether the case failed:

  if test ! -s out.dat; then STATUS=1; fi

In the above example, $STATUS will be positive if either the code's exit status is positive, or the "out.dat" file doesn't exist or is empty.
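Putting the two checks together, a minimal self-contained sketch of such a combined status derivation looks as follows (a toy command stands in for the real case command, and "out.dat" is the assumed output file name from the example above):

```shell
COMM="date"           # stands in for the real case command from table.dat
$COMM &> out.dat      # here the toy "code" writes its output file directly
STATUS=$?             # exit code of the code itself
# Treat a missing or empty output file as a failure as well:
if test ! -s out.dat; then STATUS=1; fi
echo "STATUS=$STATUS"
```

Here $STATUS ends up positive if the command fails, or if it exits cleanly but leaves no non-empty output file behind.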

Output files

The “” script will generate some files in the current directory:

  • slurm-jobid.out files (one file per meta-job): standard output from jobs;
  • status.jobid files (one file per meta-job): files containing the statuses of processed cases.

In both cases, jobid stands for the jobid of the corresponding meta-job.

Also, every "" script execution will create a unique subdirectory inside "/home/$USER/tmp" (which will be created if it doesn't exist). Inside that subdirectory, some small scratch files will be created (such as the files used by the "lockfile" command to serialize certain operations inside the jobs - see #Meta-cluster ("Grid") mode). These subdirectories have names "NODE.PID", where "NODE" is the name of the current node (typically a login node), and "PID" is the unique process ID of the script. Once the farm execution is done, one can safely erase this subdirectory.

Users normally don't need to access these files directly.

Auxiliary scripts

Other auxiliary scripts are also provided for your convenience.

  • “” will list all the jobs with their current state for the serial/parallel farm (no arguments).
  • “” will provide a one line summary (number of queued / running / done jobs) in the farm, which is more convenient than using “” when the number of jobs is large. It will also “prune” queued jobs if warranted (see below).
  • “”: will kill all the running/queued jobs in the farm.
  • “”: will only kill queued jobs.
  • “” (capital “S”!) will list statuses of all processed cases.
  • "": will delete all the files in the current directory (including subdirectories, if any are present), except for *.run scripts and test*.dat files. Be very careful with this script! Note: the script will not restore *.run scripts to their default state (for that, you'd need to copy the *.run scripts again from the /home/syam/Serial_farming/META directory).

Resubmitting failed/not-run jobs

Finally, script “” is run the same way as “”, e.g.:

   [user@orc-login2 ~]$  ./  /work/user/dir/cases.dat  -1


  • It will analyze all the status.* files (#Output files);
  • figure out which cases failed and which never ran for whatever reason (e.g. because of the 7d runtime limit);
  • create a new case table (with “_” appended to the original table name) listing only the cases which still need to be run;
  • use “” internally to launch a new farm for the unfinished/failed jobs.
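The step of picking out the failed cases can be sketched as follows (the per-line format of the status.* files - "case_ID exit_status" - is an assumption made for this illustration; the real files may be organized differently):

```shell
# Toy status files from two meta-jobs (case ID, then exit status):
printf '1 0\n2 1\n3 0\n' > status.1001
printf '4 137\n5 0\n'    > status.1002
# Keep the IDs of all cases whose exit status is non-zero:
awk '$2 != 0 {print $1}' status.* > failed.lst
cat failed.lst
```

The resulting failed.lst would then drive the construction of the new (smaller) case table.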

Notes: You won't be able to run “” until all the jobs from the original run are done or killed. If some cases still fail or do not run, one can resubmit the farm again and again, with the same arguments as before:

   [user@orc-login2 ~]$  ./  /work/user/dir/cases.dat  -1

Of course, if certain cases persistently fail, then there must be a problem with either your initial conditions parameters or files, or with your code (a code bug). It is convenient to use the script "" (capital S!) here to see a sorted list of statuses for all computed cases.

Meta-cluster ("Grid") mode

Previous material described the situation when you run your farm on a single cluster. This is the default behaviour when the current directory doesn't have file “clusters.lst”. If the file “clusters.lst” is present and is not empty, the scripts switch to the “meta-cluster” mode.

A proper “clusters.lst” file should contain a list of clusters which you wish to use for your serial farm, one cluster per line. E.g.:


The current (local) cluster doesn't have to be in the list.

When you both use the “-1” argument (one case per job mode) and provide a “clusters.lst” file, “” and “” will submit 2*N_cases jobs across all the clusters in the list. Only the first N_cases jobs to start will do the computations; the rest will get pruned or die instantly.

This is how it works: The very first job which runs (on any cluster) will request a case to process (case #1 initially). The second job to run will ask for the next available case (#2), and so on. To accomplish this, certain operations need to be serialized (only one job at a time can do it):

  • Read “currently served” case number from a file.
  • Update it (add “1”), and write it back to the file.

To serialize file-related operations, you need a special binary program, “lockfile”. It is included in the META package. It should be placed in a directory listed in your $PATH environment variable. You can accomplish this by

   [user@orc-login2 ~]$ mkdir ~/bin
   [user@orc-login2 ~]$ cp lockfile ~/bin
   [user@orc-login2 ~]$ echo 'export PATH=/home/$USER/bin:$PATH' >> ~/.bashrc
   [user@orc-login2 ~]$ . ~/.bashrc

Some clusters listed in your “clusters.lst” file might not be usable - either not accessible (ssh timeout), or with an unresponsive scheduler (sqsub never returns). Our scripts deal with such problems intelligently:

  • The script will first try to submit one job to each cluster listed in “clusters.lst”.
  • Any cluster where it takes more than 10s to either ssh to the cluster or to submit a job there with sqsub, will be automatically deleted from “clusters.lst”, and will not be used in the current and subsequent farms launched from this directory (until you add the cluster name again to the file "clusters.lst").
  • All the relevant numbers (number of jobs per cluster etc.) will be adjusted on the spot, so you'll get the requested number of jobs even if one or more clusters time out (you'll just end up submitting more jobs per cluster).
  • If all but one cluster in the list time out, the scripts will automatically switch from the "meta-cluster" mode to the regular (single cluster, no overcommitted jobs) mode.
  • If all the clusters in "clusters.lst" time out, the script will exit with a warning and non-zero exit code.

What is the advantage of using the "grid" mode, as opposed to running your farm on a single cluster? SHARCNET has quite a few clusters of different sizes. Many of these clusters are largely overlooked by users, either because they are smaller, or because they are contributed systems, where regular users' jobs have a lower priority than the jobs from contributors. This results in our largest cluster, orca, being much busier than most other systems (also because orca is heavily subscribed by NRAC projects, which have the highest priority), so your serial farming (and any other) jobs will spend a much longer time waiting in orca's queue than they would on some other clusters.

It is not easy to know a priori which of the clusters would be able to run your serial farming jobs soonest. The "meta-cluster" mode of our farming scripts solves this problem empirically: it submits more jobs than you plan to run, to multiple clusters, and once the least busy clusters have launched enough of your jobs, the rest get "pruned" (removed from the scheduler). So you get your results sooner, both because you are detecting empirically the least busy clusters, and because you run jobs on multiple clusters, many of which are rather small and wouldn't be useful if you tried to run the whole farm on just one of them.

Large number of cases


The “one case per job” approach works fine when the number of cases is fairly small (<500). When N_cases >> 500, the following problems arise:

  • There is a limit on the number of jobs submitted (for orca, it is 5000).
  • Job submission becomes very slow. (With 5000 jobs and ~2s per job submission, the submission will last ~3 hours!).
  • With a very large number of cases, each case run is typically short. If one case runs for <30 min, you start wasting cpu cycles due to scheduling overheads (~30s per job).

The solution: instead of submitting a separate job for each case, one should submit a smaller number of "meta-jobs", each of which processes multiple cases. As cases can take different amounts of time to process, it is highly desirable to utilize a dynamic workload balancing scheme here.

This is how it is implemented:


As the above diagram shows, the "" script in the "many cases per job" mode will submit N jobs, with N being a fairly small number (much smaller than the number of cases to process). Each job executes the same script - "". Inside that script, there is a "while" loop over the cases. Each iteration of the loop has to go through a serialized portion of the code (only one job at a time can be inside it), where it figures out which case (if any) to process next. Then the already familiar script "" (see above) is executed - once per case - which in turn calls the user code.
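The case loop just described can be sketched as follows (all names are illustrative; here next_case is a local, non-serialized stub standing in for the serialized counter update, and a toy three-case table makes the sketch self-contained):

```shell
# Toy setup: a three-case table and a stub replacing the serialized counter update:
printf 'sleep 0\nsleep 0\nsleep 0\n' > table.dat
echo 0 > counter
next_case() { N=$(($(cat counter) + 1)); echo $N > counter; echo $N; }

# The meta-job's case loop:
while true; do
  ID=$(next_case)                     # which case should this meta-job process next?
  COMM=$(sed -n "${ID}p" table.dat)   # the corresponding line of the case table
  if test -z "$COMM"; then break; fi  # no cases left - the meta-job exits
  $COMM &> /dev/null                  # run the case (cf. the single-case script)
  echo "case $ID done"
done
```

Every running meta-job executes this same loop, so faster jobs simply claim more cases, which is what produces the dynamic workload balancing.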

This approach results in dynamic workload balancing achieved across all the running "meta-jobs" belonging to the same farm. This can be seen more clearly in the diagram below:


The dynamic workload balancing results in all meta-jobs finishing around the same time, regardless of how different the runtimes are for individual cases, regardless of how fast CPUs are on different nodes, and regardless of whether all "meta-jobs" start at the same time (as the above diagram shows), or start at different times (which would normally be the case). In addition, this approach is very robust: not all meta-jobs need to start running for all the cases to be processed; if a meta-job dies (due to a node crash), at most one case will be lost. (The latter can be easily rectified by running the "" script; see #Resubmitting failed/not-run jobs.)

To enable the “multiple cases per job” mode (with dynamic workload balancing), the second argument to “” script should be the desired number of “” jobs, e.g.:

   [user@orc-login2 ~]$  ./  /work/user/dir/cases.dat  32

Not all of the requested meta-jobs will necessarily run (this depends on how busy the cluster(s) are). But as described above, in the "many cases per job" mode you will eventually get all your results regardless of how many meta-jobs will run. (You might need to run "", sometimes more than once, to complete particularly large serial farms).

If the file “clusters.lst” is present, listing multiple clusters one per line (which enables the "meta-cluster" mode - see #Meta-cluster ("Grid") mode), the number of submitted jobs will be twice the above number, but limited to 256*2=512 jobs.
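The job-count arithmetic just described can be sketched as follows (an assumption about the exact logic, shown for illustration):

```shell
N=300                 # requested number of meta-jobs
NJOBS=$((2 * N))      # meta-cluster mode doubles the request...
if [ $NJOBS -gt 512 ]; then NJOBS=512; fi   # ...but caps it at 256*2=512
echo $NJOBS
```

So a request for 300 meta-jobs would hit the cap, while a request for 32 would yield 64 submitted jobs.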

Runtime problem

Here is one potential problem when one is running multiple cases per job, utilizing dynamic workload balancing: what if the number of running meta-jobs times the maximum allowed runtime per job in SHARCNET (7 days) is not enough to process all your cases? E.g., suppose you managed to start the maximum allowed number of running jobs in SHARCNET, 256, each with the 7 day runtime limit. That means your serial farm can only process all the cases in a single run if N_cases x average_case_runtime < 256 x 7d = 1792 cpu days. (In less perfect cases, you will be able to run fewer than 256 meta-jobs, resulting in an even smaller number of cpu days your farm can process.) Once your meta-jobs start hitting the 7d runtime limit, they will start dying in the middle of processing one of your cases. This will result in up to 256 interrupted case calculations. This is not a big deal in terms of accounting (the "" script will find all the cases which failed or never ran, and will resubmit them automatically), but it can become a waste of cpu cycles, because many of your cases are dying half-way through. On average, you will be wasting 0.5 x N_jobs x average_case_runtime cpu days. E.g. if your cases have an average runtime of 2 days and you have 256 meta-jobs running, you will waste ~256 cpu days, which would not be acceptable.

Fortunately, the scripts we are providing have some built-in intelligence to mitigate this problem. This is implemented in the "" script as follows:

  • The script measures runtime of each case, and adds the value as one line in a scratch file "times" created inside /home/$USER/tmp/NODE.PID directory (see #Output files). This is done by all running meta-jobs, on all clusters involved.
  • Once the first 8 cases have been computed, one of the meta-jobs will read the contents of the file "times" and compute the upper 12.5% quantile of the current distribution of case runtimes. This will serve as a conservative estimate of the runtime for your cases, t_runtime.
  • From now on, each meta-job will estimate if it has the time to finish the case it is about to start computing, by ensuring that t_finish - t_now > t_runtime. (Here t_finish is the time when the job will die because of the 7d runtime limit; t_now is the current time.) If it thinks it doesn't have the time, it will exit early, which will minimize the chance of a case computation aborted half-way due to the 7d runtime limit.
  • At every subsequent power-of-two number of computed cases (8, then 16, then 32, and so on), t_runtime is recomputed using the above algorithm. This will make the t_runtime estimate more and more accurate. Powers of two are used to minimize the overheads related to computing t_runtime; the algorithm will be equally efficient for both very small (tens) and very large (many thousands) numbers of cases.
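One plausible way to compute such a conservative runtime estimate (the scripts' exact formula may differ from this sketch) is to take the runtime below which 87.5% of the measured cases fall, i.e. the entry at rank N - N/8 in the sorted list of runtimes:

```shell
# Toy "times" file: measured case runtimes in seconds, one line per finished case:
printf '10\n12\n11\n95\n13\n12\n11\n14\n' > times
Ncases=$(wc -l < times)
i=$((Ncases - Ncases / 8))              # rank of the 87.5th-percentile entry
t_runtime=$(sort -n times | sed -n "${i}p")
echo "t_runtime=$t_runtime"
```

With these toy numbers the estimate lands just below the single 95 s outlier, so a meta-job with less than t_runtime seconds left before its 7d limit would decline to start another case.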

Simple usage scenario

Additional information

Passing additional sqsub arguments

What if you need to use additional sqsub arguments (like --mpp, -q mpi, -q threaded, -n etc.)? Simple: just add all those arguments at the end of “” and “” command line, and they will be passed to sqsub, e.g.:

   [user@orc-login2 ~]$  ./  test.dat  -1  --mpp 4G

You can also override the default job runtime value (encoded to be 7 days in “”), by adding a “-r” argument.

Multi-threaded farming

How to do “multi-threaded farming” (OpenMP etc.)? Add these sqsub arguments to “(re)”:

   -q threaded -n N

Here “N” is the number of cpu cores/threads to use. Nothing special needs to be done inside “”: run the code as in the serial case.

MPI farming

What about “MPI farming”? Use these sqsub arguments with “(re)”:

   -q mpi  -n N  --nompirun

and add “mpirun” before the path to your code inside “”, e.g.:

   mpirun  $COMM &> out.log

Alternatively, you can prepend “mpirun” on each line of your cases table:

  mpirun /path/to/mpi_code arg1 arg2
  mpirun /path/to/mpi_code arg1 arg2
  mpirun /path/to/mpi_code arg1 arg2

FORTRAN code example: using standard input

You have a FORTRAN serial code, “fcode”; each case needs to read a separate file from standard input – say “” (in the /work/user/IC directory), where xxx goes from 1 to N_cases. Place “fcode” on your $PATH (e.g., put it in ~/bin, making sure /home/$USER/bin is added to $PATH in .bashrc; alternatively, use the full path to your code in the cases table). Create the cases table (inside the META directory) like this:

  fcode < /work/user/IC/data.1
  fcode < /work/user/IC/data.2
  ...
  fcode < /work/user/IC/data.N_cases

The task of creating the table can be greatly simplified if you use a BASH loop command, e.g.:

   [user@orc-login2 ~]$  for ((i=1; i<=10; i++)); do echo "fcode < /work/user/IC/data.$i"; done >table.dat

FORTRAN code example: copying files for each case

Finally, another typical FORTRAN code situation: you need to copy a file (say, /path/to/) to each case subdirectory before executing the code. Your cases table can look like this:


Add one line (first line in the example below) to your “”:

   \cp /path/to/data.$ID .
   $COMM &> out.log

Using all the columns in the cases table explicitly

The examples shown so far presume that each line in the cases table is an executable statement (except for the first column, which is added automatically by the scripts and contains the line number), starting with either the code binary name (when the binary is on your $PATH) or the full path to the binary, followed by the code's command line arguments (if any) particular to that case, or by something like " < input.$ID" if your code expects the initial conditions via standard input.

In the most general case, one wants to have the ultimate flexibility in being able to access all the columns in the table individually. That is easy to achieve by slightly modifying the "" script:

# ++++++++++++  This part can be customized:  ++++++++++++++++
#  $ID contains the case id from the original table
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
mkdir RUN$ID
cd RUN$ID
# Converting $COMM to an array:
COMM=( $COMM )
# Number of columns in COMM:
Ncol=${#COMM[@]}
# Now one can access the columns individually, as ${COMM[i]} , where i=0...$Ncol-1
# A range of columns can be accessed as ${COMM[@]:i:n} , where i is the first column
# to display, and n is the number of columns to display
# Use the ${COMM[@]:i} syntax to display all the columns starting from the i-th column
# (use for codes with a variable number of command line arguments).
# Call the user code here.
# Exit status of the code:
STATUS=$?
cd ..
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For example, you need to provide to your code both an initial conditions file (to be used via standard input), and a variable number of command line arguments. Your cases table will look like this:

  /path/to/IC.1 0.1
  /path/to/IC.2 0.2 10

The way to implement this in "" is as follows:

# Call the user code here.
/path/to/code ${COMM[@]:1} < ${COMM[0]} &> out.log
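A quick self-contained illustration of the array slicing used above, applied to the second line of the example table (after the scripts strip the leading case ID):

```shell
# The parsed case line, as an array:
COMM=( /path/to/IC.2 0.2 10 )
echo "${COMM[0]}"      # the standard-input file
echo "${COMM[@]:1}"    # all the remaining columns: the command line arguments
```

Here ${COMM[0]} expands to "/path/to/IC.2" and ${COMM[@]:1} expands to "0.2 10", which is exactly what the code invocation above passes via standard input and the command line, respectively.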

How to use contributed systems?

Some contributed systems (like brown and redfin) apparently still accept 7 day jobs, so use them as any other cluster. Many contributed systems (perhaps all, in the future) will only accept short – up to 4h – jobs. To take full advantage of the contributed systems, add the “-r 4h” argument to “(re)”, and either use the “-1” mode of “(re)” if your cases each take between 0.5 and 4 hours to run, or use the “many cases per job” mode if each case takes <0.5h.

Word of caution

  • Please use common sense when using the “meta-cluster” mode of the scripts.

Most of the jobs you submit should be expected to run (don't submit many more jobs than you might need, “just in case”). The “$Overcommit_factor” variable defined in “” (=2) was introduced to force you to be reasonably compliant with the above point. After submitting a farm to multiple clusters, regularly check its state with the “” script. It will keep you informed about the overall “health” of your farm, and if things don't look right you can do something about it. As importantly, the “” script will run the “” script internally when it detects that you won't need the jobs which are still queued (because you have already hit the maximum number of running jobs limit).

  • Make sure your code only accesses /work or /home. This is critical when you use the "meta-cluster" mode, as /work and /home are the only SHARCNET's file systems which are accessible from any compute node on any cluster.
  • Always start with a much smaller test farm run, to make sure everything works, before submitting a large production run farm.
  • Stay away from orca (when using meta-cluster mode): it is very busy because of NRAC jobs. Don't include requin (the scripts don't work there).