
Latest revision as of 15:24, 21 August 2019

Overview

The META package is a suite of scripts designed in-house to fully automate throughput computing (serial/parallel/GPU farming). The scripts work on the national systems (Graham, Cedar, etc.) and on other clusters which use the same setup. The same set of scripts can be used with little or no modification to organize almost any type of farming workflow, including

  • either "one case per job" mode or "many cases per job" mode, with dynamic workload balancing for the latter;
  • capturing the exit status of all individual jobs;
  • the ability to automatically resubmit all the jobs which failed or never ran;
  • the ability to submit and independently operate multiple serial farms on the same cluster.

The key points about the package:

  • All serial farming jobs to be computed have to be described as separate lines (one line per job) in the file table.dat in the farm-specific directory. (One can run multiple farms independently; each farm has to have its own directory.)
  • In the "many cases per job" mode, the number of actual jobs (so-called "meta-jobs") submitted by the package is usually much smaller than the number of cases to process. Each meta-job can process multiple lines from table.dat (multiple cases). Running meta-jobs read lines from table.dat, starting from the first one, in a serialized manner (using a lockfile mechanism to prevent race conditions). This ensures a good dynamic workload balance between meta-jobs, as meta-jobs which happen to handle shorter cases will simply process more of them.
  • In the "many cases per job" mode, not all of the meta-jobs need to ever run. The first meta-job to run will start processing lines from table.dat; if/when the second job starts, it joins the first one, and so on. If the runtime of individual meta-jobs is long enough, it is possible to process all the cases with just a single running meta-job.
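The serialized reading of table.dat can be illustrated with a minimal bash sketch. This is not the actual package code (which uses the "lockfile" utility); here flock and the file names next_case / next_case.lock are invented for the illustration:

```shell
# Simplified sketch of how a meta-job could claim the next unprocessed case.
# A counter file "next_case" (hypothetical name) holds the number of the next
# line of table.dat to process.

get_next_case () {
    (
        flock 9                      # only one meta-job at a time past this point
        N=$(cat next_case 2>/dev/null || echo 1)
        echo $((N + 1)) > next_case  # reserve this case for ourselves
        echo "$N"
    ) 9> next_case.lock
}

CASE=$(get_next_case)
echo "This meta-job will process case $CASE"
```

Because each meta-job reserves a case inside the locked section, two meta-jobs can never grab the same line, and faster meta-jobs naturally claim more cases.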

Quick start guide

If you are impatient to start using the package, just follow the steps listed below. But it is highly recommended to also read the other details on this page.

  • Login to the cluster.
  • Use "git" to clone our META repository:
$ git clone git@git.sharcnet.ca:syam/META.git
  • Create directory ~/bin if you don't have one:
$ mkdir ~/bin
  • Move all the files inside META/bin subdirectory to ~/bin:
$ mv META/bin/* ~/bin
  • Add ~/bin to your $PATH variable (you can add the line below at the end of your ~/.bashrc file):
$ export PATH=/home/$USER/bin:$PATH
  • Inside the farm (META) directory, customize the files single_case.sh and job_script.sh and create your table.dat file (as described in the sections below), then execute
$ submit.run -1

for the one case per job mode, or

$ submit.run N

for the many cases per job mode (N is the number of meta-jobs to use).

  • To run another farm concurrently with the first one, create another directory - say, META1 - copy into it and customize the files single_case.sh and job_script.sh, and create a new table.dat file there. Also copy the code executable and all the input files as needed. You can then execute "submit.run N" inside META1 to submit the second farm.
  • To use any of the provided *.run utilities, one first has to cd to the corresponding farm subdirectory.
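The second-farm setup from the last bullet can be sketched as a shell session. The directory name META1 and the toy table contents are example names only; the fence below also creates a stand-in "original farm" directory so the commands are self-contained, and the final submit.run call (which exists only on the clusters) is shown as a comment:

```shell
# (Illustration only: create a toy "original farm" directory to copy from)
mkdir -p META
touch META/single_case.sh META/job_script.sh

# Create a directory for the second farm and copy the customizable files:
mkdir -p META1
cp META/single_case.sh META/job_script.sh META1/

# Create a new case table for the second farm (100 example cases):
for ((i=1; i<=100; i++)); do echo "./code2 -seed $i"; done > META1/table.dat

# Then, inside META1, one would run:  cd META1; submit.run 8
```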

Small number of cases

Overview

Let's call a single execution of the code in a serial/parallel farm a “case”. When the total number of cases, N_cases, is fairly small (say, <500) it is convenient to dedicate a separate job to each case. (You should make sure that each case runs for at least 10 minutes. If this is not the case, you should consider the "many cases per job" mode - see below.)

The three essential scripts are “submit.run”, “single_case.sh”, and "job_script.sh".

submit.run script

“submit.run” has one obligatory command line argument - the number of jobs to submit - which in the “one case per job” mode should be “-1”, e.g.

   $ submit.run -1 [optional_arguments]

The value of "-1" means "submit as many jobs as there are lines in table.dat".

All optional_arguments (there can be more than one) will be passed to the job submitting command, sbatch. They will be used for all meta-jobs submitted for this farm.

single_case.sh script

Another principal script, “single_case.sh”, is one of only two scripts (the other being job_script.sh) which might need customization. Its task is to read the corresponding line from table.dat, parse it, and use the data to launch your code for this particular case. The version of the file provided literally executes one full line from the case table (meaning that each line should start with the path to your code, or with the binary name if the binary is on your $PATH) in a separate subdirectory, RUNyyy (yyy being the case number).

“single_case.sh”:

...
# ++++++++++++++++++++++  This part can be customized:  ++++++++++++++++++++++++
#  Here:
#  $ID contains the case id from the original table (can be used to provide a unique seed to the code etc)
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
#  $SLURM_JOB_ID is the jobid for the current meta-job (convenient for creating per-job files)
 
mkdir -p RUN$ID
cd RUN$ID
 
echo "Case $ID:"
 
# Executing the command (a line from table.dat)
# It's allowed to use more than one shell command (separated by semicolons) on a single line
eval "$COMM"
 
# Exit status of the code:
STATUS=$?
 
cd ..
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...

Your table.dat can look like this:

 /home/user/bin/code1  1.0  10  2.1
 cp -f /input_dir/input1 .; /code_dir/code 
 ./code2 < IC.2
 sleep 10m
 ...

In other words, any executable statement(s) which can be written on one line can go there. Note: if you have more than one command on a single line, separated by semicolon(s), you must use the provided format:

eval "$COMM"

If there is only one command per line (which can include redirects), it is okay to use the simpler form:

$COMM

Often you'll want to edit the single_case.sh file to specify the code path/name there explicitly, in which case your table.dat file will only contain the command line switch(es) for your code and/or redirects. For example:

  • single_case.sh:
# ++++++++++++++++++++++  This part can be customized:  ++++++++++++++++++++++++
...
# Here we use $ID (case number) as a unique seed for Monte-Carlo type serial farming:
/path/to/your/code -par $COMM  -seed $ID
...
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  • table.dat:
 12.56
 21.35
 ...

Another note: “submit.run” will modify your table.dat once (it will add the case number at the beginning of each line, if you didn't do it yourself). The file table.dat can be used with submit.run either way (with or without the first case_ID column); this is handled automatically.
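For illustration only, the effect of that automatic numbering can be mimicked with an awk one-liner (this is not the package code, it just demonstrates the resulting format):

```shell
# A toy two-line case table:
printf 'fcode < data.1\nfcode < data.2\n' > table.dat

# Prepend the case ID to each line, as submit.run would:
awk '{print NR, $0}' table.dat > table_numbered.dat
cat table_numbered.dat
```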

Handling code exit status

What is “$STATUS” for in “single_case.sh”? It is a shell variable which should be set to “0” if your case was computed correctly, and to a value >0 otherwise. It is very important: it is used by “resubmit.run” to figure out which cases failed, so they can be re-computed. In the provided version of “single_case.sh”, $STATUS only reflects the exit code of your program. This likely won't cover all potential problems. (Some codes do not exit with a non-zero status even when something goes wrong.) You can always change or augment the $STATUS derivation in “single_case.sh”. E.g., if your code is supposed to create a new non-empty file (say, “out.dat”) at the very end of each case run, the existence of such a non-empty file can be used to judge whether the case failed:

  STATUS=$?
  if test ! -s out.dat
  then
     STATUS=1
  fi

In the above example, $STATUS will be positive if either code exit status is positive, or "out.dat" file doesn't exist or is empty.
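Another possible variant (an assumption for illustration, not part of the package): if your code prints a known success marker, $STATUS can be derived from the presence of that marker in the log. The marker "DONE" and the file name out.log below are hypothetical; adjust them for your code:

```shell
# Stand-in for the code run; in single_case.sh this would be  eval "$COMM".
# Here we assume (hypothetically) that the code prints "DONE" on success:
echo "calculation finished: DONE" > out.log

STATUS=$?                      # exit code of the code run
if ! grep -q "DONE" out.log
then
    STATUS=1                   # mark the case as failed, for resubmit.run
fi
echo "STATUS=$STATUS"
```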

job_script.sh script

The file job_script.sh is the SLURM job script which will be used by all meta-jobs in your serial farm. It can look like this:

#!/bin/bash
# Here you should provide the sbatch arguments to be used in all jobs in this serial farm
# It has to contain the runtime switch (either -t or --time):
#SBATCH -t 0-00:10
#SBATCH --mem=1000
#SBATCH -A def-user
 
# Don't change this line:
task.run

At the very least you'll have to change the account name (the "-A" switch) and the meta-job runtime (the "-t" switch). In the "one case per job" mode, you should request the job's runtime to be somewhat larger than the longest expected individual case run.

Important: your job_script.sh file must include the runtime switch (either -t or --time). This cannot be passed to sbatch as an optional argument to submit.run.

Sometimes the following problem happens: one of the meta-jobs is allocated on a node which has an issue causing your code to fail instantly (e.g., no GPU is available and your code needs a GPU, or the project file space is not mounted). This is definitely not normal, and issues like this need to be reported to Compute Canada. But if it does happen, your single bad meta-job can churn quickly through table.dat, so your whole farm fails. As a precaution, one can add a testing routine in job_script.sh, before the "task.run" line. For example, the following code will test for the presence of a GPU, and force the meta-job to exit if none is present - before it starts failing your serial farm cases:

gpu_test
retVal=$?
if [ $retVal -ne 0 ]; then
    exit 1
fi
 
task.run

You can copy the utility "gpu_test" to your ~/bin directory (only on graham and cedar):

cp ~syam/bin/gpu_test ~/bin

Output files

Once one or more meta-jobs in your farm are running, the following files will be created in the farm directory:

  • slurm-jobid.out files (one file per meta-job): standard output from jobs;
  • status.jobid files (one file per meta-job): files containing the status of processed cases.

In both cases, jobid stands for the jobid of the corresponding meta-job.

Also, every "submit.run" script execution will create a unique subdirectory inside "/home/$USER/tmp". Inside that subdirectory, some small scratch files (such as the files used by the "lockfile" command to serialize certain operations inside the jobs) will be created. These subdirectories have names "NODE.PID", where "NODE" is the name of the current node (typically a login node), and "PID" is the unique process ID of the script. Once the farm execution is done, one can safely erase this subdirectory.

Users normally don't need to access these files directly.

Auxiliary scripts

Other auxiliary scripts are also provided for your convenience.

  • list.run: will list all the jobs with their current state for the serial farm (no arguments).
  • query.run: will provide a one-line summary (number of queued / running / done jobs) for the farm, which is more convenient than using “list.run” when the number of jobs is large. It will also “prune” queued jobs if warranted (see below).
  • kill.run: will kill all the running/queued jobs in the farm.
  • prune.run: will only kill (remove) queued jobs.
  • Status.run (capital “S”!): will list the statuses of all processed cases. With the optional "-f" switch, the non-zero status lines (if any) will be listed at the end.
  • clean.run: will delete all the files in the current directory (including subdirectories, if any are present), except for the *.run scripts, job_script.sh, table.dat, and the bin subdirectory. Be very careful with this script! Note: the script will not restore the *.run scripts to their default state.

All of these commands (and also the (re)submit.run commands) have to be executed inside the subdirectory corresponding to this particular farm. If you run more than one farm, each of them has to have its own subdirectory, with its own versions of single_case.sh and job_script.sh files.

Resubmitting failed/never-run jobs

Finally, script “resubmit.run” is run the same way as “submit.run”, e.g.:

   $  resubmit.run -1 [optional_arguments]

“resubmit.run”:

  • will analyze all the status.* files (see #Output files);
  • will figure out which cases failed and which never ran for whatever reason (e.g., because of the meta-jobs' runtime limit);
  • will create a new case table (adding “_” at the end of the original table name), listing only the cases which still need to be run;
  • will use “submit.run” internally to launch a new farm for the unfinished/failed jobs.

Notes: You won't be able to run “resubmit.run” until all the jobs from the original run are done or killed. If some cases still fail or never run, one can resubmit the farm as many times as needed, with the same arguments as before.

Of course, if certain cases persistently fail, then there must be a problem with either your initial conditions/parameters or with your code (a code bug). It is convenient here to use the script "Status.run" (capital S!) to see a sorted list of statuses for all computed cases. With the optional argument "-f", the Status.run command will sort the output according to the exit status, showing non-zero status lines (if any) at the bottom, to make them easier to spot.

Large number of cases

Overview

The “one case per job” mode works fine when the number of cases is fairly small (<500). When N_cases >> 500, the following problems arise:

  • Each cluster has a limit on how many jobs a user can submit (for Graham, it is 1000).
  • Job submission becomes very slow. (With 1000 jobs and ~4 s per job submission, the submission alone will take over an hour.)
  • With a very large number of cases, each case run is typically short. If one case runs for less than ~20 min, you start wasting cpu cycles due to scheduling overheads.

The solution: instead of submitting a separate job for each case, one should submit a smaller number of "meta-jobs", each of which would process multiple cases. As cases can take different time to process, it is highly desirable to utilize a dynamic workload balancing scheme here.

This is how it is implemented:

[Diagram Meta1.png: "submit.run" submits N meta-jobs; each meta-job runs "task.run", which repeatedly claims the next unprocessed case from table.dat through a serialized section and executes "single_case.sh" for it.]

As the above diagram shows, the "submit.run" script in the "many cases per job" mode will submit N jobs, with N being a fairly small number (much smaller than the number of cases to process). Each job executes the same script - "task.run". Inside that script there is a "while" loop over the cases. Each iteration of the loop has to go through a serialized portion of the code (only one job at a time can be there), where it figures out which case (if any) to process next. Then the already familiar script "single_case.sh" (see #single_case.sh script) is executed - once per case - which in turn calls the user code.

This approach results in dynamic workload balancing achieved across all the running "meta-jobs" belonging to the same farm. This can be seen more clearly in the diagram below:

[Diagram Meta2.png: meta-jobs processing cases of varying runtimes; thanks to dynamic workload balancing, all meta-jobs finish at approximately the same time.]

The dynamic workload balancing results in all meta-jobs finishing around the same time, regardless of how different the runtimes are for individual cases, regardless of how fast CPUs are on different nodes, and regardless of whether all "meta-jobs" start at the same time (as the above diagram shows), or start at different times (which would normally be the case). In addition, this approach is very robust: not all meta-jobs need to start running for all the cases to be processed; if a meta-job dies (due to a node crash), at most one case will be lost. (The latter can be easily rectified by running the "resubmit.run" script; see #Resubmitting failed/never-run jobs.)

To enable the “multiple cases per job” mode (with dynamic workload balancing), the first argument to “submit.run” script should be the desired number of meta-jobs, e.g.:

   $  submit.run  32

Not all of the requested meta-jobs will necessarily run (this depends on how busy the cluster is). But as described above, in the "many cases per job" mode you will eventually get all your results regardless of how many meta-jobs actually run. (You might need to run "resubmit.run", sometimes more than once, to complete particularly large serial farms.)

Estimating the runtime and number of meta-jobs

How to figure out the optimum number of meta-jobs, and the runtime (to be used in job_script.sh)?

First you need to estimate the average runtime of an individual case (a single line in table.dat). One way to do this is to allocate a cpu with the salloc command, cd to the farm directory, and execute the single_case.sh script there multiple times, for different cases, measuring the total runtime and then dividing it by the number of cases. This can be conveniently achieved with a bash "for" loop:

   $  N=20; time for ((i=1; i<=$N; i++)); do  ./single_case.sh table.dat $i  ; done

The "real" time obtained with the above command should be divided by $N (20 in this example) to get the average case runtime estimate. Let's call it dt_case (in seconds).

Then you can estimate the total cpu time needed to process the whole farm by multiplying dt_case by the number of cases (the number of lines in table.dat). This will be in cpu-seconds; dividing by 3600 gives the amount of compute resources in cpu-hours. Multiply that by 1.1 - 1.3 to have a bit of a safety margin.

Now you can make a sensible choice for the runtime of meta-jobs, and that will also give you the number of meta-jobs needed to finish the whole farm.

The runtime you choose should be significantly larger (ideally by a factor of 100 or more) than the average runtime of individual cases. In any case, it should definitely be larger than the longest expected individual case runtime. On the other hand, it should not be too long (say, no more than 3 days), to avoid very long queue wait times (the longer a job's runtime, the fewer cluster nodes are available for it). A good choice is either 12 h or 1 day. Once you have settled on the runtime, divide the farm's total cpu time (in cpu-hours) by the meta-job runtime (in hours) to get the required number of meta-jobs (rounded up to the next integer).
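As a worked example (all numbers invented for illustration): suppose the measured average case runtime dt_case is 300 s and table.dat has 10,000 lines; with a 1.2 safety margin and 12-hour meta-jobs, the arithmetic above gives:

```shell
dt_case=300        # average case runtime, in seconds (measured as above)
N_cases=10000      # number of lines in table.dat
margin=12          # the 1.2 safety margin, stored as tenths for integer math

# Total farm size in cpu-hours, with the safety margin applied:
cpu_hours=$(( dt_case * N_cases * margin / 10 / 3600 ))

runtime=12         # chosen meta-job runtime, in hours

# Number of meta-jobs, rounded up to the next integer:
N_jobs=$(( (cpu_hours + runtime - 1) / runtime ))

echo "cpu-hours: $cpu_hours ; meta-jobs needed: $N_jobs"
```

Here the farm needs about 1000 cpu-hours, so 84 twelve-hour meta-jobs would be a sensible request.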

With the above choices, the queue wait time should be fairly small, and the throughput and efficiency of the farm should be fairly high.

For particularly large farms, if the number of jobs in the above analysis is larger than 1000 (the maximum number of jobs which can be submitted on Graham), the workaround would be to go through the sequence of commands (each command can only be executed after the previous farm has finished running):

   $  submit.run 1000
   $  resubmit.run 1000
   $  resubmit.run 1000
   ...

Runtime problem

Here is one potential problem when running multiple cases per job with dynamic workload balancing: what if the number of running meta-jobs times the requested runtime per meta-job (say, 3 days) is not enough to process all your cases? E.g., you managed to start the maximum allowed 1000 meta-jobs, each with a 3-day runtime limit. That means your serial farm can only process all the cases in a single run if average_case_runtime x N_cases < 1000 x 3 d = 3000 cpu-days. (In less perfect cases you will be able to run fewer than 1000 meta-jobs, resulting in an even smaller amount of cpu time your farm can process.) Once your meta-jobs start hitting the 3-day runtime limit, they will start dying in the middle of processing one of your cases. This will result in up to 1000 interrupted case calculations. This is not a big deal in terms of accounting ("resubmit.run" will find all the cases which failed or never ran, and will resubmit them automatically), but it can become a waste of cpu cycles, because many of your cases die half-way through. On average, you will be wasting 0.5 x N_jobs x average_case_runtime of cpu time. E.g., if your cases have an average runtime of 1 hour and you have 1000 meta-jobs running, you will waste ~20 cpu-days, which is not acceptable.
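The figure at the end of the paragraph above can be checked with two lines of shell arithmetic (using the numbers from that example: 1000 meta-jobs, 1-hour average case runtime):

```shell
N_jobs=1000          # number of running meta-jobs
dt_case_hours=1      # average case runtime, in hours

# Expected waste: 0.5 x N_jobs x average case runtime, converted to cpu-days:
waste_hours=$(( N_jobs * dt_case_hours / 2 ))
waste_days=$(( waste_hours / 24 ))
echo "~$waste_days cpu-days wasted"
```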

Fortunately, the scripts we are providing have some built-in intelligence to mitigate this problem. This is implemented in the "task.run" script as follows:

  • The script measures runtime of each case, and adds the value as one line in a scratch file "times" created inside /home/$USER/tmp/NODE.PID directory (see #Output files). This is done by all running meta-jobs.
  • Once the first 8 cases have been computed, one of the meta-jobs will read the contents of the file "times" and compute the upper 12.5% quantile of the current distribution of case runtimes. This serves as a conservative estimate of the runtime of your individual cases, t_runtime.
  • From then on, each meta-job will estimate whether it has enough time to finish the case it is about to start, by checking that t_finish - t_now > t_runtime. (Here t_finish is the time when the job will die because of the job's runtime limit, and t_now is the current time.) If it decides it doesn't have enough time, it will exit early, which minimizes the chance of a case computation being aborted half-way through by the job's runtime limit.
  • At every subsequent power-of-two number of computed cases (8, then 16, then 32, and so on), t_runtime is recomputed using the above algorithm, making the estimate more and more accurate. Powers of two are used to minimize the overhead of computing t_runtime; the algorithm is equally efficient for both very small (tens) and very large (many thousands) numbers of cases.
  • The above algorithm reduces the amount of cpu cycles wasted due to jobs hitting the runtime limit by a factor of 8.
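The time check described above can be sketched in bash as follows. This is a simplified illustration only, not the actual task.run code; in the real script t_finish would be derived from the job's start time and its runtime limit, and the numbers below are invented:

```shell
# Hypothetical values for illustration:
t_now=$(date +%s)              # current time, in seconds since the epoch
t_finish=$(( t_now + 7200 ))   # job dies in 2 hours (from the runtime limit)
t_runtime=600                  # conservative per-case runtime estimate, seconds

# Only start another case if there is enough time left to finish it:
if (( t_finish - t_now > t_runtime )); then
    echo "Enough time left - starting the next case"
else
    echo "Not enough time for another case - exiting early"
fi
```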

Additional information

Passing additional sbatch arguments

What if you need to pass additional sbatch arguments (like --mem 4G, --gres=gpu:1, etc.)? Simple: just add those arguments at the end of the “submit.run” or “resubmit.run” command line, and they will be passed to sbatch, e.g.:

   $  submit.run  -1  --mem 4G

Alternatively, you can supply these arguments as separate "#SBATCH" lines in your job_script.sh file.

Multi-threaded farming

For “multi-threaded farming” (OpenMP etc.), add "--cpus-per-task=N" and "--mem=XXX" sbatch arguments to “(re)submit.run” (or add the corresponding #SBATCH lines to your job_script.sh file). Here “N” is the number of cpu cores/threads to use. Also, add the following line inside your “job_script.sh” file right before the task.run line:

   export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

MPI farming

For “MPI farming”, use these sbatch arguments with “(re)submit.run” (or add the corresponding #SBATCH lines to your job_script.sh file):

   --ntasks=N  --mem-per-cpu=XXX

Also add “srun” before the path to your code inside “single_case.sh”, e.g.:

   srun  $COMM

Alternatively, you can prepend “srun” on each line of your table.dat:

   srun /path/to/mpi_code arg1 arg2
   srun /path/to/mpi_code arg1 arg2
   ...
   srun /path/to/mpi_code arg1 arg2

GPU farming

For GPU farming, you only need to modify your job_script.sh file accordingly. For example, for a farm where the code uses one GPU, add one extra line:

#SBATCH --gres=gpu:1

It is also a good idea to copy my utility ~syam/bin/gpu_test to your ~/bin directory (only on graham and cedar), and put the following lines in your job_script.sh file right before the "task.run" line:

gpu_test
retVal=$?
if [ $retVal -ne 0 ]; then
    echo "No GPU found - exiting..."
    exit 1
fi

This will catch those rare situations when a technical issue with the node renders the GPU unavailable. If that happens to one of your meta-jobs and you don't have the above lines in your script, the rogue meta-job will churn through (and fail) all your cases from table.dat.

FORTRAN code example: using standard input

You have a FORTRAN (or C/C++) serial code, “fcode”; each case needs to read a separate file from standard input – say “data.xxx” (in the /home/user/IC directory), where xxx goes from 1 to N_cases. Place “fcode” on your $PATH (e.g., in ~/bin, making sure /home/$USER/bin is added to $PATH in .bashrc), or use the full path to your code in the cases table. Create table.dat (inside the META directory) like this:

  fcode < /home/user/IC/data.1
  fcode < /home/user/IC/data.2
  ...
  fcode < /home/user/IC/data.N_cases

The task of creating the table can be greatly simplified if you use a BASH loop command, e.g.:

   $  for ((i=1; i<=10; i++)); do echo "fcode < /home/user/IC/data.$i"; done >table.dat

FORTRAN code example: copying files for each case

Another typical FORTRAN code situation: you need to copy a file (say, /path/to/data.xxx) to each case subdirectory before executing the code, renaming it to some standard input file name. Your table.dat can look like this:

  /path/to/code
  /path/to/code
  ...

Add one line (the first line in the example below) to your “single_case.sh”:

   \cp /path/to/data.$ID standard_name
   $COMM
   STATUS=$?

Using all the columns in the cases table explicitly

The examples shown so far presume that each line in the cases table is an executable statement: it starts with either the name of the code binary (when the binary is on your $PATH) or the full path to the binary, followed by the command line arguments (if any) particular to that case, or by a redirection such as " < input.$ID" if your code expects its initial conditions via standard input. (The first column, containing the line number, is added automatically by the scripts.)

In the most general case, one wants the flexibility to access each column in the table individually. That is easy to achieve by slightly modifying the "single_case.sh" script:

...
# ++++++++++++  This part can be customized:  ++++++++++++++++
#  $ID contains the case id from the original table
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
mkdir RUN$ID
cd RUN$ID
 
# Converting $COMM to an array:
COMM=( $COMM )
# Number of columns in COMM:
Ncol=${#COMM[@]}
# Now one can access the columns individually, as ${COMM[i]} , where i=0...$Ncol-1
# A range of columns can be accessed as ${COMM[@]:i:n} , where i is the first column
# to display, and n is the number of columns to display
# Use the ${COMM[@]:i} syntax to display all the columns starting from the i-th column
# (use for codes with a variable number of command line arguments).
 
# Call the user code here.
...
 
# Exit status of the code:
STATUS=$?
cd ..
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...

For example, suppose you need to provide your code with both an initial conditions file (used via standard input) and a variable number of command line arguments. Your cases table will look like this:

  /path/to/IC.1 0.1
  /path/to/IC.2 0.2 10
  ...

The way to implement this in "single_case.sh" is as follows:

# Call the user code here.
/path/to/code ${COMM[@]:1} < ${COMM[0]}
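As a quick, runnable illustration of the bash array slicing used above (the table line here is made up for demonstration):

```shell
#!/bin/bash
# A made-up table line: an IC file path followed by two numeric arguments
COMM="/path/to/IC.2 0.2 10"

# Convert the line to an array, as in single_case.sh:
COMM=( $COMM )
Ncol=${#COMM[@]}          # number of columns: 3

echo "${COMM[0]}"         # first column:  /path/to/IC.2
echo "${COMM[@]:1}"       # all columns after the first: 0.2 10
echo "${COMM[@]:1:1}"     # one column starting at index 1: 0.2
```

With this table line, the "/path/to/code ${COMM[@]:1} < ${COMM[0]}" invocation above expands to "/path/to/code 0.2 10 < /path/to/IC.2".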

Troubleshooting

Here we explain typical error messages you might get when using this package.

Problems with submit.run

"lockfile is not on path; exiting"

Make sure the utility lockfile is on your $PATH.

"File table.dat doesn't exist. Exiting"

You forgot to create the table.dat file in the current directory, or perhaps you are running submit.run from outside one of your farm subdirectories.

"Job runtime sbatch argument (-t or --time) is missing in job_script.sh. Exiting"

Make sure you provide the runtime for all meta-jobs as an #SBATCH argument inside your job_script.sh file. This is a requirement - the runtime sbatch argument is the only one which cannot be passed as an optional argument to submit.run.

"Wrong job runtime in job_script.sh - nnn . Exiting"

The runtime argument inside your job_script.sh file is not formatted properly.

Problems with resubmit.run

"Jobs are still running/queued; cannot resubmit"

You cannot use resubmit.run until all meta-jobs from this farm have finished running. Use list.run or queue.run to check the status of the farm.

"No failed/unfinished jobs; nothing to resubmit"

Not an error - this simply tells you that your farm was 100% processed, and there are no more (failed or never-run) cases to compute.

Problems with running jobs

"Too many failed (very short) cases - exiting"

This happens if the first $N_failed_max (5 by default) cases are very short - less than $dt_failed (5 by default) seconds in duration. The two variables, $N_failed_max and $dt_failed, can be adjusted by editing the task.run script. This is a protection mechanism for when something is amiss - a problem with the node (file system not mounted, GPU missing etc.), with the job parameters (not enough RAM etc.), or with the code (the binary missing or instantly crashing, input files missing etc.). This protection prevents a bad meta-job from churning through (and failing) all the cases in table.dat.
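The check just described could be sketched roughly as follows. This is a hypothetical illustration only; the function name check_case_duration and the counters N_done/N_short are made up, and the actual logic inside task.run differs in its details:

```shell
#!/bin/bash
# Hypothetical sketch of the "too many failed (very short) cases" protection.
N_failed_max=5   # give up if this many initial cases are all very short
dt_failed=5      # a case shorter than this many seconds counts as "very short"

N_done=0         # cases processed so far by this meta-job
N_short=0        # how many of them were very short

# Call after each case, passing its wall time in seconds;
# returns 1 (failure) when the protection triggers.
check_case_duration () {
  local duration=$1
  N_done=$((N_done + 1))
  if [ "$duration" -lt "$dt_failed" ]; then
    N_short=$((N_short + 1))
  fi
  # Trigger only if all of the first N_failed_max cases were very short:
  if [ "$N_done" -eq "$N_failed_max" ] && [ "$N_short" -eq "$N_failed_max" ]; then
    echo "Too many failed (very short) cases - exiting"
    return 1
  fi
  return 0
}
```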

"lockfile is not on path on node XXX"

As the error message suggests, the utility lockfile is not on your $PATH - either you forgot to copy lockfile into your ~/bin directory, forgot to modify your $PATH variable accordingly, or perhaps something is wrong on that particular compute node (e.g. the home file system is not mounted). The lockfile utility is critical for this package (it ensures serialized access of meta-jobs to the table.dat file), and the package won't work if the utility is not accessible.

"Exiting after processing one case (-1 option)"

This is actually not an error - it simply tells you that you submitted the farm via "submit.run -1" (one case per job mode), so each meta-job exits after processing a single case.

"Not enough runtime left; exiting."

This message tells you that the meta-job would likely not have enough runtime left to process the next case (based on the analysis of the runtimes of all the cases processed so far), so it is exiting early.

"No cases left; exiting."

This is not an error message - this is how each meta-job normally finishes, when all cases have already been computed.

Words of caution

  • Always start with a much smaller test farm, to make sure everything works, before submitting a large production farm. You can test individual cases by reserving an interactive node with the "salloc" command, cd'ing to the farm directory, and executing commands like "./single_case.sh table.dat 1", "./single_case.sh table.dat 2" etc.
  • If your farm is particularly large (say > 10,000 cases), extra effort has to be spent to make sure it runs as efficiently as possible. In particular, you have to minimize the number of files and/or directories created during job execution. If possible, instruct your code to append to existing results files (one per meta-job; do not mix results from different meta-jobs in a single output file!) instead of creating a separate results file for each case. Definitely avoid creating a separate subdirectory for each case (which is the default setup of this package). The following example (optimized for a large number of cases) assumes that your code accepts the output file name via the "-o" command line switch, that the output file is used in "append" mode (multiple code runs keep adding to the existing file), and that each line of table.dat provides the rest of the command line switches for your code. It is also assumed that multiple instances of your code can safely run concurrently inside the same directory (so there is no need to create a subdirectory for each case), and that each code run will not produce any other files (besides the output file). With this setup, even very large farms (hundreds of thousands or even millions of cases) should run fairly efficiently, as very few files will be generated.
...
# ++++++++++++++++++++++  This part can be customized:  ++++++++++++++++++++++++
#  Here:
#  $ID contains the case id from the original table (can be used to provide a unique seed to the code etc)
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
#  $SLURM_JOB_ID is the jobid for the current meta-job (convenient for creating per-job files)
 
# Executing the command (a line from table.dat)
/path/to/your/code  $COMM  -o output.$SLURM_JOB_ID 
 
# Exit status of the code:
STATUS=$?
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...

If more help is needed

Submit a ticket to the Compute Canada ticketing system (by sending an email to support@computecanada.ca), mentioning the name of the package (META) and the name of the staff member who wrote the software (Sergey Mashchenko).