|Note: This page's content applies only to SHARCNET's legacy systems; it does not apply to Graham.|
|Description: A set of generic tools for submitting and managing jobs on SHARCNET clusters.|
|SHARCNET Package information: see SQ software page in web portal|
|Full list of SHARCNET supported software|
To submit a job, certain basic information must be provided:
- input/output/error file locations (where the program should read from, write to)
- the type of job: serial, threaded, MPI (this is closely related to the concept of a queue)
- resource requirements of the job (such as number of cores or amount of memory)
- the command (what to do, such as running a program with certain arguments)
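As a rough sketch, the items above map onto sqsub options as shown in this dry run, which only prints the command line instead of submitting it. It uses only options that appear in the examples further down this page; the variable names, file names and program name are hypothetical.

```shell
#!/bin/sh
# Dry run: assemble an sqsub command line from the pieces of information
# listed above and print it rather than submitting it.
# All variable names, file names and the program name are hypothetical.

queue="mpi"          # type of job: serial, threaded or mpi (-q)
ncpus=4              # resource requirement: number of cpus (-n)
runtime="10m"        # resource requirement: run-time limit (-r)
infile="in"          # where the program reads stdin from (-i)
outfile="out.%J"     # where stdout goes; %J is replaced by the job ID (-o)
errfile="err.%J"     # where stderr goes (-e)
command="./mympiprog"  # what to run

build_sqsub() {
    echo "sqsub -q $queue -n $ncpus -r$runtime -i $infile -o $outfile -e $errfile $command"
}

build_sqsub
```

Printing the command first makes it easy to check each piece before a real submission.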
Fundamentally, a cluster's scheduler is trying to optimize throughput, with jobs completing correctly and without interfering with each other. Each node in a cluster has a fixed set of resources, and jobs must reserve cpus and memory for a period of time to complete. Jobs that exceed their time limit will be killed. Jobs will be refused access to memory exceeding their request, and won't be allowed onto cpus other than those made available by the scheduler.
It is often necessary to experiment with job parameters: for instance, you may not know accurately how long a program will take, or how much memory it requires. You should experiment by first submitting your job with generous estimates, and noting the job's actual usage. If you can provide a tighter bound on your job's usage, it will be scheduled sooner (since larger resource requests normally take longer to accommodate).
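One way to turn an observed run time into a tighter request is to add a safety margin and round up. The sketch below is only an illustration: the measured time (37 minutes) and the 25% margin are made-up values, and it assumes -r accepts any whole number of minutes in the same form as the -r10m used on this page.

```shell
#!/bin/sh
# Derive a tighter -r value from an observed run time plus a safety margin.
# measured_min and margin_pct are hypothetical values chosen for illustration.

measured_min=37    # run time observed in a trial run, in minutes
margin_pct=25      # headroom so a slightly slower run is not killed

runtime_limit() {
    # integer arithmetic, rounded up to the next whole minute
    echo $(( (measured_min * (100 + margin_pct) + 99) / 100 ))
}

# Print the resulting command rather than submitting it.
echo "sqsub -r$(runtime_limit)m -o out ./myprogram"
```

The margin matters because a job that exceeds its time limit is killed, so the request should stay safely above the longest run you expect.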
The simplest example is this:
sqsub -r10m -o out ./myprogram
This submits a job to run myprogram for up to 10 minutes, with its output directed to the file named out. This example relies on a number of defaults, which may not be appropriate for a real job: the job will run in the serial queue, use only 1 cpu, and be limited to a default amount of memory (1GB on most clusters).
sqsub -r1d -o out.%J -m hostname
-m requests that the scheduler send email when the job ends (either successfully or not). This is actually a short form of --mail-end, which has a corresponding --mail-start; there is also --mail-abort for failures.
sqsub -f xeon -r1d -o out.%J -m hostname
-f xeon requests that the job be run on a xeon node - the other option is -f opteron.
sqsub -q mpi --nompirun -n 4 ./myscript
By default, an MPI job is automatically run with the appropriate mpirun (and its machine- and job-specific parameters). In this example, no such handling is provided, so the user's ./myscript must provide these details.
sqsub -r1d --depends 12345 -o out ./myprog
The scheduler will not consider starting this job until job 12345 is complete.
sqsub -r1d -q mpi --nodes=orc[1-2,4] -n 3 --ppn 1 -o out ./myprog
The scheduler will run this job on the three nodes named, with one cpu on each node.
sqsub -r1d -q threaded -n 8 --mpp 4G -o molecule.out g03 molecule.gjf
The program (Gaussian) will be run with 8 CPUs and 4GB of memory. (Due to the way Gaussian handles memory, this should correspond to a %mem of about 2GB.)
sqsub -r7d -q mpi -n 128 --ppn 4 --mpp 3G -i in -o out.%J -e err.%J ./mympiprog
This job will run across 32 nodes, with 4 ranks on each and 3GB per process. The program's stdin will be redirected from the file 'in', and stdout and stderr will go to the corresponding files, each named with the job ID.
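The node count in the last example follows from dividing the number of ranks (-n) by the ranks per node (--ppn). The sketch below repeats that arithmetic and also works out the implied memory per node and in total; the variable and function names are hypothetical, and the memory figures are simple multiplication rather than anything sqsub reports.

```shell
#!/bin/sh
# Work out the footprint of: sqsub -q mpi -n 128 --ppn 4 --mpp 3G ...
# Variable and function names are hypothetical.

n=128        # total MPI ranks (-n)
ppn=4        # ranks per node (--ppn)
mpp_gb=3     # memory per process in GB (--mpp 3G)

footprint() {
    nodes=$(( (n + ppn - 1) / ppn ))      # round up if n is not a multiple of ppn
    mem_per_node_gb=$(( ppn * mpp_gb ))   # memory reserved on each node
    total_mem_gb=$(( n * mpp_gb ))        # memory reserved across the whole job
    echo "nodes=$nodes mem_per_node=${mem_per_node_gb}G total_mem=${total_mem_gb}G"
}

footprint   # prints: nodes=32 mem_per_node=12G total_mem=384G
```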