This page explains how to monitor and acquire information about jobs that are scheduled, running or finished at SHARCNET. This includes finding the nodes a job is running on, when it started running, whether or not it has been suspended during the course of its execution, etc.
- 1 Job Scheduling
- 2 Job Identifier
- 3 Web Portal Jobs Database
- 4 Getting the system's view of a job
- 5 Inspecting running jobs
SHARCNET provides users with a unified interface to our various schedulers. This system is called SQ and you can read more about it in the Knowledge Base. The rest of this document assumes you are familiar with SQ.
Behind SQ, SHARCNET primarily deploys the LSF (/SLURM) and Torque/Moab job schedulers and resource managers. These platforms do not behave identically, so a user who wants to work on the command line of a given machine must be familiar with the appropriate platform in order to understand how their jobs are handled.
Job Output File Behavior
On Torque/Moab systems, job output is spooled and only copied to the final destination file when the job completes. On LSF, users can view their job output while the job is executing, modulo any buffering at the runtime/OS level.
All jobs that are submitted to run on SHARCNET systems are assigned a unique job identifier value, sometimes referred to as a Job ID or just jobid. This value is used by the system's job scheduler to keep track of the job, and it is also used by the back-end SHARCNET job database to identify the job. Knowing the Job ID allows one to look up information both on the system's job scheduler and in the job database.
Finding the Job Identifier
At job submission
When one submits a job with sqsub, the jobid is returned on the command line, e.g.
    [merz@wha781 ~]$ sqsub -r 10m -q serial -t -o date.out.wh date
    submitted as jobid 3224233
In this case the jobid is 3224233.
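When submitting from a script it can be handy to capture the jobid rather than read it off the screen. The following is a minimal sketch, assuming sqsub prints the confirmation in the "submitted as jobid N" form shown above and writes it to standard output; the function name jobid_from_sqsub is purely illustrative and not part of SQ.

```shell
# Read sqsub's confirmation text on stdin and print only the numeric jobid.
# Assumes the "submitted as jobid N" wording from the example above.
jobid_from_sqsub() {
    sed -n 's/.*submitted as jobid \([0-9][0-9]*\).*/\1/p'
}

# Hypothetical usage on a login node (not executed here):
#   jobid=$(sqsub -r 10m -q serial -t -o date.out.wh date | jobid_from_sqsub)
#   echo "$jobid" >> my_jobids.txt
```

Keeping the parsing in a small function that reads stdin means it can be checked against a saved line of output without submitting a real job.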
While the job is queued / running or has recently finished
This value is also listed in the jobid column when one runs sqjobs, e.g.
    [merz@wha781 ~]$ sqjobs
      jobid queue state ncpus    prio nodes time command
    ------- ----- ----- ----- ------- ----- ---- -------
    3224233  test     Q     1 333.312         5s date
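If you have many jobs, the listing can be filtered by state. This is a rough sketch, assuming the column layout shown in the sqjobs example (jobid first, state third); jobids_in_state is a hypothetical helper name, not an SQ command.

```shell
# Print the jobid of every job in the given state (e.g. Q or R).
# Reads sqjobs output on stdin; header and separator lines are skipped
# because their first field is not a number.
jobids_in_state() {
    awk -v want="$1" '$1 ~ /^[0-9]+$/ && $3 == want { print $1 }'
}

# Hypothetical usage (not executed here):
#   sqjobs | jobids_in_state Q
```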
After the job has finished
On requin (which runs LSF), one can find the jobid in the job output file (every job submitted with sqsub should use the -o flag to specify a job output file!). For example, searching an output file named job_output.requin.out:
    [merz@req769 ~]$ cat job_output.requin.out | grep "Subject: Job"
    Subject: Job 3224233: <date> Done
On most other SHARCNET systems (which run Torque+Maui/Moab), one can find the jobid in the job output file as well:
    [merz@orc-login1 ~]$ cat job_output.orca.out | grep "job id"
    job id: 3224233
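The two grep commands can be folded into one helper that handles either style of output file. This is only a sketch, assuming the "Subject: Job N:" and "job id: N" lines look exactly as in the examples above; the name jobid_from_output is made up for illustration.

```shell
# Print the first jobid found in a job output file (or on stdin).
# Recognises the LSF-style "Subject: Job N:" line and the
# Torque-style "job id: N" line shown in the examples.
jobid_from_output() {
    sed -n -e 's/.*Subject: Job \([0-9][0-9]*\):.*/\1/p' \
           -e 's/.*job id: *\([0-9][0-9]*\).*/\1/p' "$@" | head -n 1
}

# Hypothetical usage (not executed here):
#   jobid_from_output job_output.orca.out
```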
Web Portal Jobs Database
As a job progresses from waiting in the queue, through to running and on to completion, information about the job is logged to the SHARCNET job database.
You can look at the information associated with any of your jobs by visiting your web portal activity page. This page is helpful if you can't remember the details about a job and have lost or deleted the output file. The jobs are currently (Oct 2013) listed in a tabular format at the bottom of the activity page as follows:
An example of the jobs listing which can be found in a user's activity page in the web portal.
Note: you may have to change the configuration to see the details you are interested in
By clicking on the values in the Job ID column in the jobs table one can access a job summary page that presents much of the same information provided by the bhist and qstat commands below.
Getting the system's view of a job
If a job is queued, running or recently completed you should be able to query the job scheduler on the system directly (either LSF or Torque/Moab) to get accurate and timely information about the state of your job.
If your job completed more than a couple of days ago, it was likely flushed out of the system's records and can only be found via the jobs database in the web portal.
If you don't know the jobid, you can look through your job listing in the web portal to find it.
To find out information about a job on a particular system running the LSF job scheduler one can use the bhist command. For example:
    [merz@wha781 ~]$ bhist -l 3224233

    Job <3224233>, User <merz>, Project <600>, Command <date>
    Thu May 27 10:19:01: Submitted from host <wha781>, to Queue <test>, CWD <$HOME>
                         , Output File <date.out.wh>;

     RUNLIMIT
     10.0 min of wha781
    Thu May 27 10:19:06: Dispatched to <wha75>;
    Thu May 27 10:19:06: Starting (Pid 14217);
    Thu May 27 10:19:07: Running with execution home </home/merz>, Execution CWD </
                         home/merz>, Execution Pid <14217>;
    Thu May 27 10:19:07: Done successfully. The CPU time used is 0.0 seconds;
    Thu May 27 10:19:07: Post job process done successfully;

    Summary of time in seconds spent in various states by Thu May 27 10:19:07
      PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
      5        0        1        0        0        0        6
The history shows a number of things, including when the job was submitted to the system, when it started and finished, its exit state, the compute nodes used, resource usage, etc. If a job was suspended at any point, that will also show up in this history (see the PSUSP, USUSP and SSUSP columns of the summary).
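To check quickly whether a job spent any time suspended, the summary table at the end of the bhist output can be totalled. The sketch below assumes the summary has the column order shown in the example (PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL) followed by a single row of values; bhist_suspended_seconds is a hypothetical helper name.

```shell
# Read "bhist -l" output on stdin and print the total seconds the job
# spent in the three suspended states (PSUSP + USUSP + SSUSP).
bhist_suspended_seconds() {
    awk '$1 == "PEND" && $2 == "PSUSP" {
             getline                 # move to the row of values
             print $2 + $4 + $5      # PSUSP, USUSP, SSUSP columns
             exit
         }'
}

# Hypothetical usage (not executed here):
#   bhist -l 3224233 | bhist_suspended_seconds
```

For the job in the example this would print 0, since all three suspended columns are zero.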
On some systems (especially whale) the system's job log grows quickly and is turned over frequently. To find older jobs you may have to add the -n X flag to bhist, where X is an integer. A value of 10 or 30 usually suffices; the larger the number, the longer the command takes to complete, as it searches through more log files.
To find further details about a job running on a system that uses the Torque/Moab job scheduler, one should use the qstat command, e.g.
    [merz@hnd50 ~]$ qstat -f 194242
    Job Id: 194242.hnd51
        Job_Name = date
        Job_Owner = merz@hnd50
        resources_used.cput = 00:00:00
        resources_used.mem = 0kb
        resources_used.vmem = 0kb
        resources_used.walltime = 00:00:00
        job_state = C
        queue = test
        server = hnd51
        Checkpoint = u
        ctime = Thu May 27 11:19:52 2010
        Error_Path = hnd50:/home/merz/date.e194242
        exec_host = hnd21/0
        Hold_Types = n
        Join_Path = oe
        Keep_Files = n
        Mail_Points = n
        mtime = Thu May 27 11:19:53 2010
        Output_Path = hnd50:/home/merz/date.out.ho
        Priority = 0
        qtime = Thu May 27 11:19:52 2010
        Rerunable = False
        Resource_List.cput = 00:10:00
        Resource_List.nodect = 1
        Resource_List.nodes = 1
        Resource_List.pvmem = 3072mb
        Resource_List.walltime = 00:10:00
        session_id = 12358
        Variable_List = PBS_O_HOME=/home/merz,PBS_O_LANG=C,PBS_O_LOGNAME=merz,
            PBS_O_PATH=~/bin/blast-2.2.21/bin/:/opt/sharcnet/vmd/current/bin:/opt
            <SNIP> This is a long list .... </SNIP>
            OMP_NUM_THREADS=1,PBS_O_QUEUE=test
        etime = Thu May 27 11:19:52 2010
        exit_status = 1
        submit_args = -V -r n -j oe -o /home/merz/date.out.ho -j oe -q test -N dat
            e -d /home/merz -l walltime=0:10:00 -l cput=0:10:00 -l pvmem=3072m
            -m n -l nodes=1 -
        start_time = Thu May 27 11:19:53 2010
        start_count = 1
        comp_time = Thu May 27 11:19:53 2010
This will show you all of the scheduler properties about the job (nodes used, resource limits, timestamps, etc.) as well as information concerning the execution environment of the job.
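Since the qstat -f output is long, it is often easier to pull out a single attribute. The following sketch assumes the "name = value" layout shown in the example above; qstat_field is an illustrative helper, not a Torque command, and it returns only the first line of values that wrap onto continuation lines (such as Variable_List).

```shell
# Read "qstat -f" output on stdin and print the value of one attribute,
# e.g. exec_host, job_state or exit_status.
qstat_field() {
    awk -v key="$1" '
        { line = $0; sub(/^[ \t]+/, "", line) }   # drop leading indentation
        index(line, key " = ") == 1 {             # line starts with "key = "
            print substr(line, length(key) + 4)   # text after "key = "
            exit
        }'
}

# Hypothetical usage (not executed here):
#   qstat -f 194242 | qstat_field exec_host
```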
As with LSF, this data is turned over frequently, so your job may no longer be listed on the cluster. In that case, information about the job is only available from the jobs database via the web portal.
Inspecting running jobs
In order of increasing complexity, the different ways to look at your job's underlying processes and the compute nodes it is using include:
- sqjobs -L command
- logging into the nodes directly and running diagnostic commands like ps and top
- looking at the ganglia plots for the nodes the job was running on at the given time
By running sqjobs with the -L option you can get a listing of all the processes involved in your job, in the typical ps output format. For example:
    [merz@wha780 merz]$ sqsub -t -q serial -r 10m -o mafft_test.1 mafft --auto /home/merz/input_sequences
    submitted as jobid 3229987
    <wait for job to start...>
    [merz@wha780 merz]$ sqjobs -L
      jobid hostid   pid state resident virtual %cpu command
    ------- ------ ----- ----- -------- ------- ---- -------------------------------
    3229987      9 19728     R   478916  592436 99.4 ~merz/lib/mafft/disttbfast -b 6
    tot_rss 467.7M tot_vsz 578.6M avg_pcpu 97.6% cur_pcpu 99.4%
      jobid queue state ncpus    prio nodes time command
    ------- ----- ----- ----- ------- ----- ---- -----------------------------------
    3229987  test     R     1 166.495  wha9  39s mafft --auto /home/merz/input_seque
    2712 CPUs total, 16 idle, 2696 busy; 2149 jobs running; 0 suspended, 1533 queued.
    [merz@wha780 merz]$
Only 1 serial job is running; it is using ~500MB of memory and 99.4% of a CPU on wha9.
For MPI jobs the process listing will include all processes participating in the job.
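For jobs with many processes it can help to flag those that are using little CPU. This is a rough sketch, assuming the process rows of sqjobs -L have the columns shown above (jobid, hostid, pid, state, resident, virtual, %cpu, command); low_cpu_processes is a hypothetical helper name and the threshold is arbitrary.

```shell
# Read "sqjobs -L" output on stdin and print "hostid pid %cpu" for every
# process whose %cpu is below the given threshold. Process rows are
# recognised by having numeric jobid, hostid and pid fields, which
# excludes the job summary rows (their second field is the queue name).
low_cpu_processes() {
    awk -v max="$1" '
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && ($7 + 0) < (max + 0) {
            print $2, $3, $7
        }'
}

# Hypothetical usage (not executed here):
#   sqjobs -L | low_cpu_processes 50
```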
Logging into compute nodes
One can also log directly into the compute nodes and run ps and top to watch the job execute in real time. We do not recommend this, and you should refrain from running anything other than simple diagnostic commands on the compute nodes. The list of nodes participating in the job can be found via the methods described above.
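As an illustration, the node name can be taken from the nodes column of the sqjobs listing and then used for a single, lightweight ps call. This is a sketch under assumptions: it relies on the job summary row layout shown earlier (state in the third column, nodes in the sixth), the source does not show how multi-node jobs appear in that column, and nodes_for_job is a made-up helper name.

```shell
# Read sqjobs output on stdin and print the nodes column for a running job.
nodes_for_job() {
    awk -v id="$1" '$1 == id && $3 == "R" { print $6 }'
}

# Hypothetical usage (not executed here); keep it to simple diagnostics:
#   node=$(sqjobs | nodes_for_job 3229987)
#   ssh "$node" ps -u "$USER" -o pid,pcpu,rss,args
```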
SHARCNET is currently in the process of bringing the Ganglia monitoring system online. If you know the time period during which your job was running and the nodes it was running on, you can inspect a wide variety of performance metrics that were collected while your job was executing. The main entry point for Ganglia is http://ganglia.sharcnet.ca. Once there, you can reach the timeline plots for a particular node by clicking on its cluster and then on the node itself, for example the plots for kraken/narwhal node nar111 for the last hour.
One thing to keep in mind about Ganglia data is that it only captures information at the node level. If your job was sharing the node with other jobs (which is common for serial jobs), you will be looking at the aggregate performance of all jobs on the node at that time. To avoid this, request a full node for your job.