
Compiling and Running Programs

For information about compiling your programs on Orca, Graham, and other national Compute Canada systems, please see the Installing software in your home directory page on the Compute Canada wiki.

For information about how to compile on older SHARCNET systems, see Legacy Systems.

How do I run a program interactively?

For running interactive jobs on Graham and other national systems, see the Running jobs page on the Compute Canada wiki.
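
For an interactive session on a Slurm system such as Graham, a minimal sketch is to request one with salloc (the time, memory, and account values below are placeholders; replace def-someuser with your own allocation):

 salloc --time=1:00:00 --mem-per-cpu=1G --cpus-per-task=1 --account=def-someuser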

For interactive jobs on legacy systems, see Legacy Systems.

My application runs on Windows, can I run it on SHARCNET?

It depends. If your application is written in a high-level language such as C, C++, or Fortran and is system independent (that is, it does not depend on third-party libraries available only for Windows), then you should be able to recompile and run it on SHARCNET systems. However, if your application depends entirely on Windows-only software, it will not run on the Linux compute nodes. In general it is not possible to convert binaries between Windows and UNIX platforms. For options relating to running Windows in virtual machines, see the Creating a Windows VM page at the Compute Canada Wiki.

My application runs on Windows HPC clusters, can I run it on SHARCNET clusters?

If your application does not use any Windows-specific APIs, you should be able to recompile and run it on SHARCNET's UNIX/Linux based clusters.

My program needs to run for more than seven (7) days; what can I do?

The seven day run-time limit on legacy systems cannot be exceeded. This is done primarily to encourage the practice of checkpointing, but it also prevents users from monopolizing large amounts of resources outside of dedicated allocations with long-running jobs, ensures that jobs free up nodes often enough for the scheduler to start large jobs in a modest amount of time, and allows us to drain all systems for maintenance within a reasonable time frame.

In order to run a program that requires more than this amount of wall-clock time, you will have to make use of a checkpoint/restart mechanism so that the program can periodically save its state and be resubmitted to the queues, picking up from where it left off. It is crucial to store checkpoints so that one can avoid lengthy delays in obtaining results in the event of a failure. Investing time in testing and ensuring that one's checkpoint/resume works properly is inconvenient but ensures that valuable time and electricity are not wasted unduly in the long run. Redoing a long calculation is expensive.

Although you are encouraged to always use checkpointing for long-running workloads, there are a small number of nodes available for 28 day run times on the national general purpose systems Graham and Cedar.

Handling long jobs with chained job submission

On systems that use the Slurm scheduler (e.g. Orca and Graham), job dependencies can be used so that the start of one job is contingent on the completion of another. This is expressed via the optional dependency input to sbatch, specified as follows in the job submission script:

    #SBATCH --dependency=afterok:<jobid>
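
As a minimal sketch, the same dependency can also be supplied on the command line when submitting a chain of two jobs (the script names job1.sh and job2.sh are placeholders):

 jobid=$(sbatch --parsable job1.sh)          # submit the first job and capture its job ID
 sbatch --dependency=afterok:$jobid job2.sh  # job2 starts only if job1 completes successfully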

Other strategies for resubmitting jobs for long running computations on the Slurm scheduled systems are described on the Compute Canada Wiki.

How do I checkpoint/restart my program?

Checkpointing is a valuable strategy that minimizes the loss of compute time should a long-running job be unexpectedly killed by a power outage, node failure, or hitting its runtime limit. On the national systems checkpointing can be accomplished manually, by creating and loading your own custom checkpoint files, or by using the Distributed MultiThreaded CheckPointing (DMTCP) software, which does not require recompiling your program. For further documentation of checkpointing and DMTCP, see the Checkpoints page at the Compute Canada Wiki site.
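
For example, a minimal sketch of running a serial program under DMTCP, assuming a dmtcp module is available and the program is called myprog (both names are assumptions):

 module load dmtcp                  # load DMTCP (the module name may differ)
 dmtcp_launch -i 3600 ./myprog      # run under DMTCP, writing a checkpoint every 3600 seconds
 # after a failure or timeout, resume from the saved checkpoint in the same directory:
 ./dmtcp_restart_script.sh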

If your program is MPI based (or any other type of program requiring a specialized job starter), it will have to be coded specifically to save its state and restart from that state on its own. Please check the documentation that accompanies any software you are using to see what support it has for checkpointing. If the code was written from scratch, you will need to build checkpointing functionality into it yourself: output all relevant parameters and state so that the program can subsequently be restarted, reading in those saved values and picking up where it left off.

How can I find out when my job will start?

The Slurm scheduler can report expected start times for queued jobs as output from the squeue command. For example, the following command returns the current jobs for user 'username', with columns for job ID, job name, job state, start time (N/A if there is no estimate), and node list (or reason pending):

$ squeue -u username -o "%.10i%.24j%.12T%.24S%.24R"
    JOBID                    NAME       STATE              START_TIME        NODELIST(REASON)
 12345678                  mpi.sh     PENDING                     N/A              (Priority)

It is important to note that the estimated start time listed in the START_TIME column (if available) can change substantially over time. This estimate is based on the current state of the compute nodes and the list of jobs in the queue. Because both are constantly changing, the start time estimates for pending jobs can change for several reasons (running jobs end sooner than expected, higher priority jobs enter the queue, etc.). For more information regarding the variables that affect wait times in the queue, see the job scheduling policy page at the Compute Canada Wiki site.

Is package X preinstalled on system Y, and, if so, how do I run it?

The software packages that are installed and maintained on the national systems are listed on the available software page of the Compute Canada Wiki site. Some packages have specific documentation for running on the national systems; for those, follow the link in the 'Documentation' column of the table of globally installed modules.
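
On the national systems, which use the Lmod module system, you can also search for and load an installed package from the command line; for example (the package and version shown are only illustrative):

 module spider gcc        # list the available versions of a package
 module load gcc/7.3.0    # load a specific version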

For legacy SHARCNET systems the list of preinstalled packages (with running instructions) can be found on the SHARCNET software page.

Command 'top' gives me two different memory sizes (virt, res). What is the difference between 'virtual' and 'real' memory?

'virt' refers to the total virtual address space of the process, including virtual space that has been allocated but never actually instantiated, memory that was instantiated but has since been swapped out, and memory that may be shared. 'res' is memory which is actually resident, that is, instantiated with real RAM pages. Resident memory is normally the more meaningful value, since it can be judged relative to the memory available on the node. (Recognizing, of course, that the memory on a node must be divided among the resident pages of all the processes, so an individual thread must always strive to keep its working set a little smaller than the node's total memory divided by the number of processors.)

There are two cases where the virtual address space size is significant. One is when the process is thrashing, that is, its working set is larger than the available memory. Such a process will spend a lot of time in the 'D' state, since it is waiting for pages to be swapped in or out. A node on which this is happening will show a substantial paging rate in the 'si' column of output from vmstat (the 'so' column is normally less significant, since si/so do not necessarily balance).
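
For example, you can watch a node's paging activity by sampling vmstat every few seconds and checking the si/so columns:

 vmstat 5    # print memory and paging statistics every 5 seconds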

The second case where virtual size matters is that the kernel does not implement RLIMIT_RSS, but does enforce RLIMIT_AS (virtual size). We intend to enforce a sanity-check RLIMIT_AS, and in some cases already do; the goal is to avoid a node becoming unusable or crashing when a job uses too much memory. Current settings are very conservative, though: 150% of physical memory.

In this particular case, the huge virtual size relative to the resident size is almost certainly due to the way Silky implements MPI using shared memory. Such memory is counted as part of every process involved, but this obviously does not mean that N * 26.2 GB of RAM is in use.

In this case, the real memory footprint of the MPI rank is 1.2 GB; if you ran the same code on another cluster which did not have NUMAlink shared memory, both the resident and virtual sizes would be about that much. Since most of our clusters have at least 2 GB per core, this code could run comfortably on other clusters.

Can I use a script to compile and run programs?

Yes. For instance, suppose you have a number of source files main.f, sub1.f, sub2.f, ..., subN.f. To compile these sources into an executable named myprog, you would typically type the following command:

ifort main.f sub1.f sub2.f ... subN.f -llapack -o myprog

Here, the -o option specifies the executable name myprog rather than the default a.out, and the option -llapack at the end tells the compiler to link your program against the LAPACK library (needed if LAPACK routines are called in your program). If you have a long list of files, typing the above command every time can be tedious. You can instead put the command in a file, say, mycomp, then make mycomp executable by typing the following command:

chmod +x mycomp

Then you can just type

./mycomp

at the command line to compile your program.
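
As a sketch, the mycomp file might simply contain the compile command shown above (list all of your source files in place of the three shown):

 #!/bin/bash
 # compile the sources and link against LAPACK, producing the executable myprog
 ifort main.f sub1.f sub2.f subN.f -llapack -o myprog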

This is a simple way to minimize typing, but it may wind up recompiling code which has not changed. A widely used improvement, especially for larger or more numerous source files, is to use make. make recompiles only those source files which have changed since the last compilation, minimizing the time spent waiting for the compiler. On the other hand, compilers will often produce faster code if they are given all the sources at once (as above).

I have a program that runs on my workstation, how can I have it run in parallel?

If the program was written without parallelism in mind, then there is very little you can do to run it in parallel automatically. Some compilers are able to translate certain serial portions of a program, such as loops, into equivalent parallel code, which lets you exploit the parallelism available mostly in symmetric multiprocessing (SMP) systems. Also, some libraries are able to use parallelism internally, without any change to the user's program. For this to work, your program needs to spend most of its time in the library, of course; the parallel library does not speed up the rest of your program. Examples of this include threaded linear algebra and FFT libraries.

However, to gain true parallelism and scalability, you will need to either rewrite the code using the message passing interface (MPI) library or annotate your program with OpenMP directives. We will be happy to help you parallelize your code if you wish. (Note that OpenMP is inherently limited to the size of a single node or SMP machine; most SHARCNET resources are clusters, so scaling beyond a single node requires MPI.)

Also, the preceding answer pertains only to the idea of running a single program faster using parallelism. Often, you might want to run many different configurations of your program, differing only in a set of input parameters. This is common when doing Monte Carlo simulation, for instance. It's usually best to start out doing this as a series of independent serial jobs. It is possible to implement this kind of loosely-coupled parallelism using MPI, but that is often less efficient and more difficult.
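
On Slurm systems, a common way to run such a set of independent serial jobs is a job array; below is a minimal sketch of a submission script (the resource values, params.txt file, and myprog program are placeholders):

 #!/bin/bash
 #SBATCH --time=01:00:00
 #SBATCH --mem=1G
 #SBATCH --array=1-100                                  # 100 independent serial tasks
 # each array task reads one line of params.txt as its input parameters
 PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
 ./myprog $PARAMS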

Where can I find available resources?

Information about available computational resources is available to the public as follows:

Can I find my job submission history?

Yes, for SHARCNET maintained legacy systems, you may review the history by logging in to your web account.

For national Compute Canada systems and systems running Slurm, you can see your job submission history from a specific date YYYY-MM-DD by running the following command:

 sacct --starttime YYYY-MM-DD --format=User,JobID%15,Jobname%25,partition%25,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist

where YYYY-MM-DD is replaced with the appropriate date.

How many jobs can I submit on one cluster?

Currently Graham has a limit of 1000 submitted jobs per user.

How are jobs scheduled?

Job scheduling is the mechanism which selects waiting ("queued") jobs to be started ("dispatched") on nodes in the cluster. On all of the major SHARCNET production clusters, resources are scheduled "exclusively", so that a job has complete access to the CPUs, GPUs, and memory it is currently running on (though it may be preempted during the course of its execution, as noted below). Details on how jobs are scheduled follow below.

How long will it take for my queued job to start?

On national Compute Canada systems and other systems running Slurm, you can see the estimated start time of your queued jobs by running:

 squeue --start -u USER

where USER is replaced with the username of the account that submitted the jobs.

What determines my job priority relative to other groups?

The priority of different jobs on the systems is ranked according to the usage by the entire group. This system is called Fairshare. More detail is available here.

Why did my job get suspended?

Sometimes your job may appear to be in a running state, yet nothing is happening and it is not producing the expected output. In this case the job has probably been suspended to allow another job to run briefly in its place.

Jobs are sometimes preempted (put into a suspended state) if another higher-priority job must be started. Normally, preemption happens only for "test" jobs, which are fairly short (always less than 1 hour). After being preempted, a job will be automatically resumed (and the intervening period is not counted as usage.)

On contributed systems, the PI who contributed equipment and their group have high-priority access and their jobs will preempt non-contributor jobs if there are no free processors.

My job cannot allocate memory

If you did not specify the amount of memory your job needs when you submitted the job, resubmit the job specifying the amount of memory it needs.

If you specified the amount of memory your job needed when it was submitted, then the memory requested was completely consumed. Resubmit your job with a larger memory request. (If this exceeds the memory available on the nodes, then you will have to make your job use less memory.)
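
On Slurm systems, the memory request goes in the job script; a minimal sketch (the 4G value is only an example, adjust it to what your job actually needs):

 #!/bin/bash
 #SBATCH --time=01:00:00
 #SBATCH --mem=4G        # total memory per node; use --mem-per-cpu to request memory per core
 ./myprog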

Some specific scheduling idiosyncrasies:

One problem with cluster scheduling is that, for a typical mix of job types (serial, threaded, various-sized MPI), the scheduler will rarely accumulate enough free CPUs at once to start any larger job. When a job completes, it frees N CPUs. If there is an N-CPU job queued (and of appropriate priority), it will be run. Frequently, jobs smaller than N will start instead. This may still give 100% utilization, but each of those jobs will complete, probably at different times, effectively fragmenting the N CPUs into several smaller sets. Only a period of idleness (a lack of queued smaller jobs) will allow enough CPUs to accumulate to let larger jobs run.

Note that clusters enforce runtime limits - if the job is still running at the end of the stated limit, it will be terminated. Note also that when a job is suspended (preempted), this runtime clock stops: suspended time doesn't count, so it really is a limit on "time spent running", not elapsed/wallclock time.

How do I run the same command on multiple clusters simultaneously?

If you're using bash and can log in with SSH authentication agent connection forwarding enabled (the -A flag; i.e. you've set up SSH keys; see Choosing_A_Password#Use_SSH_Keys_Instead.21 for a starting point), add the following environment variable and function to your ~/.bashrc shell configuration file:

~/.bashrc configuration: multiple cluster command
export SYSTEMS_I_NEED="graham.computecanada.ca orca.computecanada.ca"
 
function clusterExec {
  for clus in $SYSTEMS_I_NEED; do
    if ping -q -w 1 "$clus" &> /dev/null; then
      echo ">>> $clus:"; echo ""
      ssh "$clus" ". ~/.bashrc; $1"
    else
      echo ">>> $clus down"; echo ""
    fi
  done
}

You can select the relevant systems in the SYSTEMS_I_NEED environment variable.

To use this function, reset your shell environment (ie. log out and back in again), then run:

clusterExec uptime

You will see the uptime for each responding cluster login node; clusters that do not respond will be reported as down.

If you have old host keys (not sure why these should change...) then you'll have to clean out your ~/.ssh/known_hosts file and repopulate it with the new keys. If you suspect a problem contact an administrator for key validation or email help@sharcnet.ca. For more information see Knowledge_Base#SSH_tells_me_SOMEONE_IS_DOING_SOMETHING_NASTY.21.3F.

How do I load different modules on different clusters?

SHARCNET maintained systems provide the environment variables named:

  • $CLUSTER, which is the system's hostname (without sharcnet.ca or computecanada.ca), and
  • $CLU, which resolves to a three-character identifier unique to each system (typically the first three letters of the cluster's name).

You can use these in your ~/.bashrc to load certain software on a particular system but not others. For example, you can create a case statement in your ~/.bashrc shell configuration file based on the value of $CLU:

~/.bashrc configuration: loading different modules on different systems
case $CLU in
  orc)
    # load 2014.6 Intel compiler...
    module unload intel
    module load intel/2014.6
  ;;
  gra)
    # load 2018.3 Intel compiler...
    module load intel/2018.3
  ;;
  *)
    # This runs if nothing else matched.
  ;;
esac