
What is SHARCNET?

SHARCNET is a consortium of Canadian academic institutions that share a network of high-performance computers. This infrastructure enables world-class academic research.

  • Member of Compute Canada
  • 18 institutions across Ontario
  • Over 3,000 Canadian and international users
  • Over 20,000 processors
  • 150 Tesla GPUs
  • 10 Gb/s links between compute facilities
  • Single sign-up: the same username and home directory on every system
  • Free to use

Suitable uses

Compute-Intensive Problems

  • The resources are provided to enable HPC and are not intended as a replacement for a researcher's desktop or lab machines
  • SHARCNET users can productively conduct HPC research on a variety of SHARCNET systems each optimally designed for specific HPC tasks

Academic HPC Research

  • The research can be business-related, but must be done in collaboration with an academic researcher

Fair Access

  • Users have access to all systems
  • Each cluster is designed for certain types of jobs
  • Jobs run in batch mode under a scheduling system with fair-share policies

Facilities and resources

Computational resources

Computational systems (intended use: computational research)
  • capacity clusters (thousands of CPUs for serial/threaded processing)
  • capability clusters (fast interconnects for demanding MPI)
  • SMPs (Altix systems and large-memory server clusters)
  • accelerated clusters and systems (Cell, GPU, FPGA)

Visualization systems (intended use: data visualization)
  • parallel rendering cluster
  • visualization stations
  • projection systems

Facilities: intended use

This listing only includes conventional clusters that are intended for production work with serial/threaded/MPI applications. For an expanded listing of all systems (including specialty clusters) see Hardware/System_Resources. An explanation of which system to use, depending on your application requirements, is also available in the documentation.

Cluster              CPUs  RAM/node   Interconnect     Intended Use
orca (Capability)    8320  32 GB      InfiniBand       large MPI, threaded (n<=24), devel nodes
saw (Capability)     2688  16 GB      InfiniBand       medium/large MPI, threaded (n<=8), devel nodes
requin (Capability)  1536  8 GB       Quadrics         medium/large MPI, test queue
kraken               2232  4/8/32 GB  Myrinet 2g (GM)  serial farming, small/medium MPI/threaded, devel nodes
monk                 432   48 GB      InfiniBand       GPGPU (108 GPUs)
hound                496   128 GB     InfiniBand       large memory, threaded (n<=32)

Online resources

  • SHARCNET's web site provides extensive information about deployed systems, the software stack, and help resources.

Computing environment

Environment                       Description
Systems                           clusters, SMPs, GPGPU, visualization systems
Operating systems                 64-bit Linux (mainly CentOS, HP XC, Fedora)
Languages                         Fortran, C/C++, Java, MATLAB, etc.
Unified compilation environment   cc, c++, f77/f90, mpicc, mpic++, mpif77, mpif90
Key parallel development support  MPI, POSIX threads, OpenMP, CUDA, OpenACC
Software modules                  select software (and versions) with the 'module' command
Batch scheduling                  sq: the SHARCNET unified batch execution environment
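
For example, a typical session might use the module command to inspect and adjust the available software (the module names below are illustrative; run 'module avail' to see what is actually installed on a given system):

 module list               # show the modules currently loaded in your environment
 module avail              # list all software modules available on this system
 module unload intel       # unload a module (illustrative name)
 module load gcc           # load a different compiler module (illustrative name)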

Account and login

Getting an account

Accessing SHARCNET computational systems

Your login credentials are used to access both our web portal and all of our computational systems.

  • UNIX-based way:
    • Log in to a system via SSH; you will see a familiar UNIX environment.
    • Edit source code and/or change the input data and configuration file(s).
    • Compile the source code.
    • Submit a program (or many) to the batch queuing system.
    • Check the results later.

You may find more details in SSH for Linux and Mac users.
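
As a rough sketch, a minimal session following these steps might look as follows (the cluster, user, file names and run time are illustrative):

 ssh isaac@orca.sharcnet.ca                     # log in to a cluster via SSH
 cd /work/$USER/myproject                       # work from /work, not /home
 nano simula.c                                  # edit source code or input files
 cc simula.c -o simula                          # compile with the unified wrapper
 sqsub -q serial -r 1h -o simula.out ./simula   # submit to the batch queuing system
 sqjobs                                         # check on the job later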

  • MS-Windows based way:
    • Start a client application such as PuTTY, WinSCP or MobaXterm from your desktop.
    • Computational tasks and data are bundled with metadata and sent to the remote compute engines under the hood. Tasks (aka jobs) are scheduled; during this period, your application waits.
    • Once the computational tasks are complete, the results are sent back through your application.

You may find more details in SSH for MS-Windows users.

Compiling programs

SHARCNET provides a unified compilation environment that chooses the right underlying compiler, options and libraries for you. Always use these wrapper commands unless you have a specific reason not to.

Command        Language       Extensions                  Example
cc             C              .c                          cc code.c -o code.exe
CC, c++, cxx   C++            .C, .cc, .cpp, .cxx, .c++   CC code.cpp -o code.exe
f77            Fortran 77     .f, .F, .f77                f77 fcode.f -o fcode.exe
f90/f95        Fortran 90/95  .f90, .f95, .F90, .F95      f90 fcode.f90 -o fcode.exe
mpicc          C              .c                          mpicc mpicode.c -o mpicode.exe
mpiCC          C++            .C, .cc, .cpp, .cxx, .c++   mpiCC mpicode.cc -o mpicode.exe
mpif77         Fortran 77     .f, .F, .f77                mpif77 mpicode.f -o mpicode.exe
mpif90/mpif95  Fortran 90/95  .f90, .f95, .F90, .F95      mpif90 mpicode.f90 -o mpicode.exe

You may take a look at Getting Started with Compiling Code on SHARCNET for more details.
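
For instance, building both a serial C program and an MPI Fortran program might look like this (file names and optimization flags are illustrative; extra flags are passed through to the underlying compiler):

 cc -O2 code.c -o code.exe                # serial C program, optimized
 mpif90 -O2 mpicode.f90 -o mpicode.exe    # Fortran 90 MPI program, optimized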

File system basics

Pool            Quota  Expiry    Access       Purpose
/home           10 GB  none      unified      source code and small configuration files; backed up regularly
/work           1 TB   none      unified      active data files; auto-mounted, convenient
/scratch        none   2 months  per-cluster  temporary files, checkpoints; best performance
/tmp            none   post-job  per-node     node-local scratch, caching
/freezer/$USER  2 TB   2 years   login nodes  long-term data archive
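
As a sketch of how these pools fit together during a typical project (directory and file names are illustrative):

 cp ~/simula.c /work/$USER/myproject/          # keep active source and data in /work
 mkdir /scratch/$USER/run1                     # use /scratch for fast, temporary run output
 tar czf results.tar.gz /scratch/$USER/run1    # compress results when a run is done
 mv results.tar.gz /freezer/$USER/             # archive long-term data (2-year expiry)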

User certification

Level Number of CPUs Maximum Runtime Limit
0 8 24 hours
1 256 168 hours

Note: If you need resources beyond the limit, contact us to see what options are available (e.g. NRAC).

Scheduling Jobs

Running jobs

  • Log on to the desired system, then:
    • Ensure files are in /scratch/$USER or /work/$USER (Do not run a job out of /home)
    • Jobs are submitted using the sqsub command:
  sqsub -q queue_name -r run_time [ additional_sq_options ] your_program [ your_args ]

You may find more details on the Compiling and Running Programs page.
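
Putting these steps together, a minimal serial submission from /work might look like this (directory, program name and run time are illustrative):

 cd /work/$USER/myproject
 sqsub -q serial -r 2h -o simula.out ./simula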

Choosing the right queue

Job Type  Queue Name                   CPUs                                  Nodes
Parallel  mpi                          2 or more                             single or multiple nodes
Parallel  threaded (system dependent)  8 (saw), 24 (orca), 16 or 32 (hound)  single node
Serial    serial                       1                                     single node
GPU       gpu                          1 or more                             single or multiple nodes

Note: The gpu queue is only available on systems that have GPUs installed. For a list of those, plus instructions on how to submit to the gpu queue, see GPU Accelerated Computing
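
As a hedged sketch, a GPU submission on such a system might look like the following; the --gpp flag (GPUs per process) is an assumption based on typical sqsub usage, so check sqsub --help and the GPU Accelerated Computing page for the exact options on your system:

 sqsub -q gpu -r 1d --gpp=1 -o gpu_run.out ./my_gpu_prog   # request 1 GPU per process (flag assumed)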

You may notice other queues not listed above, such as staff, gaussian, etc. These are special queues with restrictions and are not publicly available.

Using sq commands

In the following set of examples, the first example in each section submits the job to the test queue, the second specifies an output file, and the third specifies both an output and input file:

  • Submitting serial jobs
 sqsub -r 7d -q serial --test ./simula
 sqsub -r 7d -q serial -o simula.out ./simula
 sqsub -r 7d -q serial -o simula.out -i simula.in ./simula
  • Submitting parallel jobs (in this case, a 24 cpu job)
 sqsub -r 7d -q mpi --test -n 24 ./simula
 sqsub -r 7d -q mpi -n 24 -o simula.out ./simula
 sqsub -r 7d -q mpi -n 24 -o simula.out -i simula.in ./simula
  • Submitting a parallel job with request on process distribution (24 cpus on 6 nodes)
 sqsub -r 7d -q mpi --test -n 24 -N 6 ./simula
 sqsub -r 7d -q mpi -n 24 -N 6 -o simula.out ./simula
 sqsub -r 7d -q mpi -n 24 -N 6 -o simula.out -i simula.in ./simula

Useful commands

  • sqjobs - list the status of submitted jobs
 [isaac@nar316 ~]$ sqjobs
  jobid user queue state ncpus time command      
 ------ ---- ----- ----- ----- ---- -------------
 136671  isaac   mpi     R    20   0s ./my_mpi_prog
 1060 CPUs total, 30 idle, 1030 busy; 43 jobs running; 16 suspended, 12 queued.
  • sqkill - kill a job in the queue that you want to stop
 [isaac@nar316 ~]$ sqjobs
  jobid user queue state ncpus time command      
 ------ ---- ----- ----- ----- ---- -------------
 136672  isaac   mpi     Q     1   0s ./my_mpi_prog
 1060 CPUs total, 50 idle, 1010 busy; 42 jobs running; 16 suspended, 13 queued.
 [isaac@nar316 ~]$ sqkill 136672
 Job <136672> is being terminated

You may find more details about monitoring your queued jobs on the Monitoring_Jobs page.

Common mistakes to avoid

  • Do not run significant programs on login nodes, or directly on compute nodes outside the batch system
  • Do not routinely request the maximum run time of 7 days, or more memory than your job requires
  • Do not run unoptimized code
  • Do not create millions of tiny files, or large amounts (gigabytes) of uncompressed (e.g. ASCII) output; see the example below
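
For example, rather than leaving thousands of small uncompressed output files in place, you might bundle and compress them once a run completes (file names are illustrative):

 tar czf run1_output.tar.gz run1/*.dat   # one compressed archive instead of many small files
 rm run1/*.dat                           # remove originals only after verifying the archive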

Support

SHARCNET Personnel Listing and Contact Information

People

  • HPC Consultants
    • User-facing point of contact
    • Analysis of research requirements
    • Development support and job performance analysis
    • Training and education
    • Project and technical computing consultations
  • System Administrators
    • User accounts
    • System software
    • Hardware and software maintenance

Where to look for information

Problem ticket system

Use the Problem Ticket system in the web portal to submit a problem ticket, or simply email help@sharcnet.ca.
