Target usage: Low latency parallel jobs
System information: see the Orca system page in the web portal
System status: see the Orca status page
Real-time system data: see the Ganglia monitoring page
Full list of SHARCNET systems

System Overview

Node Range       | 1-320                      | 321-360                      | 361-388                      | 389-392
CPU Model        | AMD Opteron 2.2 GHz (6174) | Intel Xeon 2.6 GHz (E5-2670) | Intel Xeon 2.7 GHz (E5-2680) | Intel Xeon 2.7 GHz (E5-2680)
Cores/node       | 24                         | 16                           | 16                           | 16
Memory/node      | 32 GB                      | 32 GB                        | 64 GB                        | 128 GB
Interconnect     | QDR InfiniBand, 2:1 blocking (all nodes)
/scratch Storage | 492 TB (cluster-wide)
OS               | CentOS 6.4 (all nodes)
Max. Jobs        | 5000

For system notices and history, please visit the Orca system page in the SHARCNET web portal.

System Access and User Environment

Login Nodes

Orca has a pair of login nodes that you are directed to when you ssh to orca.sharcnet.ca. These are intended to provide a stable point of access to the cluster. For any resource-intensive interactive work, please use the development nodes (see below).

Processes are limited to 1 CPU-hour and 2 GiB of virtual memory on the login nodes.
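To see the limits in effect in your current login shell (assuming they are applied through the standard shell limits), you can check them with ulimit; for example:

 ulimit -t    # CPU time limit, in seconds
 ulimit -v    # virtual memory limit, in kilobytes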

Development nodes

There are 4 development nodes on orca: orc-dev1, orc-dev2, orc-dev3, orc-dev4

Programs are allowed to run on the development nodes for up to 3 CPU-days. These nodes are a convenient place to compile and test code.

Logging In

First we log into orca, then we log into a development node:

[snuser@localhost ~]$ ssh orca.sharcnet.ca

Welcome to the SHARCNET cluster Orca.... <snip!>

[snuser@orc-login2:~]$ ssh orc-dev1

Welcome to the SHARCNET cluster Orca. ... <snip!>


The development nodes have the same software environment as the login nodes - the same modules are available, and the same commands for compiling are used (cc, mpicc, f90, mpif90 etc.). Please refer to our Knowledge Base for more details.
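For example, a simple MPI source file (the file name hello_mpi.c here is just a placeholder) could be compiled on a development node with:

 mpicc -O2 -o test.x hello_mpi.c     # use mpif90 for Fortran sources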

When compiling a large code, using the /tmp directory will be faster. Remember to move files from /tmp to somewhere in /home or /work once done, as the /tmp directory is regularly cleaned out. The /tmp directory is local to each development node; in other words, each node has its own /tmp, which is not visible from the other nodes.
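A sketch of that workflow (the project and program names below are placeholders):

 mkdir -p /tmp/$USER && cp -r ~/myproject /tmp/$USER/    # copy sources to node-local /tmp
 cd /tmp/$USER/myproject && make                         # build on the fast local disk
 cp myprogram /work/$USER/                               # copy the result back before /tmp is cleaned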

Running a parallel job

Suppose you have compiled, in the usual way, an MPI parallel executable called test.x located in your current directory. To run it with two processes on the development nodes orc-dev1 and orc-dev2, you would execute:

`which mpirun` -n 2 -host orc-dev1,orc-dev2 ./test.x

This command should be launched from the development node.

Breaking this down:

  • `which mpirun`
    • specifies the full path to the mpirun binary you presently have loaded via a module. It will not be found on remote nodes due to the way remote shells are implemented at SHARCNET (no environment is set up).
  • -n 2
    • specifies how many MPI processes to start (round-robin distribution across nodes by default)
  • -host orc-dev1,orc-dev2
    • specifies which development nodes the job should use, as a comma-separated list. One may also set up a hostfile (a minimal sketch is shown after this list); see man mpirun for further examples and MPI process layout possibilities.
  • ./test.x
    • the path to your program, given explicitly (relative to the current directory, or as a full path) so that it is found on the remote nodes

Note: never specify any nodes other than the four development nodes in the -host option.
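As noted above, a hostfile can be used instead of -host. A minimal sketch, assuming the default MPI module is Open MPI and using an illustrative hostfile named myhosts containing:

orc-dev1 slots=2
orc-dev2 slots=2

the job could then be launched with:

 `which mpirun` -n 4 -hostfile myhosts ./test.x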

Checking to see how busy the development nodes are

One can look at how much memory is allocated and how busy the node is with the free and uptime commands:

[snuser@orc129:~/tests/test_mpi_hello] free
            total       used       free     shared    buffers     cached
Mem:      32958080     768048   32190032          0       1304     276276
-/+ buffers/cache:     490468   32467612
Swap:     31999988          0   31999988
[snuser@orc129:~/tests/test_mpi_hello] uptime
 16:03:33 up 5 days,  1:22,  1 user,  load average: 0.07, 0.02, 0.00

The free command reports several values (in kilobytes by default); the important one is the value in the "free" column of the row beginning with "-/+ buffers/cache" (32467612 in the example above). This is the amount of free memory including memory that is temporarily set aside for buffers and caching and which can be evicted; in other words, it is a better measure of how much memory is available for processes to use. If this value is significantly less than the value listed in the "total" column of the "Mem:" row, then the node is already using a lot of memory and may not have enough for your purposes. To inspect the free memory on all of the development nodes, one can run:

 pdsh -w orc-dev[1-4] free | grep 'buffers\/' | awk '{print $1,$NF}' | sort -n -r -k 2
[isaac@orc129:~] pdsh -w orc-dev[1-4] free | grep 'buffers\/' | awk '{print $1,$NF}' | sort -n -r -k 2
orc-dev1: 31262240
orc-dev2: 30182316
orc-dev3: 30013940
orc-dev4: 29457716

The uptime command also shows a number of values; the important ones are the "load average:" numbers. These show how many processes are in the run queue (state R) or waiting for disk I/O (state D), averaged over 1, 5, and 15 minutes. If these numbers are close to, or more than, the number of processing cores on the node, then you should probably pick a different node to work on. To inspect the 15-minute load average on all the development nodes, one can run the following command, and then pick the node with the least load:

 pdsh -w orc-dev[1-4] uptime | awk '{print $1,$NF}' | sort -n -k 2
[isaac@orc129:~] pdsh -w orc-dev[1-4] uptime | awk '{print $1,$NF}' | sort -n -k 2
orc-dev3: 0.00
orc-dev4: 0.00
orc-dev1: 0.01
orc-dev2: 0.03

Debugging with DDT

On Orca, the test queue is currently disabled. We are working to fix it, but for now DDT should only be run on the orca development nodes. The following instructions are for the newest (default) ddt module on orca:

  • Log in to orca, then log in to one of the four development nodes: orc-dev1, orc-dev2, orc-dev3, orc-dev4. First check whether the node is too busy by running the top command.
  • Load ddt module:
module load ddt
  • If you have not run DDT on orca since the system was upgraded, you should delete the old DDT settings file:
rm -R ~/.ddt2
  • Now you can launch DDT interactively, in the usual manner:
ddt  path_to_executable  program_arguments
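For example, to debug the test.x program from the earlier section (the argument 100 here is purely illustrative):

ddt ./test.x 100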


Each node contains anywhere from 100 to 400 GB of local storage, which can be accessed as /tmp. The orca /scratch filesystem (which one would access as /scratch/$USER) is 492 TB in size.
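To see how much space is currently available in the node-local /tmp and in /scratch, one can run the following (output will vary from node to node):

 df -h /tmp /scratch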

Submitting Jobs

In general, users will experience the best performance on orca by ensuring that their jobs use whole nodes. Measurements have shown that MPI jobs sharing nodes with other jobs slow down, depending on the degree of resource contention. Please remember that the tradeoff of requesting whole nodes is that jobs typically wait longer in the queue, since it takes longer to find whole nodes free of other processes.

On the main part of orca (nodes 1 to 320), threaded jobs should use 24 cores and MPI jobs should use multiples of 24 cores in order to occupy whole nodes. Threaded jobs that do not scale well to 24 cores should probably be run on saw, hound or kraken instead of orca.
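For instance, a whole-node threaded job on the Opteron nodes might be submitted as follows (the runtime estimate, output file, and program name are placeholders to adjust for your own job):

 sqsub -q threaded -n 24 -r 1d -o threaded_test.out ./my_threaded_app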

When submitting MPI jobs that are network-sensitive, one should use the -N and -n flags to ensure the job is scheduled onto whole nodes. For example, if your program is going to use 240 processes, one would submit it as:

sqsub -q mpi -n 240 -N 10 <...>

It is important to include -N 10 to ensure the job is not scattered across nodes where other users' jobs are running. Equivalently, "-n 240 --ppn 24" can be used.
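For instance, the following two submissions request the same whole-node layout (the runtime estimate, output file, and program name are placeholders):

 sqsub -q mpi -n 240 -N 10 -r 7d -o test240.out ./test.x
 sqsub -q mpi -n 240 --ppn 24 -r 7d -o test240.out ./test.x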

If your job is not communication sensitive then you can submit it without the -N flag and it will typically start faster as it can run in gaps on nodes that are partially full.

On the Xeon nodes (orc321 and up) the same argument applies, except that each node has 16 cores instead of 24.


A first expansion was added to orca consisting of 40 nodes with Intel Sandy Bridge architecture Xeon processors (E5-2670). These are nodes orc321-360 and each has 16 cores and 32 GB of memory.

A second expansion followed with 32 nodes, also with Sandy Bridge architecture Xeon processors (E5-2680). Of these, nodes orc361-388 have 64 GB of memory, while nodes orc389-392 have 128 GB. This second expansion is contributed hardware, so jobs submitted via the kglamb queue have higher priority on it. Further, jobs submitted by regular users must specify a runtime limit of 4 hours or less ( sqsub -r 4h ... ) to be eligible for these nodes (as opposed to the standard runtime limit of 7 days).
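For example, a regular user's job that remains eligible for these contributed nodes (as well as the other Xeon nodes) might be submitted as follows (the -f xeon flag is explained below; the process count, output file, and program name are placeholders):

 sqsub -q mpi -f xeon -n 32 -N 2 -r 4h -o short.out ./test.x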

Because these nodes have different performance than the older Opteron nodes, MPI jobs are not allowed to span both node types. If you want to submit a job to run only on the older nodes, the sqsub flag would be:

sqsub -q mpi -f opteron

To submit a job to the new Sandy Bridge nodes, you would use flag:

sqsub -q mpi -f xeon

A job submitted without specifying the flag, with just:

sqsub -q mpi

will go to whichever node set is available first.

NOTE: Some jobs are likely to run much faster on the newer Sandy Bridge (xeon) nodes. Please keep this in mind when estimating runtimes. If submitting without the -f flag, please specify a high enough runtime estimate that the job will have time to complete if it runs on the slower nodes.