Note: This page's content only applies to SHARCNET's legacy systems; it does not apply to Graham.
kraken
Hostname: kraken.sharcnet.ca
Target usage: serial and/or small threaded jobs
System information: see kraken system page in web portal
System status: see kraken status page
Real time system data: see Ganglia monitoring page
Full list of SHARCNET systems



System Overview

Please Note: Kraken is primarily intended to run serial or small threaded programs. While one can run MPI programs on kraken, this is not recommended because the system is heterogeneous. If you do wish to run MPI jobs on kraken, please ensure your environment is set correctly; see below: Network / MPI job considerations.

Spec           Info                                                        Remarks
Cores (CPU)    AMD Opteron 2.2GHz (275)                                    AMD Opteron 2.4GHz (bull 1-96)
Cores/node     4
Memory/node    8 GB                                                        32 GB (bull 1-96)
Interconnect   Myrinet 2g (gm) (narwhal nodes only) and Gigabit Ethernet
OS             CentOS 6
Max. Jobs      5000

For system notice/history, please visit the Kraken system page in the SHARCNET web portal.

System Access and User Environment

Login Nodes

[isaac@bul130:~] ulimit -aH
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 73728
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) 256000
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) 86400
max user processes              (-u) 100
virtual memory          (kbytes, -v) 3000000
file locks                      (-x) unlimited

Please note that login nodes are only to be used for short computations that do not require a lot of resources. To ensure this, some of the resource limits on login nodes have been set to low values. If you want to see your limits, please execute:

ulimit -a

In order to change your limits, please do:

ulimit -v 2000000

which sets the virtual memory limit to 2 GB (ulimit values are given in kilobytes).

Development nodes

# of development nodes ==> 5
hostnames ==> kraken-devel1, kraken-devel2,  ... kraken-devel5
Job submission scheduler ==> N/A
Max. running time ==> 1 cpu day

Logging In

First we log into kraken, then we log into a development node:

[snuser@localhost ~]$ ssh kraken.sharcnet.ca 

Welcome to Kraken, the SHARCNET amalgamated throughput/MPI cluster. <snip!>

[snuser@bul130:~]$ ssh kraken-devel1

Welcome to Kraken, the SHARCNET amalgamated throughput/MPI cluster. <snip!>

Compiling

We compile on the development node in exactly the same way we compile on the login node. The same modules are available, and the same commands for compiling are used (cc, mpicc, f90, mpif90 etc.). Please refer to our Knowledge Base for more details.

If compiling a large code base, building in the /tmp directory will be much faster. Just remember to move files from /tmp to somewhere in /home or /work once done, as the /tmp directory is regularly cleaned out. The /tmp directory is local to each development node; each node has its own /tmp, which is not visible from the other nodes.
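A rough sketch of that workflow (the project and file names are placeholders, and mpicc is just one of the compilers mentioned above):

 # copy the sources to the node-local /tmp, build there, then copy the result back out
 cp -r /home/$USER/myproject /tmp/$USER-myproject
 cd /tmp/$USER-myproject
 mpicc -O2 -o test.x test.c
 cp test.x /work/$USER/myproject/    # /tmp is cleaned out regularly, so copy results back promptly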

Running an MPI program on the development nodes

Suppose you have compiled, in the usual way, an MPI executable called test.x in your current directory. To run it with two processes on the development nodes kraken-devel1 and kraken-devel2, you would execute:

[snuser@bul132 ~]$ `which mpirun` -n 2 -host kraken-devel1,kraken-devel2 -mca btl tcp,sm,self ./test.x
bul132
bul133
           2           0
           2           1

This command should be launched from a development node.

Breaking this down:

  • `which mpirun`
    • specifies the full path to the mpirun binary from the module you currently have loaded. Without the full path it would not be found on the remote nodes, because remote shells at SHARCNET do not set up the module environment.
  • -n 2
    • specifies how many MPI processes to start (round-robin distribution across nodes by default)
  • -host kraken-devel1,kraken-devel2
    • specifies which development nodes the job should use, as a comma-separated list. One may also set up a hostfile (see the sketch after this list, and man mpirun for further MPI process layout possibilities).
  • -mca btl tcp,sm,self
    • tells Open MPI to use only the TCP (ethernet), shared memory, and self byte transfer layer (BTL) components. At present the devel nodes only have gigabit ethernet, not Myrinet, and produce a spurious MX warning with the default BTLs, so this option is included only to avoid that warning.
  • ./test.x
    • the path to your program; the explicit ./ prefix (or a full path) ensures it is found on the remote nodes rather than relying on PATH
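A sketch of the hostfile alternative mentioned above (the file name myhosts and the slot counts are illustrative; see man mpirun for the exact syntax supported by your Open MPI version):

 # ./myhosts: one development node per line, 4 slots each (the devel nodes have 4 cores)
 kraken-devel1 slots=4
 kraken-devel2 slots=4

 [snuser@bul132 ~]$ `which mpirun` -n 8 -hostfile ./myhosts -mca btl tcp,sm,self ./test.x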

Checking to see how busy the development nodes are

One can look at how much memory is allocated and how busy the node is with the free and uptime commands:

[snuser@bul132:~] free
             total       used       free     shared    buffers     cached
Mem:       8174704     596988    7577716          0     232432     223548
-/+ buffers/cache:     141008    8033696
Swap:      8016120          0    8016120
[snuser@bul132:~] uptime
11:18:27 up 10 days, 19:18,  2 users,  load average: 0.08, 0.06, 0.01

The free command shows a number of values; the important one is the value in the "free" column on the row beginning with "-/+ buffers/cache" (8033696 in the example above). This is the amount of free memory including memory temporarily set aside for buffers and caching that can be evicted, so it is a better measure of how much is actually available for processes to use. If this value is significantly less than the value in the "total" column of the "Mem:" row, the node is already using a lot of memory and may not have enough for your purposes. To inspect the free memory on all of the development nodes, one can run:

 pdsh -w kraken-devel[1-5] free | grep 'buffers\/' | awk '{print $1,$NF}' | sort -n -r -k 2
[snuser@bul130:~] pdsh -w kraken-devel[1] free | grep 'buffers\/' | awk '{print $1,$NF}' | sort -n -r -k 2
kraken-devel1: 8043824

The uptime command shows a number of values; the important ones are the "load average:" numbers. These show how many processes are in the run queue (state R) or waiting for disk I/O (state D), averaged over 1, 5, and 15 minutes. If these numbers are close to, or more than, the number of processing cores (4 in the case of the present kraken development nodes), you should probably pick a different node to work on. To inspect the 15-minute load average on all the development nodes, one can run the following command and then pick the node with the least load:

 pdsh -w kraken-devel[1-5] uptime | awk '{print $1,$NF}' | sort -n -k 2
[snuser@bul130:~] pdsh -w kraken-devel[1] uptime | awk '{print $1,$NF}' | sort -n -k 2
kraken-devel1: 0.00

Storage

Kraken no longer has /scratch storage; all of its constituent /scratch filesystems were decommissioned in November 2013. Users should use /work for job input/output on kraken.
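For example (the directory names below are only illustrative), keep job files in a directory under /work and point the job's output there:

 mkdir -p /work/$USER/myrun
 cd /work/$USER/myrun
 sqsub -r 1h -o /work/$USER/myrun/out.log ./myprog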

Submitting Jobs

Memory

kraken nodes have different amounts of memory per core.

  • the default memory allocation per job process is 1GB
    • Users must change this with the sqsub --mpp flag if they'd like more memory per process for their jobs
    • Beware: MPI jobs are scheduled to the Myrinet sub-clusters by default. If you specify more than 8GB of memory per process, your MPI job will never start unless you also specify '-f eth' so that the job is assigned to the bull sub-cluster (see the example after this list).
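As a sketch of these flags (process counts, runtimes, and file names are placeholders; check sqsub --help for the exact memory-size syntax), a serial job needing 2GB per process and an MPI job needing more than 8GB per process might be submitted as:

 sqsub -r 4h --mpp=2G -o out.log ./serial_prog
 sqsub -q mpi -n 8 -r 4h --mpp=16G -f eth -o out.log ./mpi_prog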

Network / MPI job considerations

Different parts of kraken (the old sub-clusters) have isolated HPC networks: the prior PoP sub-clusters and Bull are connected only via gigabit ethernet, while Narwhal has a high-performance Myrinet interconnect. In general we do not recommend running production MPI jobs on kraken; the newer clusters are better suited for this work. One may, however, wish to do development or run small MPI jobs on kraken.

MPI jobs that are network-sensitive should be compiled and run according to the following:

  1. If you are using the default software modules, you must switch from the default OpenMPI 1.6.2 module to the newer 1.6.2_1 build so that your program can take advantage of the Myrinet network on Narwhal:
    • module unload openmpi/intel/1.6.2; module load openmpi/intel/1.6.2_1
    • failing to do so will result in your program using the slower ethernet network
  2. When submitting your job, you must specify the following sqsub options:
    • sqsub -q mpi -f myri ...
    • this will ensure that your job is scheduled to run on nodes within a Myrinet-connected sub-cluster (only Narwhal remains). This should be used for all network-sensitive MPI jobs.
    • alternatively, if you request "-f eth", your job will only run on non-Narwhal nodes (the prior PoP clusters and Bull). A complete submission sketch for the Myrinet case follows this list.
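Putting the two steps together, a network-sensitive MPI job might be rebuilt and submitted roughly as follows (the process count, runtime, and file names are placeholders):

 module unload openmpi/intel/1.6.2
 module load openmpi/intel/1.6.2_1
 mpicc -O2 -o test.x test.c                     # rebuild against the newer Open MPI module
 sqsub -q mpi -f myri -n 16 -r 4h -o test.log ./test.x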

Sub-clusters

Kraken is composed of older, smaller sub-clusters, and the nodes are named for those older sub-clusters. To request nodes from a particular sub-cluster, use the first three letters of the sub-cluster name (see the table below). For example, to request Narwhal nodes, use "-f nar". This is normally not required, as there is no longer local storage for each of the sub-clusters and jobs are dispatched according to their resource request (i.e. big memory jobs go to the old bull sub-cluster, jobs requesting Myrinet go to the old narwhal sub-cluster).
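For instance, to pin a job to the old Narwhal nodes explicitly (the other sqsub options shown are placeholders):

 sqsub -q mpi -f nar -n 8 -r 2h -o out.log ./test.x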

The following old clusters are now sub-clusters of Kraken:

subcluster   subcluster prefix        subcluster type   nodes   cores/node   memory/node (GB)   interconnect
Tiger        bul (used to be "tig")   PoP               32      4            8                  ethernet
Megaladon    bul (used to be "meg")   PoP               32      4            8                  ethernet
Bruce        bul (used to be "bru")   PoP               32      4            8                  ethernet
Bala         bul (used to be "bal")   PoP               32      4            8                  ethernet
Zebra        bul (used to be "zeb")   PoP               32      4            8                  ethernet
Bull         bul                      big memory        96      4            32                 ethernet
Narwhal      nar                      utility           267     4            8                  myrinet

The Dolphin sub-cluster ("dol" nodes) was decommissioned.