Revision as of 14:39, 3 October 2018

Note: Some of the information on this page is for our legacy systems only. The page is scheduled for an update to make it applicable to Graham.

Programming and Debugging

What is MPI?

MPI stands for Message Passing Interface, a standard for writing portable parallel programs which is well-accepted in the scientific computing community. MPI is implemented as a library of subroutines which is layered on top of a network interface. The MPI standard has provided both C/C++ and Fortran interfaces so all of these languages can use MPI. There are several MPI implementations, including OpenMPI and MPICH. Specific high-performance interconnect vendors also provide their own libraries - usually a version of MPICH layered on an interconnect-specific hardware library.

For an introduction to MPI programming, refer to the MPI tutorial.

In addition to C/C++ and Fortran versions of MPI, there exist other language bindings as well. If you have any special needs, please contact us.

What is OpenMP?

OpenMP is a standard for programming shared memory systems using threads, with compiler directives inserted in the source code. It provides a higher-level approach to utilizing multiple processors within a single machine while keeping the structure of the source code as close to the conventional form as possible. OpenMP is much easier to use than the alternative (Pthreads) and thus is suitable for adding modest amounts of parallelism to pre-existing code. Because OpenMP is a set of pragmas (compiler directives), your code can still be compiled by a serial compiler and should still behave the same.

OpenMP for C/C++ and Fortran is supported by many compilers, including the PathScale and PGI compilers for Opterons, and the Intel compilers for IA32 and IA64 (such as SGI's Altix). OpenMP support has been provided in the GNU compiler suite since v4.2 (OpenMP 2.5), and starting with v4.4 it supports the OpenMP 3.0 standard.

How do I run an OpenMP program with multiple threads?

An OpenMP program uses a single process with multiple threads rather than multiple processes. On SMP systems, threads are scheduled on available processors and thus run concurrently. For each thread to run on its own processor, request the same number of CPUs as the number of threads. To run an OpenMP program foo that uses four threads with the sqsub command, use the following:

sqsub -q threaded -n 4 -r 5m ./foo

The option -n 4 reserves 4 CPUs for the process.

For a basic OpenMP tutorial, refer to the OpenMP tutorial.

How do I measure the CPU time when running a multi-threaded job?

The easiest solution is to use a simple benchmarking script, called "time.sh" here (it should be made executable with "chmod u+x time.sh"):

#!/bin/bash
# Run the given command under /usr/bin/time; "$@" keeps quoted arguments intact.
/usr/bin/time "$@"

Place it in the same directory as your code binary, and insert "./time.sh" before the binary name in your sqsub command, e.g.

sqsub -q threaded -n8 -r2m -o out2 ./time.sh ./code code_arguments ...

This has been tested with a simple threaded application, and it does work with multiple threads. The output looks like this:

494.62user 0.98system 1:02.23elapsed 796%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+515minor)pagefaults 0swaps

As you can see, CPU cycles from all 8 threads were counted (796%CPU).
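The %CPU figure is the ratio of CPU time (user plus system) to elapsed wall-clock time. As a minimal sketch, the Python snippet below recomputes it from a summary line like the one above. The function name cpu_utilization and the regular expression are illustrative assumptions, and the parser only handles the default output format shown here.

```python
import re

# Matches the default summary line of /usr/bin/time, e.g.
# "494.62user 0.98system 1:02.23elapsed 796%CPU ..."
_TIME_RE = re.compile(
    r"([\d.]+)user\s+([\d.]+)system\s+(?:(\d+):)?(\d+):([\d.]+)elapsed"
)

def cpu_utilization(time_line):
    """Return 100 * (user + system) / elapsed for a time summary line."""
    match = _TIME_RE.search(time_line)
    if match is None:
        raise ValueError("not a recognised /usr/bin/time summary line")
    user, system = float(match.group(1)), float(match.group(2))
    hours = int(match.group(3) or 0)   # the hours field appears only for long runs
    minutes = int(match.group(4))
    seconds = float(match.group(5))
    elapsed = hours * 3600 + minutes * 60 + seconds
    return 100.0 * (user + system) / elapsed

sample = "494.62user 0.98system 1:02.23elapsed 796%CPU"
print(f"{cpu_utilization(sample):.1f}% CPU")
```

A value close to 100 times the number of requested threads suggests that all threads were busy for most of the run; a much lower value suggests that some threads spent time idle or waiting.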

What mathematics libraries are available?

Every system has the basic linear algebra libraries BLAS and LAPACK installed. Normally, these interfaces are contained in vendor-tuned libraries. On Intel-based (Xeon) clusters it's probably best to use the Intel math kernel library (MKL). On Opteron-based clusters, AMD's ACML library is available. However, either library will work reasonably well on both types of systems. If one expects to do a large amount of computation, it is generally advisable to benchmark both libraries so that one selects the one offering best performance for a given problem and system.

The GNU Scientific Library (GSL) may also be useful for some needs. It is an optional package, available on any machine.

For a detailed list of libraries on each cluster, please check the documentation on the corresponding SHARCNET satellite web sites.

How do I use mathematics libraries such as BLAS and LAPACK routines?

First you need to know which subroutine you want to use. Check the library references to find which routines meet your needs. Then place calls to those routines in your program and compile your program against the libraries that provide them. For instance, if you want to compute the eigenvalues, and optionally the eigenvectors, of an N by N real nonsymmetric matrix in double precision, you will find that the LAPACK routine DGEEV does that. All you need to do is call DGEEV with the required parameters as specified in the LAPACK documentation, and compile your program to link against the LAPACK library.

To compile the program, you need to link it to a library that contains the LAPACK routines you call in your code. On SHARCNET a number of high quality libraries are available for you to use. The general recommendation is to use Intel's MKL library, which has a module loaded by default on most SHARCNET systems. Another popular option is the ACML library. The instructions on how to link your code with these libraries at compile time are provided on the MKL page and the ACML page.


My code is written in C/C++, can I still use those libraries?

Yes. Most of the libraries have C interfaces. If you are not sure about the C interface or you need assistance in using those libraries written in Fortran, we can help you out on a case-by-case basis.

What packages are available?

Various packages have been installed on Compute Canada/SHARCNET clusters at users' requests. The full, up-to-date list is available on the Compute Canada documentation wiki (https://docs.computecanada.ca/wiki/Modules). If you do not see a package that you need on this list, please request it by submitting a problem ticket.

You can also search this wiki or the Compute Canada wiki for the package you are interested in to see if there is any additional information about it available.

We can also help you with compiling/installing a package into your own file space if you prefer that approach.

What interconnects are used on SHARCNET clusters?

Currently, several different interconnects are being used on SHARCNET clusters: Quadrics, Myrinet, InfiniBand and standard IP-based Ethernet.

Debugging serial and parallel programs

A debugger is a program which helps to identify mistakes ("bugs") in programs, either at run time or "post-mortem" (by analyzing the core file produced by a crashed program). Debuggers can be either command-line or GUI (graphical user interface) based. Before a program can be debugged, it needs to be (re-)compiled with the switch -g, which tells the compiler to include symbolic information in the executable. For MPI problems on the HP XC clusters, -ldmpi includes the HP MPI diagnostic library, which is very helpful for discovering incorrect use of the API.

SHARCNET highly recommends using our commercial debugger DDT. It has a very friendly GUI, and can also be used for debugging serial, threaded, and MPI programs. A short description of DDT and cluster availability information can be found on its software page. Please also refer to our detailed Parallel Debugging with DDT tutorial.

SHARCNET also provides gdb (installed on all clusters, type "man gdb" to get a list of options and see our Common Bugs and Debugging with gdb tutorial).


What is NaN?

NaN stands for "Not a Number". It is an undefined or unrepresentable value, typically encountered in floating point arithmetic (e.g. the square root of a negative number). To debug this in your program one typically has to unmask or trap floating point exceptions. This is fairly straightforward with Fortran compilers (e.g. with Intel's ifort one simply needs to add one switch, "-fpe0"), but somewhat more complicated with C/C++ codes, where the best solution is to use the feenableexcept() function. There are further details in the Common Bugs and Debugging with gdb tutorial.
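As a small illustration of why NaN values are hard to track down, the Python sketch below shows that a NaN never compares equal to itself and that it propagates through later arithmetic. It is not a substitute for trapping floating point exceptions in a compiled code. The helper first_nan_index is a hypothetical debugging aid, not part of any SHARCNET tool.

```python
import math

def first_nan_index(values):
    """Return the index of the first NaN in a sequence, or None if there is none."""
    for index, value in enumerate(values):
        if math.isnan(value):
            return index
    return None

# inf - inf has no defined value, so IEEE arithmetic yields NaN.
nan = float("inf") - float("inf")

print(nan == nan)                 # NaN never compares equal, even to itself
print(math.isnan(nan + 1.0))      # NaN propagates through later arithmetic
print(first_nan_index([0.5, nan, 2.0]))
```

Because the NaN silently propagates, the place where it is first noticed is often far from the operation that produced it, which is why trapping the exception at its source is the recommended approach.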

My program exited with an error code XXX - what does it mean?

Your application crashed, producing an error code XXX (where XXX is a number). What does it mean? The answer may depend on your application. Normally, user codes do not use the first 130 or so error codes, which are reserved for operating system level error codes. On most of our clusters, typing

 perror  XXX

will print a short description of the error. (This is a MySQL utility, and for XXX>122 it will start printing only MySQL-related error messages.) An accurate list of system error codes for the current operating system can be found on our clusters by printing the contents of the file /usr/include/asm-x86_64/errno.h (/usr/include/asm-generic/errno.h on some systems).

When the error code is returned by the scheduler (when your program submitted to the scheduler with "sqsub" crashes), it has a different meaning. Specifically, if the code is less than or equal to 128, it is a scheduler error, not an application error. Such situations have to be reported to SHARCNET staff. Scheduler exit codes between 129 and 255 are user job error codes; one has to subtract 128 to derive the usual OS error code.
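The rule above can be sketched in a few lines of Python. The function name decode_scheduler_exit_code is a hypothetical helper, and treating 0 and values above 255 as "other" is an assumption, since the rule as stated only covers codes up to 255. The description text comes from the operating system via os.strerror and may be worded differently from the list further down.

```python
import os

def decode_scheduler_exit_code(code):
    """Classify an exit code reported by the scheduler.

    Returns a (source, os_error_code, description) tuple.
    """
    if 129 <= code <= 255:
        os_code = code - 128          # recover the usual OS error code
        return ("job", os_code, os.strerror(os_code))
    if 1 <= code <= 128:
        return ("scheduler", None, "scheduler error: report to SHARCNET staff")
    # Assumption: 0 and values above 255 fall outside the rule described above.
    return ("other", None, "not covered by the 1-255 rule")

print(decode_scheduler_exit_code(140))
```

For example, a scheduler exit code of 140 corresponds to OS error code 12 (140 - 128).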

On our systems that run Torque/Maui/Moab, exit code 271 means that your program has exceeded one of the resource limits you specified when you submitted your job, typically either the runtime limit or the memory limit. One can correct this by setting a larger runtime limit with the sqsub -r flag (up to the limit allowed by the queue, typically 7 days) or by setting a larger memory limit with the sqsub --mpp flag, depending on the message that was reported in your job output file (exceeding the runtime limit will often only result in a message indicating "killed"). Note that both of these values will be assigned reasonable defaults that depend on the system and may vary from system to system. Another common exit code relating to memory exhaustion is 41 -- this may be reported by a job in the done state and should correspond with an error message in your job output file.

For your convenience, we list OS error codes below:

 1  Operation not permitted
 2  No such file or directory
 3  No such process
 4  Interrupted system call
 5  I/O error
 6  No such device or address
 7  Arg list too long
 8  Exec format error
 9  Bad file number
10  No child processes
11  Try again
12  Out of memory
13  Permission denied
14  Bad address
15  Block device required
16  Device or resource busy
17  File exists
18  Cross-device link
19  No such device
20  Not a directory
21  Is a directory
22  Invalid argument
23  File table overflow
24  Too many open files
25  Not a typewriter
26  Text file busy
27  File too large
28  No space left on device
29  Illegal seek
30  Read-only file system
31  Too many links
32  Broken pipe
33  Math argument out of domain of func
34  Math result not representable
35  Resource deadlock would occur
36  File name too long
37  No record locks available
38  Function not implemented
39  Directory not empty
40  Too many symbolic links encountered
41  (Reserved error code)
42  No message of desired type
43  Identifier removed
44  Channel number out of range
45  Level 2 not synchronized
46  Level 3 halted
47  Level 3 reset
48  Link number out of range
49  Protocol driver not attached
50  No CSI structure available
51  Level 2 halted
52  Invalid exchange
53  Invalid request descriptor
54  Exchange full
55  No anode
56  Invalid request code
57  Invalid slot
58  (Reserved error code)
59  Bad font file format
60  Device not a stream
61  No data available
62  Timer expired
63  Out of streams resources
64  Machine is not on the network
65  Package not installed
66  Object is remote
67  Link has been severed
68  Advertise error
69  Srmount error
70  Communication error on send
71  Protocol error
72  Multihop attempted
73  RFS specific error
74  Not a data message
75  Value too large for defined data type
76  Name not unique on network
77  File descriptor in bad state
78  Remote address changed
79  Can not access a needed shared library
80  Accessing a corrupted shared library
81  .lib section in a.out corrupted
82  Attempting to link in too many shared libraries
83  Cannot exec a shared library directly
84  Illegal byte sequence
85  Interrupted system call should be restarted
86  Streams pipe error
87  Too many users
88  Socket operation on non-socket
89  Destination address required
90  Message too long
91  Protocol wrong type for socket
92  Protocol not available
93  Protocol not supported
94  Socket type not supported
95  Operation not supported on transport endpoint
96  Protocol family not supported
97  Address family not supported by protocol
98  Address already in use
99  Cannot assign requested address
100 Network is down
101 Network is unreachable
102 Network dropped connection because of reset
103 Software caused connection abort
104 Connection reset by peer
105 No buffer space available
106 Transport endpoint is already connected
107 Transport endpoint is not connected
108 Cannot send after transport endpoint shutdown
109 Too many references: cannot splice
110 Connection timed out
111 Connection refused
112 Host is down
113 No route to host
114 Operation already in progress
115 Operation now in progress
116 Stale NFS file handle
117 Structure needs cleaning
118 Not a XENIX named type file
119 No XENIX semaphores available
120 Is a named type file
121 Remote I/O error
122 Quota exceeded
123 No medium found
124 Wrong medium type
125 Operation Cancelled
126 Required key not available
127 Key has expired
128 Key has been revoked
129 Key was rejected by service