|Note: Some of the information on this page is for our legacy systems only. The page is scheduled for an update to make it applicable to Graham.|
- 1 Programming and Debugging
- 1.1 What is MPI?
- 1.2 What is OpenMP?
- 1.3 How do I run an OpenMP program with multiple threads?
- 1.4 How do I measure the CPU time when running a multi-threaded job?
- 1.5 What mathematics libraries are available?
- 1.6 How do I use mathematics libraries such as BLAS and LAPACK routines?
- 1.7 My code is written in C/C++, can I still use those libraries?
- 1.8 What packages are available?
- 1.9 What interconnects are used on SHARCNET clusters?
- 1.10 Debugging serial and parallel programs
- 1.11 What is NaN ?
- 1.12 My program exited with an error code XXX - what does it mean?
Programming and Debugging
What is MPI?
MPI stands for Message Passing Interface, a standard for writing portable parallel programs which is well accepted in the scientific computing community. MPI is implemented as a library of subroutines layered on top of a network interface. The MPI standard defines both C/C++ and Fortran interfaces, so programs in any of these languages can use MPI. There are several MPI implementations, including OpenMPI and MPICH. Vendors of specific high-performance interconnects also provide their own libraries, usually a version of MPICH layered on an interconnect-specific hardware library.
For an MPI tutorial refer to the MPI tutorial.
In addition to C/C++ and Fortran versions of MPI, there exist other language bindings as well. If you have any special needs, please contact us.
What is OpenMP?
OpenMP is a standard for programming shared-memory systems using threads, with compiler directives embedded in the source code. It provides a higher-level approach to utilizing multiple processors within a single machine while keeping the structure of the source code as close to the conventional form as possible. OpenMP is much easier to use than the alternative (Pthreads) and is thus suitable for adding modest amounts of parallelism to pre-existing code. Because OpenMP consists mainly of compiler directives, your code can still be compiled by a serial compiler and should still behave the same.
OpenMP for C/C++ and Fortran is supported by many compilers, including the PathScale and PGI compilers for Opterons, and the Intel compilers for IA32 and IA64 (such as SGI's Altix). OpenMP support has been provided in the GNU compiler suite since v4.2 (OpenMP 2.5), and starting with v4.4 it supports the OpenMP 3.0 standard.
How do I run an OpenMP program with multiple threads?
An OpenMP program uses a single process with multiple threads rather than multiple processes. On multicore (i.e. practically all) systems, threads are scheduled on available processors and thus run concurrently. For each thread to run on its own processor, one needs to request the same number of CPUs as the number of threads to be used. To run an OpenMP program (here ompHello) with four threads, use the following job submission script.
The option --cpus-per-task=4 specifies to reserve 4 CPUs per process.
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0-0:5
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./ompHello
For a basic OpenMP tutorial refer to OpenMP tutorial.
How do I measure the CPU time when running a multi-threaded job?
The easiest solution is to use a simple benchmarking wrapper script, say "time.sh" (it should be made executable with "chmod u+x time.sh"):
#!/bin/bash
/usr/bin/time "$@"
Place it in the same directory as your code binary, and insert "./time.sh" before the binary name in your sqsub command, e.g.
sqsub -q threaded -n8 -r2m -o out2 ./time.sh ./code code_arguments ...
Tested with a simple threaded application, this works with multiple threads. You will get output like this:
494.62user 0.98system 1:02.23elapsed 796%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+515minor)pagefaults 0swaps
As you can see, CPU cycles from all 8 threads were counted (796%CPU).
What mathematics libraries are available?
Every system has the basic linear algebra libraries BLAS and LAPACK installed. Normally, these interfaces are contained in vendor-tuned libraries. On Intel-based (Xeon) clusters it's probably best to use the Intel math kernel library (MKL). On Opteron-based clusters, AMD's ACML library is available. However, either library will work reasonably well on both types of systems. If one expects to do a large amount of computation, it is generally advisable to benchmark both libraries so that one selects the one offering best performance for a given problem and system.
One may also find the GNU Scientific Library (GSL) useful for particular needs. The GNU Scientific Library is an optional package, available on any machine.
For a detailed list of libraries on each cluster, please check the documentation on the corresponding SHARCNET satellite web sites.
How do I use mathematics libraries such as BLAS and LAPACK routines?
First you need to know which subroutine you want to use. Check the references to find which routines meet your needs, then place calls to those routines in your program and compile your program against the libraries that provide them. For instance, if you want to compute the eigenvalues, and optionally the eigenvectors, of an N by N real nonsymmetric matrix in double precision, you will find that the LAPACK routine DGEEV does this. All you need to do is call DGEEV with the parameters specified in the LAPACK documentation, and compile your program to link against the LAPACK library.
Now to compile the program, you need to link it to a library that contains the LAPACK routines you call in your code. The general recommendation is to use Intel's MKL library, which has a module loaded by default on most Compute Canada/SHARCNET systems. The instructions on how to link your code with these libraries at compile time are provided on the MKL page.
My code is written in C/C++, can I still use those libraries?
Yes. Most of the libraries have C interfaces. If you are not sure about the C interface, or you need assistance using libraries written in Fortran, we can help you on a case-by-case basis.
What packages are available?
Various packages have been installed on Compute Canada/SHARCNET clusters at users' requests. The full, up-to-date list is available on the Compute Canada documentation wiki. If you do not see a package that you need on this list, please request it by submitting a problem ticket.
You can also search this wiki or the Compute Canada wiki for the package you are interested in to see if there is any additional information about it available.
We can also help you with compiling/installing a package into your own file space if you prefer that approach.
What interconnects are used on SHARCNET clusters?
Debugging serial and parallel programs
A debugger is a program that helps identify mistakes ("bugs") in programs, either at run time or "post-mortem" (by analyzing the core file produced by a crashed program). Debuggers can be either command-line or GUI (graphical user interface) based. Before a program can be debugged, it needs to be (re-)compiled with the -g switch, which tells the compiler to include symbolic information in the executable. For MPI problems on the HP XC clusters, -ldmpi includes the HP MPI diagnostic library, which is very helpful for discovering incorrect use of the API.
SHARCNET highly recommends using our commercial debugger DDT. It has a very friendly GUI, and can also be used for debugging serial, threaded, and MPI programs. A short description of DDT and cluster availability information can be found on its software page. Please also refer to our detailed Parallel Debugging with DDT tutorial.
SHARCNET also provides gdb (installed on all clusters, type "man gdb" to get a list of options and see our Common Bugs and Debugging with gdb tutorial).
What is NaN ?
NaN stands for "Not a Number". It is an undefined or unrepresentable value, typically encountered in floating point arithmetic (e.g. the square root of a negative number). To debug this in your program one typically has to unmask, or trap, floating point exceptions. This is fairly straightforward with Fortran compilers (e.g. with Intel's ifort one simply needs to add the -fpe0 switch), but somewhat more complicated with C/C++ codes, where the best solution is to use the feenableexcept() function. There are further details in the Common Bugs and Debugging with gdb tutorial.
My program exited with an error code XXX - what does it mean?
Your application crashed, producing an error code XXX (where XXX is a number). What does it mean? The answer may depend on your application. Normally, user codes do not use the first 130 or so error codes, which are reserved for operating system level errors. On most of our clusters, typing
perror XXX
will print a short description of the error. (perror is a MySQL utility, and for XXX>122 it will start printing only MySQL-related error messages.) An accurate list of system error codes for the current OS (operating system) can be found on our clusters in the file /usr/include/asm-x86_64/errno.h (/usr/include/asm-generic/errno.h on some systems).
When the error code is returned by the scheduler (when your program, submitted to the scheduler with "sqsub", crashes), it has a different meaning. Specifically, if the code is less than or equal to 128, it is a scheduler error, not an application error. Such situations have to be reported to SHARCNET staff. Scheduler exit codes between 129 and 255 are user job error codes; subtract 128 to obtain the underlying code (an exit status of 128+N conventionally means the job was killed by signal N, e.g. 137 = 128+9 corresponds to SIGKILL).
On our systems that run Torque/Maui/Moab, exit code 271 means that your program has exceeded one of the resource limits you specified when you submitted your job, typically either the runtime limit or the memory limit. One can correct this by setting a larger runtime limit with the sqsub -r flag (up to the limit allowed by the queue, typically 7 days) or by setting a larger memory limit with the sqsub --mpp flag, depending on the message that was reported in your job output file (exceeding the runtime limit will often only result in a message indicating "killed"). Note that both of these values will be assigned reasonable defaults that depend on the system and may vary from system to system. Another common exit code relating to memory exhaustion is 41 -- this may be reported by a job in the done state and should correspond with an error message in your job output file.
For your convenience, we list OS error codes below:
1 Operation not permitted
2 No such file or directory
3 No such process
4 Interrupted system call
5 I/O error
6 No such device or address
7 Arg list too long
8 Exec format error
9 Bad file number
10 No child processes
11 Try again
12 Out of memory
13 Permission denied
14 Bad address
15 Block device required
16 Device or resource busy
17 File exists
18 Cross-device link
19 No such device
20 Not a directory
21 Is a directory
22 Invalid argument
23 File table overflow
24 Too many open files
25 Not a typewriter
26 Text file busy
27 File too large
28 No space left on device
29 Illegal seek
30 Read-only file system
31 Too many links
32 Broken pipe
33 Math argument out of domain of func
34 Math result not representable
35 Resource deadlock would occur
36 File name too long
37 No record locks available
38 Function not implemented
39 Directory not empty
40 Too many symbolic links encountered
41 (Reserved error code)
42 No message of desired type
43 Identifier removed
44 Channel number out of range
45 Level 2 not synchronized
46 Level 3 halted
47 Level 3 reset
48 Link number out of range
49 Protocol driver not attached
50 No CSI structure available
51 Level 2 halted
52 Invalid exchange
53 Invalid request descriptor
54 Exchange full
55 No anode
56 Invalid request code
57 Invalid slot
58 (Reserved error code)
59 Bad font file format
60 Device not a stream
61 No data available
62 Timer expired
63 Out of streams resources
64 Machine is not on the network
65 Package not installed
66 Object is remote
67 Link has been severed
68 Advertise error
69 Srmount error
70 Communication error on send
71 Protocol error
72 Multihop attempted
73 RFS specific error
74 Not a data message
75 Value too large for defined data type
76 Name not unique on network
77 File descriptor in bad state
78 Remote address changed
79 Can not access a needed shared library
80 Accessing a corrupted shared library
81 .lib section in a.out corrupted
82 Attempting to link in too many shared libraries
83 Cannot exec a shared library directly
84 Illegal byte sequence
85 Interrupted system call should be restarted
86 Streams pipe error
87 Too many users
88 Socket operation on non-socket
89 Destination address required
90 Message too long
91 Protocol wrong type for socket
92 Protocol not available
93 Protocol not supported
94 Socket type not supported
95 Operation not supported on transport endpoint
96 Protocol family not supported
97 Address family not supported by protocol
98 Address already in use
99 Cannot assign requested address
100 Network is down
101 Network is unreachable
102 Network dropped connection because of reset
103 Software caused connection abort
104 Connection reset by peer
105 No buffer space available
106 Transport endpoint is already connected
107 Transport endpoint is not connected
108 Cannot send after transport endpoint shutdown
109 Too many references: cannot splice
110 Connection timed out
111 Connection refused
112 Host is down
113 No route to host
114 Operation already in progress
115 Operation now in progress
116 Stale NFS file handle
117 Structure needs cleaning
118 Not a XENIX named type file
119 No XENIX semaphores available
120 Is a named type file
121 Remote I/O error
122 Quota exceeded
123 No medium found
124 Wrong medium type
125 Operation Cancelled
126 Required key not available
127 Key has expired
128 Key has been revoked
129 Key was rejected by service