Programming and Debugging

What is MPI?

MPI stands for Message Passing Interface, a standard for writing portable parallel programs that is well accepted in the scientific computing community. MPI is implemented as a library of subroutines which is layered on top of a network interface. The MPI standard defines both C/C++ and Fortran interfaces, so programs in any of these languages can use MPI. There are several MPI implementations, including Open MPI and MPICH. High-performance interconnect vendors also provide their own libraries, usually a version of MPICH layered on an interconnect-specific hardware library.
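As a minimal illustration of the programming model, here is a sketch of the classic MPI "hello world" in C (not tied to any particular cluster or MPI implementation):

 #include <mpi.h>
 #include <stdio.h>
 
 int main(int argc, char *argv[])
 {
     int rank, size;
     MPI_Init(&argc, &argv);                /* start the MPI runtime */
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
     MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
     printf("Hello from rank %d of %d\n", rank, size);
     MPI_Finalize();
     return 0;
 }

Compiled with an MPI wrapper compiler (typically mpicc) and launched with, for example, mpirun -np 4 ./a.out, this prints one line per process.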

For an MPI tutorial refer to the MPI tutorial.

In addition to C/C++ and Fortran versions of MPI, there exist other language bindings as well. If you have any special needs, please contact us.

What is OpenMP?

OpenMP is a standard for programming shared memory systems using threads, with compiler directives instrumented in the source code. It provides a higher-level approach to utilizing multiple processors within a single machine while keeping the structure of the source code as close to the conventional form as possible. OpenMP is much easier to use than the alternative (Pthreads) and thus is suitable for adding modest amounts of parallelism to pre-existing code. Because OpenMP consists of compiler directives, your code can still be compiled by a serial compiler (which simply ignores the directives) and should still behave the same.
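For example, a minimal OpenMP program in C, essentially the ompHello example used below, might look like this sketch:

 #include <omp.h>
 #include <stdio.h>
 
 int main(void)
 {
     /* the parallel directive starts a team of threads;
        each thread executes the block once */
     #pragma omp parallel
     printf("Hello from thread %d of %d\n",
            omp_get_thread_num(), omp_get_num_threads());
     return 0;
 }

With GCC this is built as gcc -fopenmp ompHello.c -o ompHello; at run time the number of threads is taken from the OMP_NUM_THREADS environment variable.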

OpenMP for C/C++ and Fortran is supported by many compilers, including the PathScale and PGI compilers for Opterons, and the Intel compilers for IA32 and IA64 (such as SGI's Altix). OpenMP support has been provided in the GNU compiler suite since v4.2 (OpenMP 2.5), and starting with v4.4 it supports the OpenMP 3.0 standard.

How do I run an OpenMP program with multiple threads?

An OpenMP program uses a single process with multiple threads rather than multiple processes. On multicore (i.e. practically all) systems, threads are scheduled on the available processors and thus run concurrently. In order for each thread to run on its own processor, one needs to request as many CPUs as the number of threads to be used. To run an OpenMP program (here called ompHello) with four threads, use the following job submission script.

The option --cpus-per-task=4 reserves four CPUs for the (single) task, one per thread.

#!/bin/bash
# Request 4 CPUs for 5 minutes (days-hours:minutes) under the given account:
#SBATCH --account=def-someuser
#SBATCH --time=0-0:5
#SBATCH --cpus-per-task=4
# Use as many OpenMP threads as CPUs allocated by the scheduler:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./ompHello

For a basic OpenMP tutorial, refer to the OpenMP tutorial.

How do I measure the CPU time when running a multi-threaded job?

If you submit a job through the scheduler, then timing information will be collected by the scheduler itself and stored for later query.
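For example, with the Slurm scheduler the sacct utility can report the elapsed and total CPU time of a completed job (the job ID 123456 below is just a placeholder):

 sacct -j 123456 --format=JobID,Elapsed,TotalCPU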

If you are running an OpenMP program interactively, you can use the time utility to collect information.

In a typical example using 8 threads,

export OMP_NUM_THREADS=8
time ./ompHello

Your output will be something like:

real	0m1.633s
user	0m1.132s
sys	0m0.917s

In this example the real and user times are comparable, so this particular program is not benefiting from multithreading. In general, the real (wall-clock) time should be less than the user (total CPU) time if parallel execution is occurring.

What mathematics libraries are available?

Every system has the basic linear algebra libraries BLAS and LAPACK installed. Normally, these interfaces are provided by vendor-tuned libraries. On Intel-based (Xeon) clusters it is probably best to use the Intel Math Kernel Library (MKL). On Opteron-based clusters, AMD's ACML library is available. However, either library will work reasonably well on both types of systems. If one expects to do a large amount of computation, it is generally advisable to benchmark both libraries and select the one offering the best performance for the given problem and system.

One may also find the GNU Scientific Library (GSL) useful for particular needs. The GNU Scientific Library is an optional package, available on any machine.

For a detailed list of libraries on each cluster, please check the documentation on the corresponding SHARCNET satellite web sites.

How do I use mathematics libraries such as BLAS and LAPACK routines?

First you need to know which subroutine you want to use; check the references to find the routines that meet your needs. Then place calls to those routines in your program and compile your program against the particular libraries that contain them. For instance, if you want to compute the eigenvalues, and optionally the eigenvectors, of an N by N real nonsymmetric matrix in double precision, you will find that the LAPACK routine DGEEV does that. All you need to do is call DGEEV with the required parameters as specified in the LAPACK documentation, and compile your program to link against the LAPACK library.
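For illustration, here is a small C sketch that computes the eigenvalues of a 2 by 2 matrix through the LAPACKE C interface to DGEEV (this assumes the lapacke.h header and LAPACKE library are available; Fortran callers would invoke DGEEV directly):

 #include <stdio.h>
 #include <lapacke.h>
 
 int main(void)
 {
     /* 2x2 real nonsymmetric matrix in row-major order: [0 1; -2 -3] */
     double a[4] = { 0.0, 1.0, -2.0, -3.0 };
     double wr[2], wi[2];   /* real and imaginary parts of the eigenvalues */
 
     /* 'N','N': do not compute left or right eigenvectors */
     lapack_int info = LAPACKE_dgeev(LAPACK_ROW_MAJOR, 'N', 'N', 2,
                                     a, 2, wr, wi, NULL, 1, NULL, 1);
     if (info != 0) {
         fprintf(stderr, "DGEEV failed with info = %d\n", (int)info);
         return 1;
     }
     for (int i = 0; i < 2; i++)
         printf("lambda_%d = %g%+gi\n", i, wr[i], wi[i]);  /* expect -1 and -2 */
     return 0;
 }

Depending on the system, this is linked with something like -llapacke -llapack -lblas, or against MKL as described below.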

Now to compile the program, you need to link it to a library that contains the LAPACK routines you call in your code. The general recommendation is to use Intel's MKL library, which has a module loaded by default on most Compute Canada/SHARCNET systems. The instructions on how to link your code with these libraries at compile time are provided on the MKL page.
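As an illustration (the exact flag and module names vary with the system and compiler version; see the MKL page for details), linking against MKL with the Intel compiler can be as simple as:

 icc myprog.c -o myprog -mkl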


My code is written in C/C++, can I still use those libraries?

Yes. Most of the libraries have C interfaces. If you are not sure about the C interface, or you need assistance using those libraries written in Fortran, we can help you on a case-by-case basis.
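When no C interface is provided, Fortran routines can usually be called from C directly. A sketch of the common convention follows (with most compilers, e.g. gfortran, the Fortran symbol gains a trailing underscore and all arguments are passed by reference; the exact name mangling is compiler-dependent):

 #include <stdio.h>
 
 /* Fortran BLAS routine DSCAL(N, DA, DX, INCX) scales a vector by DA */
 extern void dscal_(const int *n, const double *da, double *dx, const int *incx);
 
 int main(void)
 {
     double x[3] = { 1.0, 2.0, 3.0 };
     int n = 3, incx = 1;
     double a = 2.0;
     dscal_(&n, &a, x, &incx);   /* x becomes {2, 4, 6} */
     printf("%g %g %g\n", x[0], x[1], x[2]);
     return 0;
 }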

What packages are available?

Various packages have been installed on Compute Canada/SHARCNET clusters at users' requests. The full, up-to-date list is available on the Compute Canada documentation wiki at https://docs.computecanada.ca/wiki/Modules. If you do not see a package that you need on this list, please request it by submitting a problem ticket.

You can also search this wiki or the Compute Canada wiki for the package you are interested in to see if there is any additional information about it available.

We can also help you with compiling/installing a package into your own file space if you prefer that approach.

What interconnects are used on SHARCNET clusters?

Currently, several different interconnects are in use on SHARCNET clusters: Quadrics, Myrinet, InfiniBand and standard IP-based Ethernet.

Debugging serial and parallel programs

A debugger is a program which helps to identify mistakes ("bugs") in programs, either at run time or "post-mortem" (by analyzing the core file produced by a crashed program). Debuggers can be either command-line or GUI (graphical user interface) based. Before a program can be debugged, it needs to be (re-)compiled with the switch -g, which tells the compiler to include symbolic information in the executable. For MPI problems on the HP XC clusters, -ldmpi links in the HP MPI diagnostic library, which is very helpful for discovering incorrect use of the API.
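A typical command-line session might look like the following (myprog and the core file name are placeholders; some systems name core files core.<pid>):

 gcc -g -O0 myprog.c -o myprog   # compile with debug symbols, no optimization
 gdb ./myprog core               # load the program and a core file post-mortem
 (gdb) bt                        # print a backtrace of where the crash occurred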

SHARCNET highly recommends using our commercial debugger DDT. It has a very friendly GUI, and can also be used for debugging serial, threaded, MPI, and CUDA (GPGPU) programs. A short description of DDT and cluster availability information can be found on its software page. Please also refer to our detailed Parallel Debugging with DDT tutorial.

SHARCNET also provides gdb (installed on all clusters, type "man gdb" to get a list of options and see our Common Bugs and Debugging with gdb tutorial).


What is NaN?

NaN stands for "Not a Number". It is an undefined or unrepresentable value, typically encountered in floating point arithmetic (e.g. the square root of a negative number). To debug this in your program one typically has to unmask or trap floating point exceptions. This is fairly straightforward with Fortran compilers (e.g. with Intel's ifort one simply needs to add the switch "-fpe0"), but somewhat more complicated with C/C++ codes, where the best solution is to use the feenableexcept() function. There are further details in the Common Bugs and Debugging with gdb tutorial.
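A minimal C sketch of trapping the exception (feenableexcept() is a GNU/glibc extension, hence the _GNU_SOURCE define):

 #define _GNU_SOURCE
 #include <fenv.h>
 #include <math.h>
 #include <stdio.h>
 
 int main(void)
 {
     /* turn invalid operations, division by zero and overflow
        into SIGFPE so the program stops at the faulty operation */
     feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);
 
     double y = -1.0;
     double x = sqrt(y);      /* raises FE_INVALID: the program aborts here */
     printf("x = %f\n", x);   /* never reached */
     return 0;
 }

Built with gcc -g prog.c -lm and run under a debugger, the program stops exactly at the operation that produced the NaN.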

My program exited with an error code XXX - what does it mean?

Your application crashed, producing an error code XXX (where XXX is a number). What does it mean? The answer may depend on your application. Normally, user codes do not touch the first 130 or so error codes, which are reserved for operating system level errors. On most of our clusters, typing

 perror  XXX

will print a short description of the error. (This is a MySQL utility, and for XXX>122 it will start printing only MySQL-related error messages.) An accurate list of system error codes for the current OS (operating system) can be found on our clusters in the file /usr/include/asm-x86_64/errno.h (/usr/include/asm-generic/errno.h on some systems).

When the error code is returned by the scheduler (when your program submitted to the scheduler with "sqsub" crashes), it has a different meaning. Specifically, if the code is less than or equal to 128, it is a scheduler (not an application) error. Such situations have to be reported to SHARCNET staff. Scheduler exit codes between 129 and 255 are user job error codes; one has to subtract 128 to derive the usual OS error code.

On our systems that run Torque/Maui/Moab, exit code 271 means that your program has exceeded one of the resource limits you specified when you submitted your job, typically either the runtime limit or the memory limit. One can correct this by setting a larger runtime limit with the sqsub -r flag (up to the limit allowed by the queue, typically 7 days) or by setting a larger memory limit with the sqsub --mpp flag, depending on the message that was reported in your job output file (exceeding the runtime limit will often only result in a message indicating "killed"). Note that both of these values are assigned reasonable defaults that may vary from system to system. Another common exit code relating to memory exhaustion is 41; this may be reported by a job in the done state and should correspond with an error message in your job output file.

For your convenience, we list the OS error codes below:

   1  Operation not permitted
   2  No such file or directory
   3  No such process
   4  Interrupted system call
   5  I/O error
   6  No such device or address
   7  Arg list too long
   8  Exec format error
   9  Bad file number
  10  No child processes
  11  Try again
  12  Out of memory
  13  Permission denied
  14  Bad address
  15  Block device required
  16  Device or resource busy
  17  File exists
  18  Cross-device link
  19  No such device
  20  Not a directory
  21  Is a directory
  22  Invalid argument
  23  File table overflow
  24  Too many open files
  25  Not a typewriter
  26  Text file busy
  27  File too large
  28  No space left on device
  29  Illegal seek
  30  Read-only file system
  31  Too many links
  32  Broken pipe
  33  Math argument out of domain of func
  34  Math result not representable
  35  Resource deadlock would occur
  36  File name too long
  37  No record locks available
  38  Function not implemented
  39  Directory not empty
  40  Too many symbolic links encountered
  41  (Reserved error code)
  42  No message of desired type
  43  Identifier removed
  44  Channel number out of range
  45  Level 2 not synchronized
  46  Level 3 halted
  47  Level 3 reset
  48  Link number out of range
  49  Protocol driver not attached
  50  No CSI structure available
  51  Level 2 halted
  52  Invalid exchange
  53  Invalid request descriptor
  54  Exchange full
  55  No anode
  56  Invalid request code
  57  Invalid slot
  58  (Reserved error code)
  59  Bad font file format
  60  Device not a stream
  61  No data available
  62  Timer expired
  63  Out of streams resources
  64  Machine is not on the network
  65  Package not installed
  66  Object is remote
  67  Link has been severed
  68  Advertise error
  69  Srmount error
  70  Communication error on send
  71  Protocol error
  72  Multihop attempted
  73  RFS specific error
  74  Not a data message
  75  Value too large for defined data type
  76  Name not unique on network
  77  File descriptor in bad state
  78  Remote address changed
  79  Can not access a needed shared library
  80  Accessing a corrupted shared library
  81  .lib section in a.out corrupted
  82  Attempting to link in too many shared libraries
  83  Cannot exec a shared library directly
  84  Illegal byte sequence
  85  Interrupted system call should be restarted
  86  Streams pipe error
  87  Too many users
  88  Socket operation on non-socket
  89  Destination address required
  90  Message too long
  91  Protocol wrong type for socket
  92  Protocol not available
  93  Protocol not supported
  94  Socket type not supported
  95  Operation not supported on transport endpoint
  96  Protocol family not supported
  97  Address family not supported by protocol
  98  Address already in use
  99  Cannot assign requested address
 100  Network is down
 101  Network is unreachable
 102  Network dropped connection because of reset
 103  Software caused connection abort
 104  Connection reset by peer
 105  No buffer space available
 106  Transport endpoint is already connected
 107  Transport endpoint is not connected
 108  Cannot send after transport endpoint shutdown
 109  Too many references: cannot splice
 110  Connection timed out
 111  Connection refused
 112  Host is down
 113  No route to host
 114  Operation already in progress
 115  Operation now in progress
 116  Stale NFS file handle
 117  Structure needs cleaning
 118  Not a XENIX named type file
 119  No XENIX semaphores available
 120  Is a named type file
 121  Remote I/O error
 122  Quota exceeded
 123  No medium found
 124  Wrong medium type
 125  Operation Cancelled
 126  Required key not available
 127  Key has expired
 128  Key has been revoked
 129  Key was rejected by service
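If you prefer to generate these descriptions yourself, the standard C library can produce them; a minimal sketch using strerror():

 #include <stdio.h>
 #include <string.h>
 
 int main(void)
 {
     /* print the OS description for error codes 1..129 */
     for (int i = 1; i <= 129; i++)
         printf("%3d  %s\n", i, strerror(i));
     return 0;
 }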