
Overview

This tutorial provides a concise explanation of frequently encountered bugs and how to use gdb to debug serial programs at SHARCNET. It is not comprehensive and only aims to give users enough knowledge to get started on their own.

To debug parallel programs users should consult the SHARCNET Parallel Debugging with DDT tutorial. Note that it can also be used to debug serial programs, should one desire a graphical interface.

Identifying bugs and errors

Typically one realizes there is a problem with a program when it fails to complete (crashes) or when it does not produce the expected output, whether that means corrupted/incorrect results or a failure to make progress (a hang).

One can determine that a job exited in an erroneous state by inspecting the job exit code in the web portal (see the jobs table at the bottom of your activity page) or by looking at the job output file (every job submission with sqsub should use the -o option to specify one). The output from the job can also indicate a problem with the state of the program or a lack of progress. For more information on diagnosing the behavior of jobs see Monitoring Jobs.

When a job fails, its output may contain a runtime error message or a signal from the operating system that helps identify the problem. If no error message is generated, or if the message is insufficient, one can use a debugger to manipulate and inspect the code as it runs to identify the nature and scope of the problem.

Common bugs and errors

Some frequently encountered OS signals that result from a program entering an erroneous state include:

Signal name              | OS signal # | OS signal name | Description
Floating point exception | 8           | SIGFPE         | The program attempted an arithmetic operation with values that do not make sense (eg. divide by zero)
Segmentation fault       | 11          | SIGSEGV        | The program accessed memory incorrectly (eg. accessed an array beyond its declared bounds) or exceeded an environment limit (eg. stack size)
Aborted                  | 6           | SIGABRT        | Generated by the runtime library of the program, or a library it uses, after detecting a failure condition

Another problem that is common in scientific computing is the handling of exceptional values. Depending on the compiler and the runtime environment, a program may produce NaN values that go unreported and silently corrupt your results (wasting lots of cycles in the process). By selecting the right compiler flags (or, if necessary, library functions) you can make the creation of these values produce a floating point exception instead.
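For example, with the GNU Fortran compiler (not one of the default SHARCNET compilers discussed below, but a common case) this is a single compile flag:

 gfortran -ffpe-trap=zero,invalid,overflow bugs.f90

The equivalent flags for the default SHARCNET compilers are given in the next section.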

Runtime errors are more verbose than signals from the OS, allowing some problems to be resolved without the need to debug, especially if one has a thorough knowledge of the code. For an unfamiliar code, one can use the debugger to inspect the state of the program when it triggered the error.

For a comprehensive list of common bugs see this Wikipedia article.

Preparing your program for debugging

In order to run a program in a debugger, it should be compiled to include a symbol table. With most compilers, this means adding the -g flag to the compile line.

One should also disable all compiler optimizations by specifying the -O0 flag (or its equivalent); otherwise optimizations may lead to misleading debugger behaviour or obscure the bug (this is not the case for all compilers).
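For example, assuming one is using the GNU compilers directly (gcc/gfortran) rather than a system compiler wrapper, a typical debugging build looks like:

 gcc -g -O0 -o a.out bugs.c
 gfortran -g -O0 -o a.out bugs.f90

Most compilers accept the same -g and -O0 flags, though the wrappers on particular systems may add their own defaults (see below).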

Finally, depending on the compiler and how it is invoked in the default environment, a program may not stop and issue a SIGFPE after doing something nonsensical. Instead, it will continue with the stored value represented by a NaN (not a number) or an Inf (infinity) value. When debugging, one should turn off this masking behaviour (in other words, trap the creation of exceptional values) to help identify any potential problems with the code. The following describes the current behaviour of the default SHARCNET compilers:

One can see what is set by default by running the compiler with the -show flag.

Pathscale Compilers

 [merz@tig241 ~]$ cc -show
 pathcc -Wall -O3 -OPT:Ofast -fno-math-errno ...

To unmask FPEs, one should add these flags:

 -TENV:simd_zmask=OFF -TENV:simd_imask=OFF -TENV:simd_omask=OFF

Note that these TENV flags are only for the Pathscale compilers (ie. when one has successfully done a module load pathscale). They cause the program to stop when it encounters an FPE.

Intel Compilers

 [merz@tig241 ~]$ cc -show
 icc -O3 -vec-report0 ...

Note that by default the Intel compilers mask exceptions and produce NaN/Inf values. For the Intel Fortran compiler (ifort) one can turn off this behaviour by specifying the -fpe0 flag, which allows programs to stop when they encounter an FPE. No equivalent flag exists in the Intel C/C++ compiler, so one must implement the trapping in code. One way to do this is with the glibc feenableexcept() function, as described here.
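As a rough illustration, the following is a minimal sketch of unmasking FPEs from C with glibc's feenableexcept(); which exceptions to unmask (division by zero, invalid operations, overflow) is an assumption and should be adapted to the code being debugged:

 #define _GNU_SOURCE         /* feenableexcept() is a GNU extension */
 #include <fenv.h>
 #include <stdio.h>
 
 int main(void)
 {
     /* unmask the common exceptions so they raise SIGFPE instead of
        silently producing NaN/Inf values */
     feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
 
     double a = 0.0, b = 1.0;
     printf("%f\n", b / a);  /* now raises SIGFPE rather than printing inf */
     return 0;
 }

With the call in place the program dies with a floating point exception at the offending statement, which the debugger will then catch as in the examples below (linking against -lm may be required for the fenv functions on some systems).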

Using gdb

To illustrate the debugging process, there are C and Fortran example codes at the end of the tutorial that include both a floating point error and a segmentation fault. These examples are trivial, and are simply intended to show how easy it is to use the debugger. Note that the behaviour of the debugger is the same regardless of the language one is using (for the most part!), so we'll show the Fortran FPE example and the C Segmentation Fault in the walk-through that follows (on requin and kraken respectively). In addition, the walk-through only addresses the behaviour of the Pathscale compilers on the XC systems (in particular, on kraken) - one will need to keep the above comments on the Intel compiler in mind for any debugging on other systems.

Please note: in the following examples one can simply run the example programs in the debugger on the login node as the programs are small and don't use a lot of resources or run for a very long time. If you are debugging a large-memory program or a program that takes longer than a few seconds to run you should use the cluster development nodes (available on kraken, saw and orca) or submit it as a job on the cluster using the core file method outlined below. It is SHARCNET policy that users refrain from running anything substantial on the login nodes as these are a shared resource.

first bug: an FPE

First, to illustrate what happens when the code is run as is:

[snuser@req770 bugs]$ f90  -TENV:simd_zmask=OFF -TENV:simd_imask=OFF -TENV:simd_omask=OFF bugs.f90
[merz@req770 bugs]$ ./a.out
Floating point exception signaled at 400f6b: floating point divide by zero
Aborted

Submitting this job to the queuing system produces an output file that looks like:

[snuser@req770 bugs]$ sqsub -t -r 1m -o bugs.out ./a.out 
submitted as jobid 872464
<wait a bit...>


[snuser@req770 bugs]$ cat bugs.out 
Floating point exception signaled at 400f6b: floating point divide by zero
srun: error: req767: task0: Aborted
srun: Terminating job
------------------------------------------------------------
Sender: LSF System <lsfadmin@lsfhost.localdomain>
Subject: Job 872464: <./a.out> Exited

Job <./a.out> was submitted from host <req770> by user <snuser>.
Job was executed on host(s) <lsfhost.localdomain>, in queue <test>, as user <snuser>.
</home/snuser> was used as the home directory.
</work/snuser/bugs> was used as the working directory.
Started at Fri Jan 23 13:51:18 2014
Results reported at Fri Jan 23 13:52:22 2014

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
./a.out
------------------------------------------------------------

Exited with exit code 136.

Resource usage summary:

   CPU time   :      0.26 sec.
   Max Memory :       104 KB
   Max Swap   :       884 KB


The output (if any) is above this job summary.

[snuser@req770 bugs]$ 

Notice the Floating point exception message, and the fact that the job exited with code 136. Exit codes above 128 indicate that the program was terminated by a signal: 136 = 128 + 8, ie. SIGFPE. To debug it in gdb:

First compile:

[snuser@req770 bugs]$ f90  -TENV:simd_zmask=OFF -TENV:simd_imask=OFF -TENV:simd_omask=OFF -O0 -g bugs.f90

Now start the debugger, specifying the program we want to debug:

[snuser@req770 bugs]$ gdb a.out
GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

At this point, the program will be loaded, but is not running, so start it:

(gdb) r
Starting program: /req_sfs/work/snuser/bugs/a.out
Program received signal SIGFPE, Arithmetic exception.
0x0000000000400f81 in divide (d=0, e=1) at /home/merz/bugs/bugs.f90:19
19	     print *,e/d
Current language:  auto; currently fortran

Note that the debugger stops at the FPE and shows which function/routine it was in, what values the input arguments had, the line number of the source file where the problem occurred, and the actual line of the file. In this case this output is sufficient to diagnose the problem: clearly e/d is undefined since the denominator is zero. One can also look at a stack trace to see what has been called up to this point:

(gdb) where
#0  0x0000000000400f81 in divide (d=0, e=1) at /home/sndemo/bugs/bugs.f90:19
#1  0x0000000000400edd in MAIN__ () at /home/sndemo/bugs/bugs.f90:12
#2  0x0000000000401406 in __f90_main (argc=1, argv=0x7fbfffea88, arge=0x7fbfffea98) at /home/hahn/gwork/path64-suite/compiler/compiler/src/libpathfortran/../libf/fio/main.c:45
#3  0x00000000004011d4 in main (argc=1, argv=0x7fbfffea88, arge=0x7fbfffea98) at /home/hahn/gwork/path64-suite/compiler/compiler/src/libpathfortran/fnord.c:64

(Note that for Fortran the extra __f90_main and main routines are part of the runtime.)

An important caveat concerning the stack trace is that the debugger may display a deep stack (ie. a long list of functions that have been entered), indicating a problem triggered inside a system library. While the system library function is the last function that was executed before the program failed, it is unlikely that there is actually a bug in the system library. One should trace back through the stack to the last call from the program into the library and inspect the arguments that were given to the library function to ensure that they are sensible - typically errors in system libraries occur when the library functions are called with incorrect arguments.
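For example, a hypothetical session (the library, function names and frame numbers below are invented purely for illustration) might look like this, using frame to move to the last call the program made into the library and info args to check what it passed:

 (gdb) where
 #0  0x00002aaaab12f3e0 in internal_helper () from /usr/lib64/libexample.so
 #1  0x00002aaaab12e9a2 in library_call () from /usr/lib64/libexample.so
 #2  0x0000000000400612 in compute (n=-5, data=0x0) at mycode.c:42
 #3  0x0000000000400701 in main () at mycode.c:80
 (gdb) frame 2
 #2  0x0000000000400612 in compute (n=-5, data=0x0) at mycode.c:42
 42	    library_call(data, n);
 (gdb) info args
 n = -5
 data = (double *) 0x0

Here the crash happens inside the library, but frame 2 shows the program handed it a null pointer and a negative count - that is where the real bug lies.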

In addition to the stack trace, one may look at the source code file, centered around a particular line:

(gdb) l 19
14	 end program
15	 
16	 subroutine divide(d,e)
17	     implicit none
18	     real d,e
19	     print *,e/d
20	 end subroutine
21	 
22	 subroutine arrayq(f,g)
23	     implicit none

One can inspect the values of different variables in the current level of the stack:

(gdb) p d
$1 = 0
(gdb) p e
$2 = 1

Or one can go "up" the stack to look at values in the calling function/routine:

(gdb) up
#1  0x0000000000400edd in MAIN__ () at /home/merz/bugs/bugs.f90:12
12	     call divide(a,b)
(gdb) p a
$3 = 0
(gdb) p b
$4 = 1

When one is finished, it's easy to exit:

(gdb) q
The program is running.  Exit anyway? (y or n) y
[snuser@req770 bugs]$

second bug: a segmentation fault

Now, to illustrate a segfault, change the denominator in bugs.c to be non-zero (eg. a=4.0) so that the program gets past the division. Compile the modified code and run it to see what happens:

[snuser@nar316 bugs]$ ./a.out
0.250000
Segmentation fault
[snuser@bul135 bugs]$ sqsub -r 10m -o bugs.1.out ./a.out 
THANK YOU for providing a runtime estimate of 10m!
submitted as jobid 122902
[snuser@bul135 bugs]$ cat bugs.1.out 
/var/spool/torque/mom_priv/jobs/9037083.krasched.SC: line 3:  9813 Segmentation fault      ./a.out
--- SharcNET Job Epilogue ---
             job id: 9037083
        exit status: 139
           cpu time: 0 / 600s (0 %)
WARNING: Job died due to SIGSEGV - Invalid memory reference
WARNING: Job only used 0 % of its requested cpu time.

Notice the Segmentation fault message, and the fact that the job exited with code 139 (128 + 11, ie. SIGSEGV). To debug it in gdb:

[snuser@bul135 bugs]$ gdb a.out 
GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) r
Starting program: /home/snuser/bugs/a.out 
0.250000
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400514 in arrayq (f=0x7fbfffe980, q=12000000) at /nar_sfs/work/snuser/bugs/bugs.c:10
10        printf("%f\n",f[q]);

Note that the program stops automatically when it hits the segmentation fault, and gdb shows which function it was in, the values of the input variables, and the offending line in the source. One can then try printing out values of the array to see why it would have a problem:

(gdb) p f
$1 = (float *) 0x7fbfffe980
(gdb) p f[1]
$2 = 1
(gdb) p f[9]
$3 = 9
(gdb) p f[q]
Cannot access memory at address 0x7fc2dc5580

So it is clear that the program is trying to access memory it shouldn't. Note that this is lucky; had one accidentally accessed something just outside the array bounds:

(gdb) p f[11]
$4 = 0
(gdb) p f[1000]
$5 = 7.03598541e+22
(gdb) p f[10000]
Cannot access memory at address 0x7fc00085c0

It would have resulted in a valid number and the program would have carried on, but the results would have been wrong. So one cannot count on an out-of-bounds array access always resulting in a segmentation fault. Segmentation faults often occur when there are problems with pointers, since they may point to inaccessible addresses, or when a program tries to use too much memory. Using a debugger greatly helps in identifying these sorts of problems.
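The following contrived sketch (a hypothetical example, not part of the tutorial's bugs.c) shows how a small out-of-bounds write can silently corrupt a neighbouring variable instead of crashing; the exact behaviour is undefined and depends on the compiler and stack layout:

 #include <stdio.h>
 
 int main(void)
 {
     float important = 3.14f;   /* may happen to sit next to c on the stack */
     float c[10];
     int i;
 
     for (i = 0; i <= 10; i++)  /* off-by-one: the last iteration writes c[10] */
         c[i] = (float)i;
 
     printf("%f\n", important); /* may silently print 10.0 instead of 3.14 */
     return 0;
 }

When corruption like this is suspected, a gdb watchpoint (eg. watch important) will stop the program at the exact statement that modifies the variable, which makes the culprit easy to find.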

using core files

If a program uses a lot of memory, does not trigger the error condition reproducibly, or takes a long time to reach the error condition, then it shouldn't be debugged interactively (at least not in the first instance). In these situations one should submit the debug-instrumented program to the cluster as a compute job so that it produces a core file when it crashes. A core file contains the state of the program at the time it crashed; one can then load this file into the debugger to inspect that state and determine what caused the problem.

By default your SHARCNET environment is not configured to produce core files. To enable them when using the bash shell on SHARCNET systems (the default shell), one must set the core limit to be non-zero; setting it to unlimited should suffice, eg.

ulimit -c unlimited

then when one runs a program that crashes it should indicate that it has produced (dumped) a core file, eg.

[snuser@nar316 bugs]$ cc -g  bugs.c 
[snuser@nar316 bugs]$ ./a.out 
0.250000
Segmentation fault (core dumped)

The core file should appear in the present working directory with a name of the form core.PID, where PID is the process id of the program instance that crashed. Note: for anything more complex than the examples provided in this tutorial you should submit this as a job to the cluster, in which case the core file will be placed in the working directory used by the job, but one must submit the job with the -f permitcoredump option specified to sqsub.

One can then load the core file by giving it to gdb as an additional argument, eg.

[snuser@nar316 bugs]$ gdb a.out core.10966 
GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Core was generated by `./a.out'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /hptc_cluster/sharcnet/pathscale/2.2.1/lib/2.2.1/libpscrt.so.1...done.
Loaded symbols for /opt/sharcnet/pathscale/current/lib/2.2.1/libpscrt.so.1
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
#0  0x0000000000400514 in arrayq (f=0x7fbfffdfc0, q=12000000) at /home/merz/bugs/bugs.c:10
10	     printf("%f\n",f[q]);
(gdb) where
#0  0x0000000000400514 in arrayq (f=0x7fbfffdfc0, q=12000000) at /home/merz/bugs/bugs.c:10
#1  0x00000000004005f3 in main (argc=1, argv=0x7fbfffe0f8) at /home/merz/bugs/bugs.c:26
(gdb) q

Note that in this case one does not need to run the program in the debugger - it will simply inspect the state of the core file and use the debugging-instrumented binary to display the type of error and where it occurs. One may then run the gdb where command to get the stack backtrace, etc., to further identify the problem.
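One related command that is often useful when inspecting a core file from a long run is bt full, which prints the backtrace together with the local variables of every frame:

 (gdb) bt full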

As long as one sets the core size limit with the ulimit command before submitting the job, and submits the job with the sqsub -f permitcoredump flag, this environment setting should propagate to the job and the program should generate a core file. Keep in mind that this setting does not persist between logins, so you should either put it in your shell configuration file (eg. ~/.bash_profile) or run it each time you log into a system if you want your programs to produce a core when they crash.
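For example, adding the following to ~/.bash_profile (a minimal sketch; adjust to your own shell setup) makes the setting take effect in every login session:

 # always allow crashed programs to dump core
 ulimit -c unlimited

Running ulimit -c with no argument prints the current limit (0 means core dumps are disabled).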

debugging interactively

If you need to view the state of the program leading up to the crash, perhaps repeatedly, then a core file won't suffice and it is suggested that one run the debugger on one of the cluster development nodes (avoid running it on the login node!). If possible, one should resume the program from a checkpoint close to the crash to avoid waiting a long time for the program to reach the erroneous state. On requin, which does not have development nodes, one may submit an interactive job as follows.

One can start gdb on a compute node interactively (requin only!) by submitting it to the test queue (1 hour runtime limit!) with sqrun as follows:

[snuser@req770 bugs]$ sqrun -t -q serial -r 1h gdb ./a.out 
submitted as jobid 409339
job starting...
GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

One can then proceed to debug in the usual fashion:

r
(gdb) Starting program: /req_sfs/work/snuser/bugs/a.out 

Program received signal SIGSEGV, Segmentation fault.
0x0000000000400514 in arrayq (f=0x7fbfffd740, q=12000000) 
    at /req_sfs/work/snuser/bugs/bugs.c:10
10        printf("%f\n",f[q]);

When you exit the debugger, the job will terminate automatically:

 q
(gdb) [snuser@req317 bugs]$ 


Note: you may not see the (gdb) prompt, or it may appear out of order (as above), but you can proceed as though it were there.

Hung-up processors

Sometimes when a job does not complete successfully, one or more processes end up hung in the system, consuming cycles and not being available for other jobs. This happens often when debugging a code and using gdb in batch mode.

A script to submit an MPI job in which gdb runs in batch mode would look like this:

 #!/bin/bash
 
 # remove any stale gdb startup file
 rm -rf .gdbinit
 
 # write the gdb commands to be run non-interactively:
 # run the program, print a backtrace when it stops, then quit
 cat > .gdbinit << EOF
 r
 bt
 q
 EOF
 
 # submit gdb (which reads ./.gdbinit at startup) as a 4-process MPI job
 sqsub -r 180m -q mpi -n 4 -o ${CLU}_GDB_MPI_4_%J gdb ./a.out
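As an aside, the same commands can be passed to gdb explicitly instead of relying on it picking up .gdbinit from the working directory (this assumes a gdb recent enough to support batch mode, which all current versions are):

 gdb --batch -x gdbcmds ./a.out

where gdbcmds contains the same r/bt/q commands as above.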


When this job aborts, all 4 processes just sit there, still occupying their processors.

You can tell that the job has aborted by using the tail command on the output file, which would print something like this:


Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 182926459616 (LWP 26716)]
0x00000000004e1fbe in dpolft (n=314, x=(4.1247861252457865),
       y=(-86.315618986822656), w=(1), maxdeg=3, ndeg=3, eps=0, r=(-0), ierr=62,
       a=(3)) at /nar_sfs/work/nickc/CFFC/src_3D/Math/Polyfit/dpolft.f:233
233 A(K5PI) = TEMD2 - R(I)


You cannot resubmit another job into the test queue because the job is still running. If you submit into the regular queue you will wait longer, but either way resources are being wasted, so it is important to terminate all of these processes.

One way to clean all this up is to make sure you kill all your processes after your jobs have finished.

To kill all processes belonging to $USER on all nodes, use the command:

  pdsh -a pkill -u $USER

To kill all processes related to a particular job, first use the sqjobs command to find the nodes associated with the job. For example, if you are on saw and issue the sqjobs command as follows:

[nickc@saw377:/work/nickc/QD_DOC/PI_QD_MPI] sqjobs
 jobid queue state ncpus   prio    nodes time command
------ ----- ----- ----- ------ -------- ---- -------
278389  test     R    16 12.790 saw[4,8]  14s ./x.im

where the above job is hung up. To kill it, issue the command:

pdsh -w saw[4,8]  pkill -u $USER

which kills all processes belonging to $USER on nodes saw[4,8].

Similarly, if sqjobs reports that the job is running on nodes saw[65-69,75], then you use the command:

   pdsh -w saw[65-69,75] pkill -u $USER

to release the nodes.

documentation

We've barely scratched the surface thus far - there are many other commands that one can use in gdb. For further information, one should consult the gdb manual page by executing man gdb. The full documentation for gdb can be found online at the gdb website.

If you would like further assistance please submit a ticket to the SHARCNET problem ticket system.

Other useful debuggers

While gdb is good for common problems with serial code, it doesn't help debug more complex problems like parallel bugs and subtle memory errors. The following tools are recommended to address these situations:

For parallel programs at SHARCNET, we recommend using the graphical DDT Debugger. It can be used to debug both threaded and MPI parallel codes, and includes additional memory checking functionality that may help diagnose more obscure bugs.

An excellent package for debugging memory-related problems is Valgrind. It includes tools that help with both debugging and profiling, including a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph-generating cache profiler, and a heap profiler. It also includes an experimental tool that detects out-of-bounds reads and writes of stack, global and heap arrays.
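For instance, a typical way to run Valgrind's memory checker on the example program (a sketch using common memcheck options) is:

 valgrind --tool=memcheck --track-origins=yes ./a.out

Valgrind then reports each invalid read or write (particularly on heap memory) along with a stack trace, as well as the origin of any uninitialised values, which is often enough to locate a memory bug even when the program does not crash.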

Examples

FORTRAN CODE: bugs.f90
 program bugs
     implicit none
     real a,b
     real c(10)
     integer p
     a=0.0
     b=1.0
     do p=1,10
         c(p)=p
     enddo
     p=12000000
     call divide(a,b)
     call arrayq(c,p)
 end program
 
 subroutine divide(d,e)
     implicit none
     real d,e
     print *,e/d
 end subroutine
 
 subroutine arrayq(f,g)
     implicit none
     real f(10)
     integer g
     print *,f(g)
 end subroutine

C CODE: bugs.c

 #include <stdio.h>
 
 void divide(float d, float e)
 {
     printf("%f\n",e/d);
 }
 
 void arrayq(float f[], int q)
 {
     printf("%f\n",f[q]);
 }
 
 int main(int argc, char **argv)
 {
     float a,b;
     float c[10];
     int p;
     a=0.0;
     b=1.0;
     for (p=0;p<10;p++)
     {
         c[p]=(float)p;
     }
     p=12000000;
    divide(a,b);
    arrayq(c,p);
    return(0);
 }