

Common Bugs and Debugging with gdb

Hugh Merz, HPC Programming Specialist


Overview

This tutorial provides a concise explanation of frequently encountered bugs and how to use gdb to debug serial programs at SHARCNET. It is not comprehensive and only aims to give users enough knowledge to get started on their own.

To debug parallel programs users should consult the SHARCNET Parallel Debugging with DDT tutorial. Note that it can also be used to debug serial programs, should one desire a graphical interface.

If you would like further assistance please submit a ticket to the SHARCNET problem ticket system.

Identifying bugs and errors

If your job fails, check the job output file (every job submission with sqsub should use the -o option to specify one) to see whether it contains any unexpected output. You may find a runtime error message or a signal from the operating system that helps identify the problem. If necessary, one can use a debugger to manipulate and inspect the code as it is running to identify the nature and scope of the problem.

One can also determine that a job exited in an erroneous state by inspecting the job exit code in the web portal.

Common bugs and errors

Some frequently encountered OS signals resulting from an erroneous state include:

Signal name | OS signal # | OS signal name | LSF exit code | Description
Floating point exception | 8 | SIGFPE | 136 | The program attempted an arithmetic operation with values that do not make sense (e.g. divide by zero)
Segmentation fault | 11 | SIGSEGV | 139 | The program accessed memory incorrectly (e.g. accessing an array beyond its declared bounds)
Aborted | 6 | SIGABRT | 134 | Generated by the runtime library of the program or a library it uses, after having detected a failure condition

Runtime errors are more verbose than signals from the OS, allowing some problems to be resolved without the need to debug, especially if one has a thorough knowledge of the code. For an unfamiliar code, one can use the debugger to inspect the state of the program when it triggered the error.

Preparing your program for debugging

In order to run a program in a debugger, it should be compiled to include a symbol table. With most compilers, this means adding the -g flag to the compile line. One should also disable all compiler optimizations by specifying the -O0 flag; otherwise, optimizations may lead to misleading debugger behavior or obscure the bug.

An important aspect of the default compilers at SHARCNET is that they automatically include flags that allow programs to stop when they encounter an FPE. If one includes the -g flag, one also has to include these flags explicitly. One can see what is set by default by running the compiler with the -show flag, e.g.:

[snuser@nar316 bugs]$ cc -show
pathcc -Wall -O3 -OPT:Ofast -fno-math-errno -TENV:simd_zmask=OFF -TENV:simd_imask=OFF -TENV:simd_omask=OFF

So to compile for debugging, one would use (for Fortran, specify pathf90 instead of pathcc):

pathcc -Wall -O0 -g -TENV:simd_zmask=OFF -TENV:simd_imask=OFF -TENV:simd_omask=OFF

Note that these TENV flags are only for the PathScale compilers (the default on Opteron systems). Other compilers may not provide such flags, in which case the code will not fail with a SIGFPE after doing something nonsensical, but will instead leave the value as a NaN (not a number) or an Inf (infinity).

Using gdb

To illustrate the debugging process, there is C and Fortran example code at the end of the tutorial that includes both a floating point error and a segmentation fault. These examples are trivial, and are simply intended to show how easy it is to use the debugger. Note that the behavior of the debugger is the same regardless of the language one is using, so we'll just show the C example below.

First, to illustrate what happens when the code is run as is:

[snuser@nar317 bugs]$ cc bugs.c
[snuser@nar317 bugs]$ ./a.out
Floating point exception

Submitting this job to the queuing system produces an output file that looks like:

[snuser@nar316 bugs]$ sqsub -t -r 10m -o bugs.o ./a.out
THANK YOU for providing a runtime estimate of 10m!
submitted as jobid 122897
<wait a bit...>
[snuser@nar316 bugs]$ cat bugs.o
/opt/sharcnet/sharcnet-lsf/bin/sn_job_starter.sh: line 75: 534 Floating point exception "$@"

------------------------------------------------------------
Sender: LSF System
Subject: Job 122897: <./a.out> Exited

Job <./a.out> was submitted from host <nar316> by user <snuser>.
Job was executed on host(s) <lsfhost.localdomain>, in queue <test>, as user <snuser>.
</home/snuser> was used as the home directory.
</work/snuser/bugs> was used as the working directory.
Started at Fri Jan 23 13:51:18 2009
Results reported at Fri Jan 23 13:52:22 2009

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
./a.out
------------------------------------------------------------

Exited with exit code 136.

Resource usage summary:

CPU time : 0.26 sec.
Max Memory : 104 KB
Max Swap : 884 KB


The output (if any) is above this job summary.

[snuser@nar316 bugs]$

Notice the Floating point exception message, and the fact that it exited with code 136. To debug it in gdb:

First compile:


[snuser@nar316 bugs]$ pathcc -Wall -O0 -g -TENV:simd_zmask=OFF -TENV:simd_imask=OFF -TENV:simd_omask=OFF bugs.c

Now start the debugger, specifying the program we want to debug:

[snuser@nar316 bugs]$ gdb a.out
GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

At this point, the program will be loaded, but is not running, so start it:

(gdb) r
Starting program: /nar_sfs/work/snuser/bugs/a.out

Program received signal SIGFPE, Arithmetic exception.
0x00000000004004db in divide (d=0, e=1) at /nar_sfs/work/snuser/bugs/bugs.c:5
5 printf("%f\n",e/d);

Note that the debugger will stop at the FPE, and show which function/routine it was in, what values the input arguments had, the line number of the source file where the problem occurred, and the actual line of the file.
This is sufficient to diagnose the problem: clearly e/d is undefined since the denominator is zero.
One can also look at a stack trace, to see what has been called up till this point:

(gdb) where
#0 0x00000000004004db in divide (d=0, e=1) at /nar_sfs/work/snuser/bugs/bugs.c:5
#1 0x00000000004005e7 in main (argc=1, argv=0x7fbfffeab8) at /nar_sfs/work/snuser/bugs/bugs.c:24

Or look at the source code file, centered around a particular line:

(gdb) l 5
1 #include <stdio.h>
2
3 void divide(float d, float e)
4 {
5 printf("%f\n",e/d);
6 }
7
8 void arrayq(float f[], int q)
9 {
10 printf("%f\n",f[q]);

One can inspect the values of different variables in the current level of the stack:

(gdb) p d
$1 = 0
(gdb) p e
$2 = 1

Or one can go "up" the stack to look at values in the calling function/routine:

(gdb) up
#1 0x00000000004005e7 in main (argc=1, argv=0x7fbfffeab8) at /nar_sfs/work/snuser/bugs/bugs.c:24
24 divide(a,b);
(gdb) p a
$3 = 0
(gdb) p b
$4 = 1

When one is finished, it's easy to exit:

(gdb) q
The program is running. Exit anyway? (y or n) y
[snuser@nar316 bugs]$

Now, to illustrate a segfault, change the denominator to be non-zero, e.g. a=4.0.
Compile the modified code, and run it to see what happens:

[snuser@nar316 bugs]$ ./a.out
0.250000
Segmentation fault
[snuser@nar316 bugs]$ sqsub -t -r 10m -o bugs.1.o ./a.out
THANK YOU for providing a runtime estimate of 10m!
submitted as jobid 122902
[snuser@nar316 bugs]$ cat bugs.1.o
/opt/sharcnet/sharcnet-lsf/bin/sn_job_starter.sh: line 75: 964 Segmentation fault "$@"

------------------------------------------------------------
Sender: LSF System <lsfadmin@lsfhost.localdomain>
Subject: Job 122902: <./a.out> Exited

Job <./a.out> was submitted from host <nar316> by user <snuser>.
Job was executed on host(s) <lsfhost.localdomain>, in queue <test>, as user <snuser>.
</home/snuser> was used as the home directory.
</work/snuser/bugs> was used as the working directory.
Started at Fri Jan 23 14:18:23 2009
Results reported at Fri Jan 23 14:19:26 2009

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
./a.out
------------------------------------------------------------

Exited with exit code 139.

Resource usage summary:

CPU time : 0.25 sec.
Max Memory : 172 KB
Max Swap : 2 MB


The output (if any) is above this job summary.

Notice the Segmentation fault message, and the fact that the job exited with code 139. To debug it in gdb:

[snuser@nar316 bugs]$ gdb a.out
GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

(gdb) r
Starting program: /nar_sfs/work/snuser/bugs/a.out
0.250000

Program received signal SIGSEGV, Segmentation fault.
0x0000000000400514 in arrayq (f=0x7fbfffe980, q=12000000) at /nar_sfs/work/snuser/bugs/bugs.c:10
10 printf("%f\n",f[q]);

Note that the program stops automatically when it hits the segmentation fault, and shows you which function it is in, the values of the input variables, and the line in the source. One can then try printing out the values of the array, to see why it would have a problem:

(gdb) p f
$1 = (float *) 0x7fbfffe980
(gdb) p f[1]
$2 = 1
(gdb) p f[9]
$3 = 9
(gdb) p f[q]
Cannot access memory at address 0x7fc2dc5580

So it is clear that the program is trying to access something it shouldn't be. Note that this is the lucky case. Had one accidentally tried to access something just outside the array bounds:

(gdb) p f[11]
$4 = 0
(gdb) p f[1000]
$5 = 7.03598541e+22
(gdb) p f[10000]
Cannot access memory at address 0x7fc00085c0

It would have resulted in a valid number and the program would have carried on, but the results of the program would have been wrong. So one can't count on an out-of-bounds array access to always result in a segmentation fault. Often segmentation faults occur when there are problems with pointers, since they may point to inaccessible addresses, or when a program tries to use too much memory. Using a debugger greatly helps in identifying these sorts of problems.

We've barely scratched the surface thus far; there are many other commands that one can use in gdb. For further information, consult the gdb manual page by executing man gdb. The full documentation for gdb can be found online at the gdb website.

Other useful debuggers

While gdb is good for common problems with serial code, it doesn't help debug more complex problems like parallel bugs and subtle memory errors. The following tools are recommended to address these situations:

For parallel programs at SHARCNET, we recommend using the graphical DDT debugger. It can be used to debug both threaded and MPI parallel codes, and includes additional memory checking functionality that may help diagnose more obscure bugs.

An excellent package for debugging memory related problems is Valgrind. It includes tools that help with both debugging and profiling, including: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler. It also includes one experimental tool, which detects out of bounds reads and writes of stack, global and heap arrays.

Fortran example code

program bugs
implicit none
real a,b
real c(10)
integer p
a=0.0
b=1.0
do p=1,10
  c(p)=p
enddo
p=12000000
call divide(a,b)
call arrayq(c,p)
end program

subroutine divide(d,e)
implicit none
real d,e
print *,e/d
end subroutine

subroutine arrayq(f,g)
implicit none
real f(10)
integer g
print *,f(g)
end subroutine

C example code

#include <stdio.h>

void divide(float d, float e)
{
    printf("%f\n",e/d);
}

void arrayq(float f[], int q)
{
    printf("%f\n",f[q]);
}

int main(int argc, char **argv)
{
    float a,b;
    float c[10];
    int p;
    a=0.0;
    b=1.0;
    for (p=0;p<10;p++) {
        c[p]=(float)p;
    }
    p=12000000;
    divide(a,b);
    arrayq(c,p);
    return 0;
}

© 2009, Hugh Merz, SHARCNET