
Overview

This tutorial provides information about GPU resources at SHARCNET and an introduction to the use of GPUs for performing general purpose calculations. If you simply want to get started playing with code, the recommended starting point is here.

Prerequisites

Technically, one should understand the software development process and know how to use the bash shell.

To write GPU accelerated programs, one will need to be familiar with high-level programming languages. Most GPU programming is based on the C language and its extensions.

In the wider context, having a background in parallel computing techniques (threading, message passing, vectorization) will help one understand and apply GPU acceleration.

SHARCNET GPU Systems

monk.sharcnet.ca

Monk is the new GPU cluster opened to users in April 2012. It replaces angel as SHARCNET's primary GPU cluster.

Monk has 54 compute nodes, each with 2 Tesla M2070 GPU cards. For more information, go to the monk page in this wiki.

angel.sharcnet.ca

Angel is a cluster containing 11 NVIDIA Tesla S1070 GPU servers. As the monk cluster is now online, angel will be reconfigured in the near future. For more information, go to the angel page in this wiki.

Visualization Workstations

We also have a number of visualization workstations that contain the latest desktop-oriented GPUs and have GPGPU toolkits installed. Please see Visualization_Workstations for further details.

Submitting GPGPU jobs at SHARCNET

Interactive

The visualization workstations are used interactively with no job scheduler.

Queued

All GPU jobs on angel and monk should be queued via the job scheduler.

Requesting GPUs

The key scheduling constraint is to prevent jobs from sharing GPUs. To ensure this on angel and monk, always use the --gpp=<N> flag in sqsub, and to ensure higher priority for GPU jobs, request the gpu queue.

The value of <N> to select depends on how many GPUs per node you'd like to use. On angel and monk this is either 1 or 2 GPUs per node.

NOTE: Scheduling GPUs correctly is tricky if you are doing anything more complicated than a straightforward single-process, single-GPU run. If you are using multiple cards/processes/threads, your program must be written in such a way that it can utilize these resources. In all the commands below, the -v option is included so that the user can gain some insight into the actual command being used. Please check your program and the output of the sqjobs command to make sure your jobs are running in the way you intend.

Serial Example

Serial jobs should request 1 or 2 GPU units (--gpp=1 or --gpp=2) to get 1 or 2 GPUs, respectively.

  • eg. (serial job, 1 CPU, 1 process, 1 thread, using 1 GPU):

sqsub -v -q gpu --gpp=1 -n1 -N1 -r 1h -o <OUTFILE> <JOB>

  • eg. (serial job, 1 CPU, 1 process, 1 thread, using 2 GPUs):

sqsub -v -q gpu --gpp=2 -n1 -N1 -r 1h -o <OUTFILE> <JOB>

These commands will work even without -n1 -N1 since, when these are not present, sqsub will just use the default value, which is 1 for both flags.

Threaded Example

Threaded jobs should request 1 or 2 GPU units (--gpp=1 or --gpp=2) to get 1 or 2 GPUs, respectively.

  • eg. (1 process, 4 threads, 1 GPU):

sqsub -v -q gpu -f threaded --gpp=1 -n4 -N1 -r 1h -o <OUTFILE> <JOB>

  • eg. (1 process, 8 threads, 2 GPUs):

sqsub -v -q gpu -f threaded --gpp=2 -n8 -N1 -r 1h -o <OUTFILE> <JOB>

In this case -N1 could have been omitted since a threaded job will always run on one node by default.

MPI Example

MPI jobs should request 1 or 2 GPU units (--gpp=1 or --gpp=2) to get 1 or 2 GPUs per node, respectively.

eg: sqsub -v -q gpu -f mpi --gpp=1 -n2 -N2 -r 1h -o <OUTFILE> <JOB>

MPI/Threaded Example

Some programs (eg. NAMD, Gromacs) benefit from using whole nodes, with all the CPUs and GPUs available on a node. For example, on monk nodes with 8 CPUs and 2 GPUs each, it is beneficial for some software to run 2 MPI processes per node, each using 4 threads (one per core) and one GPU.

  • eg. job uses 32 CPUs and 8 GPUs on 4 nodes in total, creating 8 MPI processes, 2 per node, each using 4 threads:

sqsub -q gpu -f threaded -f mpi -N 4 -n 32 --gpp=2 --tpp=4 ......

In this case -N indicates that 4 nodes are required. This flag will ensure whole nodes are used by the job. The --tpp flag sets the OMP_NUM_THREADS environment variable, which should be the number of threads each MPI process launches. The number of MPI processes will be the value of -n divided by the value of --tpp. As there are 2 MPI processes per node, each would then communicate with one GPU card. It is the responsibility of the user's program to ensure that each MPI process connects to the right GPU; most GPU-enabled software will do this automatically.
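As an illustration only, the following sketch shows one common way for an MPI process to pick a GPU based on its rank. The mapping and names used here are assumptions rather than a SHARCNET requirement, and on monk's compute-exclusive GPUs the driver will often hand out devices automatically (see the cudaSetDevice note further below).

 /* Hypothetical sketch: bind each MPI process to one GPU on its node.
    Assumes a block distribution of ranks and 2 GPUs per node, as on monk. */
 #include <mpi.h>
 #include <cuda_runtime.h>
 #include <stdio.h>

 int main(int argc, char *argv[])
 {
     int rank, ngpus;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);

     cudaGetDeviceCount(&ngpus);      /* GPUs visible on this node     */
     cudaSetDevice(rank % ngpus);     /* crude rank -> device mapping  */

     printf("MPI rank %d using GPU %d of %d\n", rank, rank % ngpus, ngpus);

     /* ... launch kernels, transfer data, exchange results with MPI ... */

     MPI_Finalize();
     return 0;
 }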

GPU Programming Overview

GPU is an acronym for graphics processing unit. It is a special-purpose co-processor that helps the traditional central processor (CPU) with graphical tasks. A GPU is designed to process parallel data streams at an extremely high throughput. It does not have the flexibility of a typical CPU, but it can speed up some calculations by over an order of magnitude. Recent architectural decisions have made GPUs more flexible, and new general-purpose software stacks allow them to be programmed with far greater ease.

The use of GPUs in HPC is targeted at data-intensive applications which spend nearly all of their time running mathematical kernels that are amenable to SIMD operations. These kernels must exhibit fine-grained parallelism, both in terms of being able to process many independent streams of data and in pipelining operations on each stream. This emphasis on data parallelism means that GPUs will not aid programs that are constrained by Amdahl's Law or which require complex memory structures or data access patterns.
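To make the idea of a fine-grained, data-parallel kernel concrete, here is a minimal CUDA sketch (sizes and names are illustrative only) in which each GPU thread processes one independent element of a vector sum:

 /* Illustrative only: element-wise vector addition, one element per thread. */
 #include <cuda_runtime.h>
 #include <stdio.h>
 #include <stdlib.h>

 __global__ void vecAdd(const float *a, const float *b, float *c, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         c[i] = a[i] + b[i];
 }

 int main(void)
 {
     const int n = 1 << 20;
     size_t bytes = n * sizeof(float);
     float *a = (float *)malloc(bytes);
     float *b = (float *)malloc(bytes);
     float *c = (float *)malloc(bytes);
     for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

     float *da, *db, *dc;
     cudaMalloc((void **)&da, bytes);
     cudaMalloc((void **)&db, bytes);
     cudaMalloc((void **)&dc, bytes);
     cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
     cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

     vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);   /* many independent threads */

     cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
     printf("c[0] = %f\n", c[0]);

     cudaFree(da); cudaFree(db); cudaFree(dc);
     free(a); free(b); free(c);
     return 0;
 }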

There are multiple SDKs and APIs available for programming GPUs for general purpose computation, including NVIDIA CUDA, ATI Stream SDK, OpenCL, Rapidmind, HMPP, and PGI Accelerator. Selecting the right approach for accelerating your program will depend on a number of factors, including which language you're currently using, portability, supported software functionality, and other considerations depending on your project.

At present CUDA is the predominant method for GPGPU acceleration, although it is only supported by NVIDIA GPUs. In the longer term OpenCL promises to become the vendor-neutral standard for programming heterogeneous multicore architectures. As it is largely based on CUDA, there will hopefully be relatively little difficulty making the switch from CUDA's proprietary API to OpenCL.

Where to find further information

The best place to read about GPU computing and GPGPU is the GPGPU.org website. It lists many publicly available tutorials, events and research papers that may be of interest to researchers interested in this field.

NVIDIA CUDA

CUDA, or "Compute Unified Device Architecture", is an NVIDIA SDK and associated toolkit for programming GPUs to perform general purpose computation, implemented as an extension to standard C. As well as a fully featured API for parallel programming on the GPU, CUDA includes standard numerical libraries, including FFT (Fast Fourier Transform) and BLAS (Basic Linear Algebra Subroutines), a visual profiler, and numerous examples illustrating the use of CUDA in a wide variety of applications.

Another CUDA library, CUDPP (CUDA Data Parallel Primitives), may be useful for developing CUDA programs and includes data-parallel algorithm primitives such as parallel prefix-sum ("scan"), parallel sort and parallel reduction.
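As a small illustration of using one of the bundled libraries, the sketch below performs an in-place 1D complex FFT with CUFFT; the array size is arbitrary, the input data is omitted, and one would link with -lcufft when compiling with nvcc:

 /* Illustrative CUFFT usage: in-place 1D complex-to-complex transform. */
 #include <cufft.h>
 #include <cuda_runtime.h>

 int main(void)
 {
     const int n = 1024;                      /* arbitrary signal length */
     cufftComplex *d_signal;
     cudaMalloc((void **)&d_signal, n * sizeof(cufftComplex));

     /* ... copy input data into d_signal with cudaMemcpy (omitted) ... */

     cufftHandle plan;
     cufftPlan1d(&plan, n, CUFFT_C2C, 1);     /* one 1D C2C transform */
     cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);
     cudaDeviceSynchronize();

     cufftDestroy(plan);
     cudaFree(d_signal);
     return 0;
 }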

For up to date information, tutorials, forums, etc. see the official NVIDIA CUDA site.

SDK Usage

The SDK is a good way to learn about CUDA: one can compile the examples and learn how the toolkit works. As we have a mix of different systems, all with varying capabilities (such is life on the bleeding edge), it is suggested that users try the supported versions on the corresponding systems; we will make these available.

It's worth noting that one can run CUDA programs in emulation on an x86 CPU, making code development possible even if one does not have access to a supported GPU.

The remainder of this section walks the user through compiling and running GPU accelerated jobs on Angel, the Tesla S1070 (compute capability 1.3) cluster, and Monk, the Tesla M2070 (compute capability 2.0) cluster. These currently have the CUDA v5.0 SDK and toolkit installed as a software module, so one should not need to set any CUDA configuration in one's ~/.bashrc shell configuration file. One may find details about the versions installed on Angel via the SHARCNET web portal CUDA software page.

Documentation for the default version of CUDA on the system can be found in:

 /opt/sharcnet/cuda/5.0.35/toolkit/doc

To compile any of the examples in the SDK, one should perform the following (or similar) steps. Here we have selected the deviceQuery example, which returns information about the GPU. First log in to the system's login node and copy the SDK to your personal directory (this example uses /scratch/$USER, but /work/$USER would also work):

ssh monk.sharcnet.ca   
cp -R /opt/sharcnet/cuda/5.0.35/samples /scratch/$USER/gpucomputingsdk
cd /scratch/$USER/gpucomputingsdk/
make

This compilation will take a long time as all examples are compiled. If you want to quickly compile one example, just switch into its directory and run make there. For example:

cd /scratch/$USER/gpucomputingsdk/1_Utilities/deviceQuery
make

will compile only the deviceQuery program.

Now you should find the deviceQuery binary in:

/scratch/$USER/gpucomputingsdk/bin/linux/release

In fact, all the binaries for the SDK examples will be there, as we have already compiled them for you.

Note that jobs on angel are not interactive, so the SDK examples that produce graphics output (via OpenGL) will not work. To try out those examples, please use one of our visualization workstations, where it is possible to run interactively. You can see these workstations listed on our systems page here. Some of these have AMD cards and will not support CUDA, so please check their specifications before you select one to work on.

You can then submit this to the system as a job to verify that it works properly:

 sqsub -q gpu --gpp=1 -r 10m -o CUDA_DQ_TEST /scratch/$USER/gpucomputingsdk/bin/linux/release/deviceQuery

This should list (on monk) 2 Tesla GPUs in the CUDA_DQ_TEST job output file:

deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "Tesla M2070"
 CUDA Driver Version / Runtime Version          5.0 / 5.0
 CUDA Capability Major/Minor version number:    2.0
 Total amount of global memory:                 5375 MBytes (5636554752 bytes)
 (14) Multiprocessors x ( 32) CUDA Cores/MP:    448 CUDA Cores
 GPU Clock rate:                                1147 MHz (1.15 GHz)
 Memory Clock rate:                             1566 Mhz
 Memory Bus Width:                              384-bit
 L2 Cache Size:                                 786432 bytes
 Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
 Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       49152 bytes
 Total number of registers available per block: 32768
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  1536
 Maximum number of threads per block:           1024
 Maximum sizes of each dimension of a block:    1024 x 1024 x 64
 Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             512 bytes
 Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
 Run time limit on kernels:                     No
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Enabled
 Device supports Unified Addressing (UVA):      Yes
 Device PCI Bus ID / PCI location ID:           20 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Tesla M2070"
 CUDA Driver Version / Runtime Version          5.0 / 5.0
 CUDA Capability Major/Minor version number:    2.0
 Total amount of global memory:                 5375 MBytes (5636554752 bytes)
 (14) Multiprocessors x ( 32) CUDA Cores/MP:    448 CUDA Cores
 GPU Clock rate:                                1147 MHz (1.15 GHz)
 Memory Clock rate:                             1566 Mhz
 Memory Bus Width:                              384-bit
 L2 Cache Size:                                 786432 bytes
 Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
 Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       49152 bytes
 Total number of registers available per block: 32768
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  1536
 Maximum number of threads per block:           1024
 Maximum sizes of each dimension of a block:    1024 x 1024 x 64
 Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             512 bytes
 Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
 Run time limit on kernels:                     No
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Enabled
 Device supports Unified Addressing (UVA):      Yes
 Device PCI Bus ID / PCI location ID:           21 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 2, Device0 = Tesla M2070, Device1 = Tesla M2070


The other examples can be submitted to run on the cluster in a similar fashion.

Compiling double precision CUDA kernels

When compiling CUDA kernels containing double precision variables, the compiler may produce warning messages of the following form:

 warning : Double is not supported. Demoting to float

This is because, when no architecture is specified, the NVIDIA compiler (nvcc) is very conservative and by default targets the oldest hardware generation (compute capability 1.0), which lacks double precision support. If you are compiling CUDA code for angel then you will need to add the following additional compiler flag to enable extended support for things like double precision:

 -arch sm_13

This indicates that you want to produce a GPU binary targeting compute capability 1.3 hardware (eg. the S1070 GPUs in angel).

On newer hardware (i.e. that in monk) use the flag

 -arch=sm_20

to compile specifically for its compute capability.
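As a quick, purely illustrative check (the file and kernel names below are made up), a toy kernel such as the following triggers the demotion warning when compiled without an architecture flag, and compiles with true double precision when given -arch sm_13 (angel) or -arch=sm_20 (monk):

 /* double_check.cu -- illustrative only.
    nvcc -c double_check.cu                -> "Double is not supported" warning
    nvcc -c -arch sm_13 double_check.cu    -> double precision on angel (S1070)
    nvcc -c -arch=sm_20 double_check.cu    -> double precision on monk (M2070)  */
 __global__ void scale(double *x, double a, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         x[i] *= a;      /* double precision arithmetic on the device */
 }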

Debugging CUDA programs

CUDA provides a debugger called cuda-gdb, broadly similar to the standard GNU gdb but with added functionality for GPU cards. For the debugger to work, the GPU card cannot be driving a display. Thus it will work on SHARCNET's angel cluster, but it will not work on our visualization workstations, except for the few whose GPUs are not connected to displays.

To compile with debugging information, use the flags:

nvcc -g -G --keep

The --keep flag will keep a number of intermediate files generated by the compiler which the debugger will look for. Please note that staff have experienced problems with code compiled with -G not running properly on earlier generation hardware (compute capability < 2.0).
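A sketch of a typical session might look like the following (the binary and kernel names are hypothetical); inside cuda-gdb the familiar gdb commands such as break and run apply, and breakpoints can be set on kernel names:

 nvcc -g -G --keep -o myprog myprog.cu
 cuda-gdb ./myprog
 (cuda-gdb) break myKernel
 (cuda-gdb) run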

Using cuda-gdb on angel requires an interactive session. The recommended way to do this is to submit a screen session as a job:

sqsub -q gpu --gpp=<GPP> -r <TIME> screen -D -m bash

Once the job begins running, figure out which compute node it has launched on:

sqjobs

and then ssh to this node and attach to the running screen session:

ssh -t <NODE> screen -r

You can access screen's options via the ctrl+a keystroke. Some examples are ctrl+a ? to bring up help and ctrl+a a to send a literal ctrl+a. See the screen man page (man screen) for more information. The message Suddenly the Dungeon collapses!! - You die... is screen's way of telling you it is being killed by the scheduler (most likely because the time you specified for the job has elapsed). The exit command will terminate the session.

If your job is an MPI GPU job, the screen submission should be:

sqsub -q gpu -f mpi --nompirun -n <NODES> -r <TIME> --gpp=<GPP> screen -D -m bash

Once the job starts, screen will be launched on the rank zero node. This may not be the lowest numbered node allocated, so you have to run

qstat -f -l <JOBID> | egrep exec_host

to find out which node it is (the first one listed). You can then proceed as in the non-MPI case. The command pbsdsh -o <COMMAND> can be used to run commands on all the allocated nodes (see man pbsdsh), and the command mpirun <COMMAND> can be used to start MPI programs on the nodes.

Using CUDA on multiple GPUs from the same host

One accesses multiple GPUs from the same host program by using the cudaSetDevice function to select which GPU each thread should use for launching its kernels, transferring memory, and so on.

This CUDA 5.0 example on monk illustrates the use of multiple GPUs per host process:

 /opt/sharcnet/cuda/5.0.35/samples/0_Simple/simpleMultiGPU

IMPORTANT Note: On angel/monk one should usually not need the cudaSetDevice function; their GPUs are configured in compute exclusive mode, so the GPU driver will allocate and lock GPUs to host processes on a first-come, first-served basis. This prevents the sharing that cudaSetDevice is normally used to avoid. However, cudaSetDevice will be needed if your job has requested more than one GPU on a given node; to fully utilize the allocated resources, you will then need cudaSetDevice to send kernels to multiple GPUs.
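A minimal sketch of that case (two GPUs driven from one host process, with purely illustrative kernel and variable names) might look like the following; the simpleMultiGPU sample referenced above shows a more complete version using streams:

 /* Illustrative only: one host process driving 2 GPUs in turn.
    Assumes the job was submitted with --gpp=2 so both devices are allocated. */
 #include <cuda_runtime.h>

 __global__ void work(float *x, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         x[i] = 2.0f * x[i];
 }

 int main(void)
 {
     const int n = 1 << 20;
     float *d_x[2];

     for (int dev = 0; dev < 2; dev++) {
         cudaSetDevice(dev);                        /* select GPU 0, then GPU 1 */
         cudaMalloc((void **)&d_x[dev], n * sizeof(float));
         /* ... copy this device's share of the data with cudaMemcpy ... */
         work<<<(n + 255) / 256, 256>>>(d_x[dev], n);
     }

     for (int dev = 0; dev < 2; dev++) {            /* wait for both GPUs */
         cudaSetDevice(dev);
         cudaDeviceSynchronize();
         cudaFree(d_x[dev]);
     }
     return 0;
 }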

CUDA + MPI hybrid programs

Writing programs that use MPI and CUDA is relatively straightforward if one understands how to use both. Intensive functions in an MPI program can be accelerated using CUDA in the same fashion as for a serial program, and multiple GPUs can be accessed per node in the same fashion as for serial programs by using threads. The key difference is how one compiles and links the programs.

All CUDA functions should be put in their own .cu files, and declared with extern "C". These can then be compiled with nvcc -c to create .o object files, eg.

 nvcc -c cuda_kernel.cu
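For illustration, a hypothetical cuda_kernel.cu might look like the sketch below: the extern "C" wrapper is what the MPI program calls, and only that wrapper needs to be declared in the host code (for Fortran an appropriate interface would be needed).

 /* cuda_kernel.cu -- hypothetical example of a CUDA file linked into an MPI code. */
 #include <cuda_runtime.h>

 __global__ void scale_kernel(float *x, float a, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         x[i] *= a;
 }

 /* C-callable wrapper: declare as
      extern void gpu_scale(float *x, float a, int n);
    in the MPI program. */
 extern "C" void gpu_scale(float *x, float a, int n)
 {
     float *d_x;
     cudaMalloc((void **)&d_x, n * sizeof(float));
     cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
     scale_kernel<<<(n + 255) / 256, 256>>>(d_x, a, n);
     cudaMemcpy(x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
     cudaFree(d_x);
 }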

Then one compiles the MPI program and links in the CUDA object files by invoking the MPI wrapper compiler with the additional specification of the CUDA runtime library (our environment modules set the correct LDFLAGS and LD_RUN_PATH for the current CUDA version automatically):

 mpicc program.c $LDFLAGS -lcudart  -Wl,-rpath=$LD_RUN_PATH cuda_kernel.o   
 mpiCC program.cpp $LDFLAGS -lcudart  -Wl,-rpath=$LD_RUN_PATH cuda_kernel.o    
 mpif77 program.f $LDFLAGS  -lcudart  -Wl,-rpath=$LD_RUN_PATH cuda_kernel.o

IMPORTANT Note: The caution given above about cudaSetDevice on angel/monk (the GPUs run in compute exclusive mode, and cudaSetDevice is only needed when more than one GPU per node has been requested) applies equally to CUDA + MPI programs.

Further Information

To learn more about using CUDA at SHARCNET, one should check the SHARCNET CUDA Software Page.

Good places to learn more about CUDA include the official NVIDIA CUDA site and GPGPU.org, mentioned above.

Other GPGPU Software

Please email help@sharcnet.ca if you are interested in using any of the following at SHARCNET.

Software officially supported by SHARCNET

These packages are officially supported and are listed in the SHARCNET web portal software listing.

OpenCL

Framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors, loosely based on NVIDIA CUDA.

PGI Accelerator

Using PGI Accelerator compilers, programmers can accelerate Linux applications on x64+GPU platforms by adding OpenMP-like compiler directives to existing high-level standard-compliant Fortran and C programs and then recompiling with appropriate compiler options.

Software which is supported in a limited fashion by SHARCNET

While we do not officially support and maintain these packages, we can readily provide assistance with their use on our systems.

PyCUDA

PyCUDA lets you access Nvidia's CUDA parallel computation API from Python.

PyOpenCL

PyOpenCL lets you access the OpenCL parallel computation API from Python.

Other software packages of potential interest

Most of these packages / frameworks are commercial software, are not supported on an ongoing basis by SHARCNET, and are listed here as suggestions. We will be happy to help you determine if they can meet your needs.

AMD Stream SDK

Stream programming model, using the ATI Brook+ language. Now implemented as version 2.0 with OpenCL v1.0 support.

AMD provides a webinar series which may be of interest.

RapidMind

A software development platform for C++ that includes types and operations used to express primarily data parallel computation. RapidMind was purchased by Intel in 2009 and the technology most recently appeared in their Array Building Blocks software.

CAPS HMPP

Similar to PGI Accelerator.

Theano

Python library for expressing matrix computations at a functional level. Transparent use of CPU or GPU via dynamic code generation and compilation. Uses CUDA internally. Open Source with BSD-style license. Higher level than PyCUDA because Theano generates the kernels for you for a wide variety of mathematical operations and expressions.

Swan

Swan is a small tool that aids the reversible conversion of existing CUDA codebases to OpenCL.

CLyther

CLyther is a Python tool similar to Cython: a Python language extension that makes writing OpenCL code as easy as Python itself. CLyther currently only supports a subset of the Python language definition but adds many new features to OpenCL.

Jacket

Jacket is a software platform for running Matlab code on GPUs. It was incorporated in the MATLAB Parallel Computing Toolbox in 2013.

CULA

CULA is a GPU-accelerated linear algebra library that utilizes CUDA. It implements many standard LAPACK routines and does not require any CUDA experience.

See the linear algebra on the GPU page for more information.

OpenMM

OpenMM is a library which provides tools for modern molecular modelling simulation with a strong emphasis on hardware acceleration. It is used in popular packages like GROMACS and AMBER.

OpenNL

Open Numerical Library (OpenNL) is a library for solving sparse linear systems on CPUs and GPUs.

HOOMD-blue

HOOMD-blue stands for Highly Optimized Object-oriented Many-particle Dynamics -- Blue Edition. It performs general purpose particle dynamics simulations on a single workstation, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to many processor cores on a fast cluster.

PyCULA

PyCULA is a module providing PyCUDA bindings for CULA.