## Contents

# Overview

This page contains a collection of useful tips and tricks for CUDA GPU programmers. Everyone (users and staff) is welcome to add to and modify this wiki page.

# Reduction

## Binary tree

This is the officially recommended way to perform reduction operations (sum, min, max etc. for multiple threads) in CUDA. The classic implementation is as follows:

// Reduction (min/max/avr/sum), valid only when blockDim.x is a power of two: int thread2; double temp; __shared__ double min[BLOCK_SIZE], max[BLOCK_SIZE], avg[BLOCK_SIZE], sum[BLOCK_SIZE]; int nTotalThreads = blockDim.x; // Total number of active threads while(nTotalThreads > 1) { int halfPoint = (nTotalThreads >> 1); // divide by two // only the first half of the threads will be active. if (threadIdx.x < halfPoint) { thread2 = threadIdx.x + halfPoint; // Get the shared value stored by another thread temp = min[thread2]; if (temp < min[threadIdx.x]) min[threadIdx.x] = temp; temp = max[thread2]; if (temp > max[threadIdx.x]) max[threadIdx.x] = temp; sum[threadIdx.x] += sum[thread2]; // when calculating the average, sum and divide avg[threadIdx.x] += avg[thread2]; avg[threadIdx.x] /= 2; } __syncthreads(); // Reducing the binary tree size by two: nTotalThreads = halfPoint; }

The above example will compute simultaneously minimum of all the elements in the shared array min[], maximum of max[], and average of avg[], and store the result in the first, [0] element of the respective arrays. There are some serious limitations of this approach. First, it only operates on threads within a single block (meaning a limit of 1024 threads with current hardware). Second, the block size has to be known at compile time (provided by the macro parameter BLOCK_SIZE). Third, it uses a lot of shared memory, which is very limited, so you cannot make too many reductions in one kernel. Finally, most crucially, it is actually not the most efficient way to do reductions for certain situations. E.g., if the number of threads is very small (say <20 or so), or if not every thread in the block participates in the reduction, you can actually be better off by using atomic operations (see below).

Important consideration: if you need double precision reduction (for compute capability >=2.0), or any floating point reduction (for compute capability <2.0), the binary tree is the only way to go, as there is no support for double precision in CUDA atomic operations (floating point reductions were introduced in capability 2.0). There is a workaround, but it is extremely slow, and should be reserved only for debugging/accuracy testing purposes, not for production runs.

The above example will only work if blockDim.x is a power of two. Sometimes this is not the case. Here is a simple modification which is valid for any blockDim.x.

// Reduction (min/max/avr/sum), works for any blockDim.x: int thread2; double temp; __shared__ double min[BLOCK_SIZE], max[BLOCK_SIZE], avg[BLOCK_SIZE], sum[BLOCK_SIZE]; int nTotalThreads = blockDim_2; // Total number of threads, rounded up to the next power of two while(nTotalThreads > 1) { int halfPoint = (nTotalThreads >> 1); // divide by two // only the first half of the threads will be active. if (threadIdx.x < halfPoint) { thread2 = threadIdx.x + halfPoint; // Skipping the fictious threads blockDim.x ... blockDim_2-1 if (thread2 < blockDim.x) { // Get the shared value stored by another thread temp = min[thread2]; if (temp < min[threadIdx.x]) min[threadIdx.x] = temp; temp = max[thread2]; if (temp > max[threadIdx.x]) max[threadIdx.x] = temp; sum[threadIdx.x] += sum[thread2]; // when calculating the average, sum and divide avg[threadIdx.x] += avg[thread2]; avg[threadIdx.x] /= 2; } } __syncthreads(); // Reducing the binary tree size by two: nTotalThreads = halfPoint; }

Here you will have to compute blockDim_2 (blockDim.x rounded up to the next power of two), either on device or on host (and then copy it to device). One could use the following function to compute blockDim_2, valid for 32-bit integers:

int NearestPowerOf2 (int n) { if (!n) return n; //(0 == 2^0) int x = 1; while(x < n) { x <<= 1; } return x; }

## Atomic operations

Atomic operations can be the most efficient (or the only) way to do reductions in CUDA kernels if any of the following conditions apply:

- Small number of threads (array elements subject to reduction) - say <20.
- When not every thread in the block participates in the reduction. (Say, the reduction is done only for every fifth element of an array.)
- If one doesn't know the block size (number of threads for reduction) at compile time.

Atomic operations is a wrong approach when you need double precision reductions (capability >=2.0) or floating point reductions (capability <2.0).

### Workarounds

If you really need to use double precision atomic operations with any compute capabiity, or single precision atomic operations with capability <2.0, you can use these workaround functions. Be aware: they are extremely (orders of magnitude) slower than proper atomic functions or binary tree reduction. They should only be used for debugging (say, debugging a capability 2.0 code on capability 1.3 GPU), or accuracy testing (to determine if single precision atomic reduction is accurate enough).

Single precision atomic add function:

__device__ inline void MyAtomicAdd (float *address, float value) { int oldval, newval, readback; oldval = __float_as_int(*address); newval = __float_as_int(__int_as_float(oldval) + value); while ((readback=atomicCAS((int *)address, oldval, newval)) != oldval) { oldval = readback; newval = __float_as_int(__int_as_float(oldval) + value); } }

Double precision atomic add function:

__device__ inline void MyAtomicAdd_8 (double *address, double value) { unsigned long long oldval, newval, readback; oldval = __double_as_longlong(*address); newval = __double_as_longlong(__longlong_as_double(oldval) + value); while ((readback=atomicCAS((unsigned long long *)address, oldval, newval)) != oldval) { oldval = readback; newval = __double_as_longlong(__longlong_as_double(oldval) + value); } }

### Writing universal CUDA/serial code with atomic operations

It is very convenient to use C macro language to write a code which can be compiled for different situations: for compute capability <2.0, >=2.0, accuracy testing (double precision atomic operations), and even serial (non-CUDA) code. Here is an example of how it can be set up.

#ifdef ATOMIC_8 #define FLOAT double #else #define FLOAT float #endif #ifdef CUDA #define DEVICE __device__ #ifdef OLDGPU // For capability <2.0 (but >=1.3) GPUs #define ATOMIC_ADD(x,y) MyAtomicAdd (x, (float)(y)) #else #define ATOMIC_ADD(x,y) atomicAdd (x, (float)(y)) #endif #define ATOMIC_MIN atomicMin #define ATOMIC_MAX atomicMax // Double presicion atomic add (super-slow, needed for accuracy testing only): #ifdef ATOMIC_8 #define ATOMIC_ADD(x,y) MyAtomicAdd_8 (x, (double)(y)) #endif #else // Fake atomics for serial version: #define ATOMIC_ADD(x,y) SerialAtomicAdd (x, (float)(y)) #define ATOMIC_MIN SerialAtomicMin #define ATOMIC_MAX SerialAtomicMax #ifdef ATOMIC_8 #define ATOMIC_ADD(x,y) SerialAtomicAdd_8 (x, (double)(y)) #endif #endif #ifdef CUDA DEVICE inline void MyAtomicAdd(float *address, float value) { int oldval, newval, readback; oldval = __float_as_int(*address); newval = __float_as_int(__int_as_float(oldval) + value); while ((readback=atomicCAS((int *)address, oldval, newval)) != oldval) { oldval = readback; newval = __float_as_int(__int_as_float(oldval) + value); } } DEVICE inline void MyAtomicAdd_8(double *address, double value) //See CUDA official forum { unsigned long long oldval, newval, readback; oldval = __double_as_longlong(*address); newval = __double_as_longlong(__longlong_as_double(oldval) + value); while ((readback=atomicCAS((unsigned long long *)address, oldval, newval)) != oldval) { oldval = readback; newval = __double_as_longlong(__longlong_as_double(oldval) + value); } } #else inline void SerialAtomicAdd(float *address, float value) { *address = *address + value; return; } inline void SerialAtomicAdd_8(double *address, double value) { *address = *address + value; return; } inline void SerialAtomicMin(int *address, int value) { if (value < *address) *address = value; return; } inline void SerialAtomicMax(int *address, int value) { if (value > *address) *address = value; return; } #endif

Once you make the above definitions for atomic floating point add, and integer min and max (more operations can be added in a similar fashion), they can be used throughout the code in the following manner:

FLOAT sum, x; integer min, max, i, j; ATOMIC_ADD (&sum, x); ATOMIC_MIN (&min, i); ATOMIC_MAX (&max, j);

The code compilation scenarios are as follows:

- If you compile your code with "nvcc -D CUDA", it will be compiled for capability >=2.0 GPUs, with proper (single precision) atomic operations.
- "nvcc -D CUDA -D OLDGPU": will compile for capability <2.0, single precision. (Only for debugging/testing).
- "nvcc -D CUDA -D ATOMIC_8": for capability >=2.0, will use a super-slow double precision workaround (for accuracy testing only).
- "nvcc -D CUDA -D OLDGPU -D ATOMIC_8": for capability <2.0, will use a super-slow double precision workaround (for accuracy testing only).
- "cc": will compile a serial version (without CUDA), with single precision add.
- "cc -D ATOMIC_8": will compile a serial version (without CUDA), with double precision add.