This online tutorial covers the material contained in the online seminar given in October 2017. The recording is available here.

## Why do linear algebra on the GPU?

• Since around 2012 many top of the line HPC systems rely on GPUs for their computing power.
• New Compute Canada systems (graham and cedar) acquired in 2017 have many GPUs.
• Graham has 160 GPU nodes (each 32 core, 128 GB RAM, 2 Tesla P100 GPUs)
• Cedar has 114 GPU nodes (each 32 core, 128 GB RAM, 4 Tesla P100 GPUs)
• Cedar also has 32 GPU large memory nodes (each 32 core, 256 GB RAM, 4 Tesla P100 GPUs)
• older GPUS systems are also still available

Given the availability of these resources, it's worthwhile to consider performing your linear algebra calculations on GPUs.

## GPU vs CPU hardware

CPU is designed to be a "jack-of-all-trades" device, able to do multiple, diverse tasks with low latency.

GPU is designed exclusively for number crunching, and is idea on performing a single task on multiple pieces of data.

Programming the GPU is more challenging, given the restrictions on its hardware. The programmer must decompose the problem into a large number of threads, with each thread ideally executing the same instruction but different data.

Fortunately, using linear algebra libraries avoids the problem of writing CUDA code to perform the operations.

## Memory bottleneck between GPU and CPU

The memory bandwidth between the GPU and CPU is limited, compared with the bandwidth to the memory of both the GPU and CPU. That means that for GPU calculation to be worthwhile, sufficient calculation must be done on the GPU to offset the time it takes to copy the data.

## Available libraries

• Standard CPU libraries are BLAS, LAPACK
• the GPU analogues are CUBLAS and MAGMA
• CUSPARSE is also available for sparse matrices

## CUBLAS

• Implementation of BLAS (Basic Linear Algebra Subprograms) on top of CUDA
• Included with CUDA (hence free)
• Workflow:
• allocate vectors and matrices in GPU memory
• fill them with data
• call sequence of CUBLAS functions
• transfer results from GPU memory to host
• Helper functions to transfer data to/from GPU provided

## Data layout

• for maximum compatibility with existing Fortran environments, CUBLAS library uses column-major storage and 1-based indexing (C/C++ uses row-major storage and 0-based indexing)
• use macro to convert
```#define IDX2F(i,j,ld) ((((j)-1)*(ld)) + ((i)-1))
```
• CUBLAS library can be used by applications written in Fortran, via wrappers
• in CUDA 4 and later, CUBLAS has a new API, uses header file cublas_v2.h. It is generally improved and simplified, as well as better suited to multi-GPU cases.
• in CUDA 6 and later, CUBLAS has support for multiple GPUs
• supports scaling across multiple  GPUs connected to same motherboard
• linear performance scaling with number  of GPUs
• matrix has to be distributed among  the GPU memory spaces

## CUBLAS example

### Error checks

• In following example most error checks were removed for clarity
• each CUBLAS function returns a status object containing information about possible errors
• It’s very important these objects to catch errors, via calls like this:
```if (status != CUBLAS_STATUS_SUCCESS) { print diagnostic information and exit}
```

### Matrix multiplication with CUBLAS - source code

```/* Includes, system */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Includes, cuda */
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Matrix size */
#define N  (275)

/* Main */
int main(int argc, char** argv)
{
cublasStatus_t status;
float* h_A;
float* h_B;
float* h_C;
float* d_A = 0;
float* d_B = 0;
float* d_C = 0;
float alpha = 1.0f;
float beta = 0.0f;
int n2 = N * N;
int i;
cublasHandle_t handle;

/* Initialize CUBLAS */

status = cublasCreate(&handle);

/* Allocate host memory for the matrices */
h_A = (float*)malloc(n2 * sizeof(h_A[0]));
h_B = (float*)malloc(n2 * sizeof(h_B[0]));

/* Fill the matrices with test data */
for (i = 0; i < n2; i++) {
h_A[i] = rand() / (float)RAND_MAX;
h_B[i] = rand() / (float)RAND_MAX;
}

/* Allocate device memory for the matrices */
if (cudaMalloc((void**)&d_A, n2 * sizeof(d_A[0])) != cudaSuccess) {
fprintf (stderr, "!!!! device memory allocation error (allocate A)\n");
return EXIT_FAILURE;
}
if (cudaMalloc((void**)&d_B, n2 * sizeof(d_B[0])) != cudaSuccess) {
fprintf (stderr, "!!!! device memory allocation error (allocate B)\n");
return EXIT_FAILURE;
}
if (cudaMalloc((void**)&d_C, n2 * sizeof(d_C[0])) != cudaSuccess) {
fprintf (stderr, "!!!! device memory allocation error (allocate C)\n");
return EXIT_FAILURE;
}

/* Initialize the device matrices with the host matrices */
status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);

/* Performs operation using cublas */
status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);

/* Allocate host memory for reading back the result from device memory */
h_C = (float*)malloc(n2 * sizeof(h_C[0]));

/* Read the result back */
status = cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);

/* Memory clean up */
free(h_A);
free(h_B);
free(h_C);
if (cudaFree(d_A) != cudaSuccess) {
fprintf (stderr, "!!!! memory free error (A)\n");
return EXIT_FAILURE;
}
if (cudaFree(d_B) != cudaSuccess) {
fprintf (stderr, "!!!! memory free error (B)\n");
return EXIT_FAILURE;
}
if (cudaFree(d_C) != cudaSuccess) {
fprintf (stderr, "!!!! memory free error (C)\n");
return EXIT_FAILURE;
}

/* Shutdown */
status = cublasDestroy(handle);

return EXIT_SUCCESS;
}```

### CUBLAS performance

• multiplying single-precision (SP)/double precision (DP) matrices (24576 x 32) and (32 x 24576), result is a (24576 x 24576) matrix
• CPU Intel MKL library (11.3.4) - single core and threaded
• CUBLAS library (CUDA 8.0 version)
• GPU timing results with and without memory transfer
• tested on graham P100 Tesla (1 GPU)
• cudaMallocHost used to allocate CPU memory
• NOTE: non-Tesla cards generally offer poor DP performance (DP calculations 10x slower)

Timing results:

SP time (s) DP time (s)
CPU - 1 core 1.53 3.07
CPU - 16 core 0.13 0.25
CUBLAS - without memory transfer 0.021 0.025
CUBLAS - with memory transfer 0.43 0.62

Same data expressed as speedup relative to CPU-only multithreaded result

SP time (s) DP time (s)
CPU - 1 core 0.08x 0.08x
CPU - 16 core 1.0x 1.0x
CUBLAS - without memory transfer 6.2x 10.0x
CUBLAS - with memory transfer 0.3x 0.4x

This clearly shows that doing a single operation on the GPU does not make sense as the data movement between GPU and CPU takes too long. However, if one does multiple operations keeping the data on the GPU, then it is worthwhile.

## CUSPARSE

• included with CUDA (so it's free)
• set of basic linear algebra subroutines for handling sparse matrices on top of CUDA
• analogue of CUBLAS, but more basic in functionality

## MAGMA library

• The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current "Multicore+GPU" systems.
• Website: [1]
• free and open source
• implementations for CPU, NVIDIA GPU, Intel Phi, OpenCL
• check MAGMA documentation for which routines implemented where
• multi-GPU support for certain routines

## MAGMA library on Compute Canada systems

MAGMA is installed as a module on new clusters. It can also be compiled from source if one needs the testing suite.

### Magma example

To try out the example included with magma, using already installed MAGMA module, download and unpack the tar file:

```wget http://icl.cs.utk.edu/projectsfiles/magma/downloads/magma-2.2.0.tar.gz
tar xvfz magma-2.2.0.tar.gz
cd magma-2.2.0/example
```

To compile the examples, first edit the Makefile to contain lines:

```MAGMADIR     = \$(EBROOTMAGMA)
OPENBLASDIR  = \$(EBROOTOPENBLAS)
```

and

```LDFLAGS       = -Wall -lifcore #-fopenmp
```

Next, load the modules and compile:

```module load cuda/8.0.44
make c
```

To run the example, first request a session on an interactive GPU node:

```salloc --time=1:00:0 --nodes=1 --ntasks=1 --cpus-per-task=32 --gres=gpu:2   --account=def-\$USER
```

Once that starts, run the example with:

```./example_v1
```

Below is the part of the example code that actually call the MAGMA routine to perform the linear algebra operation.

```// Solve dA * dX = dB, where dA and dX are stored in GPU device memory.
// Internally, MAGMA uses a hybrid CPU + GPU algorithm.
void gpu_interface( magma_int_t n, magma_int_t nrhs )
{
magma_int_t *ipiv=NULL;
magma_int_t ldda = ((n+31)/32)*32;  // round up to multiple of 32 for best GPU performance
magma_int_t lddx = ldda;
magma_int_t info = 0;

// magma malloc (GPU) routines are type-safe,
// but you can use cudaMalloc if you prefer.
magma_zmalloc( &dA, ldda*n );
magma_zmalloc( &dX, lddx*nrhs );
magma_imalloc_cpu( &ipiv, n );  // ipiv always on CPU
if ( dA == NULL || dX == NULL || ipiv == NULL ) {
fprintf( stderr, "malloc failed\n" );
goto cleanup;
}

zfill_matrix_gpu( n, n, dA, ldda );
zfill_rhs_gpu( n, nrhs, dX, lddx );

magma_zgesv_gpu( n, 1, dA, ldda, ipiv, dX, ldda, &info );
if ( info != 0 ) {
fprintf( stderr, "magma_zgesv_gpu failed with info=%d\n", info );
}
cleanup:
magma_free( dA );
magma_free( dX );
magma_free_cpu( ipiv );
}```

## Building MAGMA

To use the extensive testing suite, MAGMA has to be compiled from source. The instructions below were tested on graham in January, 2018.

```module load cuda/8.0.44
unset ARCH
export GPU_TARGET=Pascal
tar xvfz magma-2.2.0.tar.gz
cd magma-2.2.0
cp make.inc-examples/make.inc.mkl-icc make.inc
make
cd testing
```

### MAGMA performance

To run the benchmark, first request a session on an interactive GPU node:

```salloc --time=1:00:0 --nodes=1 --ntasks=1 --cpus-per-task=32 --gres=gpu:2   --account=def-\$USER
```

Once that starts, run the benchmark (here with 32 threads):

```OMP_NUM_THREADS=32  ./testing_zgesv -l -N 1088:14400:1024
```

The options after -N indicate the minimum, maximum, and step variable respectively, for trying matrices of different sizes.

• ZGESV (complex linear system solver) from the testing suite
• Full CPU (red)
• 1 GPU (blue)
• 2 GPU (green)