- GPUs have 2-4x greater peak floating point performance and memory bandwidth per watt
  - also per dollar if you buy enterprise gear
  - even better per dollar if you buy gaming gear
- Step 1 is to assess workload and latency requirements
- Don’t waste time with GPUs if
  - your problem size or time-to-solution requirements don’t align
  - the work you’d like to move to the GPU is not a bottleneck
  - the computation cost will be dwarfed by moving data to/from the GPU
    - often you need to restructure so that the caller passes in data already on the device
    - this can require nonlocal refactoring
- What almost never works: picking one kernel at a time and moving it to the GPU
- DOE ACME/E3SM (to pick on one high-profile application) has basically done this for five years and it still doesn’t help their production workloads, so they bought a non-GPU machine
Okay, okay, okay. What if I have the right workload?
Terminology/intro
- An even easier introduction to CUDA
On the CPU, we have a thread with vector registers/instructions
In CUDA, we write code inside a single vector lane (“confusingly” called a CUDA thread)
To get inside the lane, we launch a kernel from the CPU using special syntax

```c
add<<<numBlocks, blockSize>>>(N, x, y);
```

- needs to be compiled using `nvcc`
- Logically 1D/2D/3D rectangular tiled iteration space
- There are many constraints and limitations to the iteration “grid”
- Control flow for CUDA threads is nominally independent, but performance will be poor if you don’t coordinate threads within each block.
- Implicit coordination
  - Memory coalescing
  - Organize your algorithm to limit “divergence”
- Explicit coordination (see the reduction sketch after this list)
  - Shared memory
  - `__syncthreads()`
  - Warp shuffles
- We implement the kernel using the `__global__` attribute
  - Visible from the CPU
- Special built-in variables are defined: `gridDim`, `blockIdx`, `blockDim`, `threadIdx`
- There is also `__device__`, which is callable from other device functions
- Can use `__host__ __device__` to compile two versions
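As a rough sketch of how these attributes and the built-in variables fit together (the names `scale_add` and `saxpy` below are hypothetical, not from these notes):

```c
// __host__ __device__: compiled for both the CPU and the GPU (hypothetical helper)
__host__ __device__ float scale_add(float a, float x, float y) {
  return a * x + y;
}

// __global__: visible from the CPU, launched with <<<numBlocks, blockSize>>>
__global__ void saxpy(int n, float a, const float *x, float *y) {
  // built-in variables identify this CUDA thread (vector lane) within the grid
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = scale_add(a, x[i], y[i]);
}

// plain host code can call the same helper
void saxpy_host(int n, float a, const float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = scale_add(a, x[i], y[i]);
}
```

The kernel would be launched as `saxpy<<<(n + 255) / 256, 256>>>(n, a, x, y);` with `x` and `y` already resident on (or accessible to) the device.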
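For the explicit coordination tools mentioned above (shared memory, `__syncthreads()`, warp shuffles), here is a block-level sum reduction sketch; the kernel `block_sum` and its structure are illustrative, not taken from these notes:

```c
// Each block reduces blockDim.x values to one partial sum.
// Assumes blockDim.x is a multiple of 32 (the warp size).
__global__ void block_sum(const float *in, float *block_results, int n) {
  __shared__ float warp_sums[32];            // shared memory: visible to the whole block
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  float v = (i < n) ? in[i] : 0.0f;

  // warp shuffle: combine values within a warp without touching shared memory
  for (int offset = 16; offset > 0; offset /= 2)
    v += __shfl_down_sync(0xffffffff, v, offset);

  int lane = threadIdx.x % 32, warp = threadIdx.x / 32;
  if (lane == 0) warp_sums[warp] = v;        // one partial sum per warp
  __syncthreads();                           // explicit coordination across the block

  if (warp == 0) {                           // first warp reduces the per-warp sums
    v = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
    for (int offset = 16; offset > 0; offset /= 2)
      v += __shfl_down_sync(0xffffffff, v, offset);
    if (lane == 0) block_results[blockIdx.x] = v;
  }
}
```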
How does this relate to the hardware?
- Each thread block is assigned to one streaming multiprocessor (SM)
- Executed in warps (a warp is 32 CUDA threads, matching the number of hardware lanes)
- Multiple warps (from the same or different thread blocks) execute like hyperthreads
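These hardware parameters can be queried at runtime; a small sketch (the program itself is illustrative, but the runtime call and struct fields are standard CUDA):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // device 0
  printf("SMs:                   %d\n", prop.multiProcessorCount);
  printf("warp size (lanes):     %d\n", prop.warpSize);
  printf("max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
  printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
  return 0;
}
```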
Practical CUDA
CUDA Best Practices Guide
Occupancy
> Thread instructions are executed sequentially in CUDA, and, as a result, executing other warps when one warp is paused or stalled is the only way to hide latencies and keep the hardware busy. Some metric related to the number of active warps on a multiprocessor is therefore important in determining how effectively the hardware is kept busy [emphasis added]. This metric is occupancy.
- Reality: occupancy is just one aspect, and often inversely correlated with keeping the hardware busy (and with performance).
> Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps.
- If your kernel uses fewer registers/less shared memory, more warps can be scheduled.
Register/shared memory usage is determined by the compiler.
```python
def render_c(filename):
    from IPython.display import Markdown
    with open(filename) as f:
        contents = f.read()
    return Markdown("```c\n" + contents + "```\n")

render_c('add.cu')
```
```c
__global__ void add(int n, float *x, float *y) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] += x[i];
}
```
```
! nvcc -arch sm_75 --resource-usage -c add.cu
```

```
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z3addiPfS_' for 'sm_75'
ptxas info : Function properties for _Z3addiPfS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 376 bytes cmem[0]
```
```python
render_c('copy.cu')
```

```c
__global__ void copy(float *dst, float *src) {
  int iblock = blockIdx.x + blockIdx.y * gridDim.x;
  int index = threadIdx.x + TILE_SIZE * iblock * blockDim.x;
  float a[TILE_SIZE]; // allocated in registers
  for (int i=0; i<TILE_SIZE; i++)
    a[i] = src[index + i * blockDim.x];
  for (int i=0; i<TILE_SIZE; i++)
    dst[index + i * blockDim.x] = a[i];
}
```
```
! nvcc -arch sm_75 --resource-usage -DTILE_SIZE=16 -c copy.cu
```

```
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z4copyPfS_' for 'sm_75'
ptxas info : Function properties for _Z4copyPfS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 64 registers, 368 bytes cmem[0]
```
The Occupancy Calculator can compute occupancy based on the register and shared memory usage.
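Occupancy can also be estimated programmatically with the runtime occupancy API; a sketch for the `add` kernel above (the block size of 256 is an arbitrary choice for illustration):

```c
#include <stdio.h>
#include <cuda_runtime.h>

// same grid-stride add kernel as in add.cu above
__global__ void add(int n, float *x, float *y) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] += x[i];
}

int main(void) {
  int blockSize = 256, numBlocksPerSM = 0;
  // how many blocks of this size fit on one SM, given the kernel's register/shared usage
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, add, blockSize, 0);

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  int activeWarps = numBlocksPerSM * blockSize / prop.warpSize;
  int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
  printf("predicted occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
  return 0;
}
```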
You can tell the compiler to reduce register usage, sometimes at the expense of spills.
```
! nvcc -arch sm_75 --resource-usage -DTILE_SIZE=16 --maxrregcount 24 -c copy.cu
```

```
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z4copyPfS_' for 'sm_75'
ptxas info : Function properties for _Z4copyPfS_
    80 bytes stack frame, 76 bytes spill stores, 76 bytes spill loads
ptxas info : Used 24 registers, 368 bytes cmem[0]
```
Further reading
- Vasily Volkov (2010) Better Performance at Lower Occupancy (slides)
- Vasily Volkov (2016) Understanding Latency Hiding on GPUs (very in-depth)
- Kasia Swirydowicz (2018) Finite Element Stiffness Matrix Action: monolithic kernel optimization on Titan V
Memory
- GPU memory is not CPU memory
Duh, so why does NVIDIA publish this?
Getting your memory into position is often the hardest part of CUDA programming.
- Allocate memory on the GPU: `cudaMalloc(&xdevice, N*sizeof(double));`
- Populate it from the host: `cudaMemcpy(xdevice, xhost, N*sizeof(double), cudaMemcpyHostToDevice);`
- Repeat for all data, including control parameters
- Easy to forget, ongoing maintenance/complexity cost
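Putting those steps together, a minimal sketch of the explicit-copy workflow around the `add` kernel from above (error checking omitted; the wrapper name `add_on_gpu` is made up):

```c
#include <cuda_runtime.h>

// grid-stride add kernel as defined in add.cu above
// (compile both files with nvcc -rdc=true, or paste the definition here)
__global__ void add(int n, float *x, float *y);

void add_on_gpu(int N, const float *xhost, float *yhost) {
  float *xdevice, *ydevice;
  // 1. allocate memory on the GPU
  cudaMalloc(&xdevice, N * sizeof(float));
  cudaMalloc(&ydevice, N * sizeof(float));
  // 2. populate it from the host
  cudaMemcpy(xdevice, xhost, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(ydevice, yhost, N * sizeof(float), cudaMemcpyHostToDevice);
  // 3. launch the kernel on device data
  add<<<(N + 255) / 256, 256>>>(N, xdevice, ydevice);
  // 4. copy the result back and clean up
  cudaMemcpy(yhost, ydevice, N * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(xdevice);
  cudaFree(ydevice);
}
```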
Unified/managed memory
Allocate “managed” memory, accessible from CPU and GPU: `cudaMallocManaged(&x, N*sizeof(float));`
How?
- With OpenACC, make all dynamic allocations in managed memory: `pgcc -ta=tesla:managed`
- The GPU probably has a lot less memory than you have DRAM
- Really convenient for incremental work in legacy code
- Performance isn’t great without `cudaMemPrefetchAsync`
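A minimal sketch of the managed-memory path with prefetching (the wrapper `add_managed` and the sizes are illustrative; the `add` kernel is the one from add.cu):

```c
#include <cuda_runtime.h>

// grid-stride add kernel as defined in add.cu above
// (compile both files with nvcc -rdc=true, or paste the definition here)
__global__ void add(int n, float *x, float *y);

void add_managed(int N) {
  float *x, *y;
  // managed allocations are accessible from both the CPU and the GPU
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }  // initialize on the host

  int device;
  cudaGetDevice(&device);
  // without prefetching, pages migrate on demand (page faults) when the kernel touches them
  cudaMemPrefetchAsync(x, N * sizeof(float), device, 0);
  cudaMemPrefetchAsync(y, N * sizeof(float), device, 0);

  add<<<(N + 255) / 256, 256>>>(N, x, y);
  cudaDeviceSynchronize();   // wait for the kernel before touching y on the host again

  // optionally prefetch the result back before reading it on the host
  cudaMemPrefetchAsync(y, N * sizeof(float), cudaCpuDeviceId, 0);
  // ... use y[] on the host ...
  cudaFree(x);
  cudaFree(y);
}
```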
Further reading: Maximizing Unified Memory Performance in CUDA
On memory coalescing and strided access
```c
__global__ void strideCopy(float *odata, float *idata, int stride) {
  int xid = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
  odata[xid] = idata[xid];
}
```
Lose half your bandwidth for stride=2
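To see the effect, one can time `strideCopy` over a range of strides; a rough sketch (the problem size, grid shape, and timing harness are arbitrary choices, not from these notes):

```c
#include <stdio.h>
#include <cuda_runtime.h>

// same strided copy kernel as above
__global__ void strideCopy(float *odata, float *idata, int stride) {
  int xid = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
  odata[xid] = idata[xid];
}

int main(void) {
  int n = 1 << 22;          // threads per launch (arbitrary)
  int maxStride = 8;
  float *in, *out;
  cudaMalloc(&in, (size_t)n * maxStride * sizeof(float));
  cudaMalloc(&out, (size_t)n * maxStride * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  for (int stride = 1; stride <= maxStride; stride++) {
    cudaEventRecord(start);
    strideCopy<<<n / 256, 256>>>(out, in, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // useful bytes: n floats read + n floats written (extra bytes fetched per transaction are wasted)
    double gbps = 2.0 * n * sizeof(float) / (ms * 1e-3) / 1e9;
    printf("stride %d: %.1f GB/s effective\n", stride, gbps);
  }
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

With stride=2 only half the bytes in each memory transaction are useful, which is where the factor of two comes from.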