def render_c(filename):
    """Display a C source file as a syntax-highlighted Markdown code block."""
    from IPython.display import Markdown
    with open(filename) as f:
        contents = f.read()
    return Markdown("```c\n" + contents + "```\n")
What is OpenMP?
A community-developed standard Application Programming Interface (with “directives”) for
- multithreaded programming
- vectorization
- offload to coprocessors (such as GPUs)
OpenMP is available for C, C++, and Fortran.
Latest version: OpenMP-5.0, released November 2018. Implementations are still incomplete!
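Each of these capabilities is expressed with a directive. The sketch below is purely illustrative (the axpy kernels and the map clauses are placeholders of my own, not code from these notes), but it shows what the three families look like in C:

void axpy_threads(int n, double a, const double *x, double *y) {
  #pragma omp parallel for  // multithreading: split loop iterations across threads
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}

void axpy_simd(int n, double a, const double *x, double *y) {
  #pragma omp simd  // vectorization: ask the compiler to emit SIMD code for this loop
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}

void axpy_offload(int n, double a, const double *x, double *y) {
  // offload: run the loop on a device such as a GPU, copying x in and y in/out
  #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}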
OpenMP Resources
- OpenMP-5.0 Reference Cards (a few pages, printable)
- OpenMP-5.0 Standard
- OpenMP-4.5 Examples
- LLNL Tutorial
- Mattson: The OpenMP Common Core from ATPESC (video)
#pragma omp parallel
The standard is big, but most applications only use a few constructs.
render_c('omp-hello.c')
#include <omp.h>
#include <stdio.h>
int main() {
  #pragma omp parallel
  {
    int num_threads = omp_get_num_threads();
    int my_thread_num = omp_get_thread_num();
    printf("I am %d of %d\n", my_thread_num, num_threads);
  }
  return 0;
}
!make CFLAGS='-fopenmp -Wall' -B omp-hello
cc -fopenmp -Wall omp-hello.c -o omp-hello
!./omp-hello
I am 1 of 4
I am 2 of 4
I am 0 of 4
I am 3 of 4
!OMP_NUM_THREADS=8 ./omp-hello
I am 0 of 8
I am 7 of 8
I am 1 of 8
I am 3 of 8
I am 4 of 8
I am 6 of 8
I am 2 of 8
I am 5 of 8
Parallelizing triad
void triad(int N, double *a, const double *b, double scalar, const double *c) {
  #pragma omp parallel
  {
    for (int i=0; i<N; i++)
      a[i] = b[i] + scalar * c[i];
  }
}
What does this code do?
void triad(int N, double *a, const double *b, double scalar, const double *c) {
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    for (int i=id; i<N; i+=num_threads)
      a[i] = b[i] + scalar * c[i];
  }
}
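Hand-distributing iterations like this works, but the usual idiom is to let the work-sharing construct split the loop. A minimal sketch (triad_for is just an illustrative name, not one of the course source files):

void triad_for(int N, double *a, const double *b, double scalar, const double *c) {
  // The combined construct creates a team of threads and divides the
  // loop iterations among them (by default in contiguous chunks).
  #pragma omp parallel for
  for (int i=0; i<N; i++)
    a[i] = b[i] + scalar * c[i];
}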
Parallelizing dot
static double dot_ref(size_t n, const double *a, const double *b) {
  double sum = 0;
  for (size_t i=0; i<n; i++)
    sum += a[i] * b[i];
  return sum;
}
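The benchmark below also runs dot_opt1, dot_opt2, and dot_opt3, whose sources are not shown in this section. A typical OpenMP parallelization of dot uses a reduction clause; here is a minimal sketch (dot_omp is an illustrative name, and the actual dot_opt variants in dot.c may be implemented differently):

static double dot_omp(size_t n, const double *a, const double *b) {
  double sum = 0;
  // Each thread accumulates a private partial sum; OpenMP combines them at the end.
  #pragma omp parallel for reduction(+:sum)
  for (size_t i=0; i<n; i++)
    sum += a[i] * b[i];
  return sum;
}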
!make CFLAGS='-O3 -march=native -fopenmp' -B dot
cc -O3 -march=native -fopenmp dot.c -o dot
!OMP_NUM_THREADS=2 ./dot -r 10 -n 10000
Name flops ticks flops/tick
dot_ref 20000 40327 0.50
dot_ref 20000 35717 0.56
dot_ref 20000 36096 0.55
dot_ref 20000 36487 0.55
dot_ref 20000 37157 0.54
dot_ref 20000 36024 0.56
dot_ref 20000 35322 0.57
dot_ref 20000 36601 0.55
dot_ref 20000 72193 0.28
dot_ref 20000 37924 0.53
dot_opt1 20000 51256384 0.00
dot_opt1 20000 23343145 0.00
dot_opt1 20000 4646174 0.00
dot_opt1 20000 16710 1.20
dot_opt1 20000 15512 1.29
dot_opt1 20000 16016 1.25
dot_opt1 20000 16982 1.18
dot_opt1 20000 452064 0.04
dot_opt1 20000 16278 1.23
dot_opt1 20000 16311 1.23
dot_opt2 20000 24616 0.81
dot_opt2 20000 16095 1.24
dot_opt2 20000 17561 1.14
dot_opt2 20000 16270 1.23
dot_opt2 20000 18130 1.10
dot_opt2 20000 16831 1.19
dot_opt2 20000 16968 1.18
dot_opt2 20000 16391 1.22
dot_opt2 20000 17063 1.17
dot_opt2 20000 16315 1.23
dot_opt3 20000 77013 0.26
dot_opt3 20000 12419 1.61
dot_opt3 20000 12124 1.65
dot_opt3 20000 12193 1.64
dot_opt3 20000 12051 1.66
dot_opt3 20000 12009 1.67
dot_opt3 20000 11944 1.67
dot_opt3 20000 12032 1.66
dot_opt3 20000 12687 1.58
dot_opt3 20000 12188 1.64
Vectorization
OpenMP-4.0 added the omp simd construct, which is a portable way to request that the compiler vectorize code.
One common reason a compiler may fail to vectorize a loop is possible aliasing, which we investigate below.
render_c('triad.c')
#include <stdlib.h>

void triad(size_t N, double *a, const double *b, double scalar, const double *c) {
  for (size_t i=0; i<N; i++)
    a[i] = b[i] + scalar * c[i];
}
!gcc -O2 -ftree-vectorize -fopt-info-all -c triad.c
Unit growth for small function inlining: 15->15 (0%)
Inlined 0 calls, eliminated 0 functions
triad.c:4:3: optimized: loop vectorized using 16 byte vectors
triad.c:4:3: optimized: loop versioned for vectorization because of possible aliasing
triad.c:3:6: note: vectorized 1 loops in function.
triad.c:4:3: optimized: loop turned into non-loop; it never loops
- gcc autovectorization starts at -O3 or if you use -ftree-vectorize
- options such as -fopt-info give useful diagnostics, but are compiler-dependent and sometimes referring to assembly is useful
- man gcc with search (/) is your friend
What is aliasing?
Is this valid code? What is x after this call?
double x[5] = {1, 2, 3, 4, 5};
triad(2, x+1, x, 10., x);
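To make the hazard concrete, here is a worked trace of that call assuming plain sequential, in-order execution (an annotation added here, not part of the original snippet):

/* a = x+1, b = c = x, scalar = 10, N = 2:
 *   i = 0: x[1] = x[0] + 10*x[0] = 1 + 10   = 11
 *   i = 1: x[2] = x[1] + 10*x[1] = 11 + 110 = 121   (reads the value just written)
 * so x becomes {1, 11, 121, 4, 5}.
 * A compiler that assumed the arrays do not overlap could load x[1] before
 * overwriting it, giving x[2] = 2 + 20 = 22 instead. */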
C allows memory to overlap arbitrarily. You can promise the compiler that arrays do not overlap by using the restrict qualifier (C99/C11; __restrict or __restrict__ work with most C++ and CUDA compilers).
render_c('triad-restrict.c')
void triad(int N, double *restrict a, const double *restrict b, double scalar, const double *restrict c) {
  for (int i=0; i<N; i++)
    a[i] = b[i] + scalar * c[i];
}
!gcc -O2 -march=native -ftree-vectorize -fopt-info-all -c triad-restrict.c
Unit growth for small function inlining: 15->15 (0%)
Inlined 0 calls, eliminated 0 functions
triad-restrict.c:2:5: optimized: loop vectorized using 32 byte vectors
triad-restrict.c:1:6: note: vectorized 1 loops in function.
Notice that there is no longer a “loop versioned for vectorization because of possible aliasing” message.
The complexity of checking for aliasing can grow combinatorially in the number of arrays being processed, leading to many loop variants and/or preventing vectorization.
Aside: Warnings
The -Wrestrict flag (included in -Wall) can catch some programming errors, such as:
void foo(double *x) {
  triad(2, x, x, 10, x);
}
!gcc -O2 -Wall -c triad-foo.c
The powers of -Wrestrict are limited, however; as of gcc-9 it does not even catch
void foo(double *x) {
  triad(2, x+1, x, 10, x);
}
Check the assembly
!objdump -d --prefix-addresses -M intel triad-restrict.o
triad-restrict.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <triad> test edi,edi
0000000000000002 <triad+0x2> jle 0000000000000067 <triad+0x67>
0000000000000004 <triad+0x4> lea eax,[rdi-0x1]
0000000000000007 <triad+0x7> cmp eax,0x2
000000000000000a <triad+0xa> jbe 0000000000000074 <triad+0x74>
000000000000000c <triad+0xc> mov r8d,edi
000000000000000f <triad+0xf> shr r8d,0x2
0000000000000013 <triad+0x13> vbroadcastsd ymm2,xmm0
0000000000000018 <triad+0x18> shl r8,0x5
000000000000001c <triad+0x1c> xor eax,eax
000000000000001e <triad+0x1e> xchg ax,ax
0000000000000020 <triad+0x20> vmovupd ymm1,YMMWORD PTR [rcx+rax*1]
0000000000000025 <triad+0x25> vfmadd213pd ymm1,ymm2,YMMWORD PTR [rdx+rax*1]
000000000000002b <triad+0x2b> vmovupd YMMWORD PTR [rsi+rax*1],ymm1
0000000000000030 <triad+0x30> add rax,0x20
0000000000000034 <triad+0x34> cmp rax,r8
0000000000000037 <triad+0x37> jne 0000000000000020 <triad+0x20>
0000000000000039 <triad+0x39> mov eax,edi
000000000000003b <triad+0x3b> and eax,0xfffffffc
000000000000003e <triad+0x3e> test dil,0x3
0000000000000042 <triad+0x42> je 0000000000000070 <triad+0x70>
0000000000000044 <triad+0x44> vzeroupper
0000000000000047 <triad+0x47> cdqe
0000000000000049 <triad+0x49> nop DWORD PTR [rax+0x0]
0000000000000050 <triad+0x50> vmovsd xmm1,QWORD PTR [rcx+rax*8]
0000000000000055 <triad+0x55> vfmadd213sd xmm1,xmm0,QWORD PTR [rdx+rax*8]
000000000000005b <triad+0x5b> vmovsd QWORD PTR [rsi+rax*8],xmm1
0000000000000060 <triad+0x60> inc rax
0000000000000063 <triad+0x63> cmp edi,eax
0000000000000065 <triad+0x65> jg 0000000000000050 <triad+0x50>
0000000000000067 <triad+0x67> ret
0000000000000068 <triad+0x68> nop DWORD PTR [rax+rax*1+0x0]
0000000000000070 <triad+0x70> vzeroupper
0000000000000073 <triad+0x73> ret
0000000000000074 <triad+0x74> xor eax,eax
0000000000000076 <triad+0x76> jmp 0000000000000047 <triad+0x47>
- How do the results change if you go up and replace -march=native with -march=skylake-avx512 -mprefer-vector-width=512?
- Is the assembly qualitatively different without restrict (in which case the compiler “versions” the loop)?
Pragma omp simd
An alternative (or supplement) to restrict is #pragma omp simd.
render_c('triad-omp-simd.c')
void triad(int N, double *a, const double *b, double scalar, const double *c) {
  #pragma omp simd
  for (int i=0; i<N; i++)
    a[i] = b[i] + scalar * c[i];
}
!gcc -O2 -march=native -ftree-vectorize -fopenmp -fopt-info-all -c triad-omp-simd.c
Unit growth for small function inlining: 15->15 (0%)
Inlined 0 calls, eliminated 0 functions
triad-omp-simd.c:4:17: optimized: loop vectorized using 32 byte vectors
triad-omp-simd.c:1:6: note: vectorized 1 loops in function.