def render_c(filename):
    """Render a C/ISPC source file as a syntax-highlighted code block in the notebook."""
    from IPython.display import Markdown
    with open(filename) as f:
        contents = f.read()
    return Markdown("```c\n" + contents + "```\n")
Architecture | Directives | SIMD | SPMD |
---|---|---|---|
Intel AVX+ (SIMD) | #pragma omp simd | intrinsics | ISPC |
CUDA (SIMT) | #pragma omp target | C++ templates and other high-level APIs | CUDA |
ISPC: Intel SPMD Program Compiler
We can program SIMT devices (e.g., CUDA) using directives, but we can also program SIMD hardware (e.g., Intel CPUs) using an SPMD programming model (CUDA-like; the acronym stands for "single program" rather than "single instruction").
render_c('simple-ispc.ispc')
export void simple_ispc(uniform double vin[], uniform double vout[],
                        uniform int count) {
  // foreach distributes iterations across the program instances (SIMD lanes);
  // uniform values are shared by all instances.
  foreach (index = 0 ... count) {
    double v = vin[index];
    if (v < 3.)
      v = v * v;
    else
      v = sqrt(v);
    vout[index] = v;
  }
}
This function is callable from native C code.
render_c('simple.c')
#include <stdio.h>
#include <math.h>

void simple_ispc(double vin[], double vout[], int count);

void simple_c(double vin[], double vout[], int count) {
  for (int index=0; index<count; index++) {
    double v = vin[index];
    if (v < 3.)
      v = v * v;
    else
      v = sqrt(v);
    vout[index] = v;
  }
}

int main() {
  double vin[16], vout[16];
  for (int i = 0; i < 16; ++i)
    vin[i] = i;
  simple_ispc(vin, vout, 16);
  for (int i = 0; i < 16; ++i)
    printf("%d: simple_ispc(%f) = %f\n", i, vin[i], vout[i]);
  simple_c(vin, vout, 16);
  for (int i = 0; i < 16; ++i)
    printf("%d: simple_c(%f) = %f\n", i, vin[i], vout[i]);
  return 0;
}
! make -B simple && ./simple
cc -O3 -march=native -c -o simple.o simple.c
ispc -O3 --target=avx2-i32x8 simple-ispc.ispc -o simple-ispc.o
cc simple.o simple-ispc.o -lm -o simple
0: simple_ispc(0.000000) = 0.000000
1: simple_ispc(1.000000) = 1.000000
2: simple_ispc(2.000000) = 4.000000
3: simple_ispc(3.000000) = 1.732051
4: simple_ispc(4.000000) = 2.000000
5: simple_ispc(5.000000) = 2.236068
6: simple_ispc(6.000000) = 2.449490
7: simple_ispc(7.000000) = 2.645751
8: simple_ispc(8.000000) = 2.828427
9: simple_ispc(9.000000) = 3.000000
10: simple_ispc(10.000000) = 3.162278
11: simple_ispc(11.000000) = 3.316625
12: simple_ispc(12.000000) = 3.464102
13: simple_ispc(13.000000) = 3.605551
14: simple_ispc(14.000000) = 3.741657
15: simple_ispc(15.000000) = 3.872983
0: simple_c(0.000000) = 0.000000
1: simple_c(1.000000) = 1.000000
2: simple_c(2.000000) = 4.000000
3: simple_c(3.000000) = 1.732051
4: simple_c(4.000000) = 2.000000
5: simple_c(5.000000) = 2.236068
6: simple_c(6.000000) = 2.449490
7: simple_c(7.000000) = 2.645751
8: simple_c(8.000000) = 2.828427
9: simple_c(9.000000) = 3.000000
10: simple_c(10.000000) = 3.162278
11: simple_c(11.000000) = 3.316625
12: simple_c(12.000000) = 3.464102
13: simple_c(13.000000) = 3.605551
14: simple_c(14.000000) = 3.741657
15: simple_c(15.000000) = 3.872983
! objdump -d --prefix-addresses -M intel simple | grep sqrt
0000000000001050 <sqrt@plt> jmp QWORD PTR [rip+0x2fd2] # 0000000000004028 <sqrt@GLIBC_2.2.5>
0000000000001056 <sqrt@plt+0x6> push 0x2
000000000000105b <sqrt@plt+0xb> jmp 0000000000001020 <.plt>
00000000000012ec <simple_c+0x4c> vsqrtsd xmm1,xmm0,xmm0
0000000000001302 <simple_c+0x62> call 0000000000001050 <sqrt@plt>
000000000000142d <simple_ispc___un_3C_und_3E_un_3C_und_3E_uni+0xdd> vsqrtpd ymm1,ymm4
0000000000001431 <simple_ispc___un_3C_und_3E_un_3C_und_3E_uni+0xe1> vsqrtpd ymm7,ymm5
000000000000156e <simple_ispc___un_3C_und_3E_un_3C_und_3E_uni+0x21e> vsqrtpd ymm2,ymm6
0000000000001577 <simple_ispc___un_3C_und_3E_un_3C_und_3E_uni+0x227> vsqrtpd ymm3,ymm7
000000000000168d <simple_ispc+0xdd> vsqrtpd ymm1,ymm4
0000000000001691 <simple_ispc+0xe1> vsqrtpd ymm7,ymm5
00000000000017ce <simple_ispc+0x21e> vsqrtpd ymm2,ymm6
00000000000017d7 <simple_ispc+0x227> vsqrtpd ymm3,ymm7
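Note the difference in the disassembly: the plain C loop uses the scalar vsqrtsd instruction plus a call to sqrt@plt, while the ISPC-generated code uses the packed vsqrtpd instruction on 256-bit ymm registers. The branchy loop has been vectorized, square root included.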
ISPC is a good option for code with cross-lane dependencies or vector lane divergence (branches that affect some lanes differently than others). Writing such code with intrinsics is laborious, and compilers often do a poor job of inferring good vectorization strategies (despite #pragma omp simd and the like; see the sketch below). An example of successful use of ISPC is Intel's Embree ray tracing engine.
(As with most vendor-reported performance numbers, Embree's benchmarks should probably be taken with a grain of salt, but they do indicate that CPUs remain highly competitive for ray tracing.)
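For comparison, here is a minimal, untested sketch of the directive-based approach applied to the same loop as in simple.c. The function name simple_omp_simd is illustrative, and whether the compiler vectorizes the divergent branch as effectively as ISPC does is not guaranteed.

#include <math.h>

// Illustrative variant of simple_c with an OpenMP SIMD annotation.
// The pragma asserts that iterations are independent and requests
// vectorization; it does not dictate how the branch is handled.
void simple_omp_simd(double vin[], double vout[], int count) {
  #pragma omp simd
  for (int index = 0; index < count; index++) {
    double v = vin[index];
    if (v < 3.)
      v = v * v;
    else
      v = sqrt(v);
    vout[index] = v;
  }
}

Compiling with -fopenmp-simd (GCC/Clang) honors the pragma without linking the OpenMP runtime.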
OpenMP target offload and OpenACC
CUDA code is relatively hard to maintain, and its logic and tuning are spread out between the kernel launch and the device code. OpenMP target offload and OpenACC attempt to provide a friendlier story for maintenance and for incremental migration of legacy code.
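As a hedged sketch of what incremental migration can look like, the loop from simple.c could be offloaded by adding a single combined directive. The function name simple_omp_target and the map clauses are illustrative; real codes need tuning of data movement and launch parameters.

#include <math.h>

// Illustrative offloaded variant of simple_c: the loop body is unchanged;
// only the directive and the data mapping are added.
void simple_omp_target(double vin[], double vout[], int count) {
  #pragma omp target teams distribute parallel for simd \
      map(to: vin[0:count]) map(from: vout[0:count])
  for (int index = 0; index < count; index++) {
    double v = vin[index];
    if (v < 3.)
      v = v * v;
    else
      v = sqrt(v);
    vout[index] = v;
  }
}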
Terminology
CUDA Concept | CUDA keyword | OpenACC | OpenMP target |
---|---|---|---|
Thread block | blockIdx | gang | teams |
Warp | (implicit) | worker | thread |
Thread | threadIdx | vector | simd |
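To make the correspondence concrete, here is a hedged sketch of an axpy-style loop written with each model's spelling of the parallelism levels. The function names are illustrative, and the compiler chooses the actual gang/team and vector sizes unless told otherwise.

// OpenACC: gang / worker / vector are the three levels in the table.
void axpy_acc(int n, double a, const double *x, double *y) {
  #pragma acc parallel loop gang worker vector copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}

// OpenMP target: teams / parallel for / simd play the analogous roles.
void axpy_omp(int n, double a, const double *x, double *y) {
  #pragma omp target teams distribute parallel for simd \
      map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}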
Incremental porting with unified memory
Example
OpenACC example from a Lattice-Boltzmann miniapp
void LBM::stream(Real* const __restrict a_f,
                 const Real* const __restrict a_f_post,
                 const int* a_loStr,
                 const int* a_hiStr,
                 const int* a_loAll,
                 const int* a_hiAll,
                 const int a_numPts) const
{
  const int* const __restrict latI = &m_lattice[0][0];
  const int* const __restrict latJ = &m_lattice[1][0];
  const int* const __restrict latK = &m_lattice[2][0];
  const int
    klo = a_loStr[2], khi = a_hiStr[2],
    jlo = a_loStr[1], jhi = a_hiStr[1],
    ilo = a_loStr[0], ihi = a_hiStr[0];
#pragma acc parallel loop independent collapse(3) \
  copyin(a_loAll[SPACEDIM],a_hiAll[SPACEDIM],a_f_post[a_numPts*m_numVels]) \
  copyout(a_f[a_numPts*m_numVels]) vector_length(256)
  for (int k = klo; k <= khi; ++k) {
    for (int j = jlo; j <= jhi; ++j) {
      for (int i = ilo; i <= ihi; ++i) {
#pragma acc loop seq independent
        for (int m = 0; m < NUMV; ++m) {
          const long int offset = m * a_numPts;
          const long int index0 = INDEX(i, j, k, a_loAll, a_hiAll);
          const long int index2 = INDEX(i - latI[m], j - latJ[m], k - latK[m], a_loAll, a_hiAll);
          a_f[index0 + offset] = a_f_post[index2 + offset]; // new f comes from upwind
        }
      }
    }
  }
}
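Using the terminology table above, here is a rough, untested sketch of how the same directives might be spelled with OpenMP target offload. It reuses the miniapp's identifiers (Real, INDEX, NUMV, SPACEDIM, and the class members), which are not defined in this excerpt.

  // Hedged OpenMP translation of the OpenACC directives above,
  // following the gang -> teams and vector -> simd correspondence.
  #pragma omp target teams distribute parallel for collapse(3) \
      map(to: a_loAll[0:SPACEDIM], a_hiAll[0:SPACEDIM], \
              a_f_post[0:a_numPts*m_numVels]) \
      map(from: a_f[0:a_numPts*m_numVels])
  for (int k = klo; k <= khi; ++k) {
    for (int j = jlo; j <= jhi; ++j) {
      for (int i = ilo; i <= ihi; ++i) {
        // The innermost velocity loop stays sequential within each thread.
        for (int m = 0; m < NUMV; ++m) {
          const long int offset = m * a_numPts;
          const long int index0 = INDEX(i, j, k, a_loAll, a_hiAll);
          const long int index2 = INDEX(i - latI[m], j - latJ[m], k - latK[m],
                                        a_loAll, a_hiAll);
          a_f[index0 + offset] = a_f_post[index2 + offset];
        }
      }
    }
  }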