Dense Linear Algebra and Networks

Inner products

$$ x^T y = \sum_{i=1}^N x_i y_i $$

OpenMP

The vectors x and y of length N are stored in a contiguous array in shared memory.

double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i=0; i<N; i++)
    sum += x[i] * y[i];

MPI

The vectors x and y are partitioned into $P$ parts of length $n_p$ such that $$ N = \sum_{p=1}^P n_p . $$ The inner product is computed via

double sum = 0;
for (int i=0; i<n; i++)
    sum += x[i] * y[i];
MPI_Allreduce(MPI_IN_PLACE, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);
  • Work: $2N$ flops, processed at rate $R$ (flops per second per process)
  • Execution time: $\frac{2N}{RP} + \text{latency}$
  • How big is latency?

    %matplotlib inline
    import matplotlib.pyplot as plt
    plt.style.use('seaborn')
    import numpy as np
    
    P = np.geomspace(2, 1e6)
    N = 1e9       # length of vectors
    R = 10e9/8    # (10 GB/s per core) (2 flops/16 bytes) = 10/8 GF/s per core
    t1 = 2e-6     # 2 µs message latency
    
    def time_compute(P):
        return 2*N / (R*P)
    
    plt.loglog(P, time_compute(P) + t1*(P-1), label='linear')
    plt.loglog(P, time_compute(P) + t1*2*(np.sqrt(P)-1), label='2D mesh')
    plt.loglog(P, time_compute(P) + t1*np.log2(P), label='hypercube')
    plt.xlabel('P processors')
    plt.ylabel('Execution time')
    plt.legend();
    

[Figure: log-log plot of execution time vs. number of processes P for the linear, 2D mesh, and hypercube networks]

Torus topology

  • 3D torus: IBM BlueGene/L (2004) and BlueGene/P (2007)
  • 5D torus: IBM BlueGene/Q (2011)
  • 6D torus: Fujitsu K computer (2011)
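
As a rough worked example (the numbers are illustrative, not from the original): in the latency model plotted above, the dominant term scales with the network diameter, and a $d$-dimensional torus with $P$ nodes has diameter about $\frac{d}{2} P^{1/d}$ hops (the wrap-around links halve the diameter of the corresponding mesh). For $P = 10^6$ that is roughly 150 hops on a 3D torus, about 40 on a 5D torus, and 30 on a 6D torus, versus $\log_2 P = 20$ for a hypercube, which is one motivation for moving to higher-dimensional tori.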

Dragonfly topology

Today’s research: reducing contention and interference

Images from this article.

Compare to BG/Q

  • Each job gets an electrically isolated 5D torus
  • Excellent performance and reproducibility
  • Awkward constraints on job size and lower system utilization.

Outer product

$$ C_{ij} = x_i y_j $$

  • Data in: $2N$
  • Data out: $N^2$
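
A minimal shared-memory sketch in the style of the OpenMP inner product above; the names x, y, and C and the row-major layout of the N-by-N output are assumptions for illustration:

// Outer product: reads 2N doubles, writes N^2 doubles.
#pragma omp parallel for
for (int i=0; i<N; i++)
    for (int j=0; j<N; j++)
        C[(size_t)i*N + j] = x[i] * y[j];

Every entry of $C$ is written exactly once, so the cost is dominated by the $N^2$ stores rather than by the flops.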

Matrix-vector products

$$ y_i = \sum_{j} A_{ij} x_j $$

How to partition the matrix $A$ across $P$ processors?

1D row partition

  • Every process needs the entire vector $x$: MPI_Allgather
  • Matrix data does not move
  • Execution time $$ \underbrace{\frac{2N^2}{RP}}_{\text{compute}} + \underbrace{t_1 \log_2 P}_{\text{latency}} + \underbrace{t_b N \frac{P-1}{P}}_{\text{bandwidth}} $$

Thanks to Abtin Rahimian
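
A sketch of the 1D row-partitioned product, assuming equal parts $n = N/P$ (unequal parts would need MPI_Allgatherv); the names Alocal, xlocal, ylocal, and xfull are illustrative, with Alocal holding this rank's $n$ rows in row-major order:

// Assemble the full input vector: every rank contributes its n entries of x.
double *xfull = malloc(N * sizeof(double));
MPI_Allgather(xlocal, n, MPI_DOUBLE, xfull, n, MPI_DOUBLE, comm);
// Local product on the owned rows; the matrix entries never move.
for (int i=0; i<n; i++) {
    double sum = 0;
    for (int j=0; j<N; j++)
        sum += Alocal[i*N + j] * xfull[j];
    ylocal[i] = sum;
}
free(xfull);

The MPI_Allgather accounts for the $t_1 \log_2 P$ latency and $t_b N \frac{P-1}{P}$ bandwidth terms above.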

2D partition

  • Square blocks of size $(N/\sqrt{P}) \times (N/\sqrt{P})$
  • “diagonal” ranks hold the input vector
  • Broadcast $x$ along columns: MPI_Bcast
  • Perform local compute
  • Sum $y$ along rows: MPI_Reduce with roots on the diagonal
  • Execution time $$ \underbrace{\frac{2N^2}{RP}}_{\text{compute}} + \underbrace{2 t_1 \log_2 P}_{\text{latency}} + \underbrace{\frac{2 t_b N}{\sqrt{P}}}_{\text{bandwidth}} $$

Thanks to Abtin Rahimian
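
A sketch of the 2D algorithm, assuming $P$ is a perfect square so the ranks form a $q \times q$ grid with $q = \sqrt{P}$ and each rank owns a $b \times b$ block of $A$ with $b = N/q$; the names (prow, pcol, rowcomm, colcomm, Alocal, xblock, yblock) are illustrative:

int rank, P;
MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &P);
int q = (int)(sqrt((double)P) + 0.5);        // process grid is q x q
int prow = rank / q, pcol = rank % q;
MPI_Comm rowcomm, colcomm;
MPI_Comm_split(comm, prow, pcol, &rowcomm);  // ranks sharing a block row
MPI_Comm_split(comm, pcol, prow, &colcomm);  // ranks sharing a block column

// Broadcast x down each column from the diagonal rank (prow == pcol).
MPI_Bcast(xblock, b, MPI_DOUBLE, pcol, colcomm);

// Local block product: yblock = Alocal * xblock.
for (int i=0; i<b; i++) {
    double sum = 0;
    for (int j=0; j<b; j++)
        sum += Alocal[i*b + j] * xblock[j];
    yblock[i] = sum;
}

// Sum the partial results across each row, with the root back on the diagonal.
if (prow == pcol)
    MPI_Reduce(MPI_IN_PLACE, yblock, b, MPI_DOUBLE, MPI_SUM, prow, rowcomm);
else
    MPI_Reduce(yblock, NULL, b, MPI_DOUBLE, MPI_SUM, prow, rowcomm);

MPI_Comm_free(&rowcomm);
MPI_Comm_free(&colcomm);

Each message now carries only $N/\sqrt{P}$ entries, giving the $\frac{2 t_b N}{\sqrt{P}}$ bandwidth term, while the broadcast and the reduction each contribute a $t_1 \log_2 P$ latency term.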

    N = 1e4
    tb = 8 / 1e9  # 8 bytes / (1 GB/s): time to move one double at 1 GB/s per core
    tb *= 100
    
    plt.loglog(P, (2*N**2)/(R*P) + t1*np.log2(P) + tb*N*(P-1)/P, label='1D distribution')
    plt.loglog(P, (2*N**2)/(R*P) + 2*t1*np.log2(P) + 2*tb*N/np.sqrt(P), label='2D distribution')
    plt.xlabel('P processors')
    plt.ylabel('Execution time')
    plt.legend();

[Figure: log-log plot of execution time vs. number of processes P for the 1D and 2D matrix distributions]
