Programs with more than one part
So far, we’ve focused on simple programs with only one part, but real programs have several different parts, often with data dependencies.
Some parts will be amenable to optimization and/or parallelism and others will not.
The observation that the serial (unoptimizable) fraction limits the overall benefit is known as Amdahl’s Law.
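The execution-time model implemented in the next cell makes this concrete; here $f$ is the fraction of work amenable to optimization/parallelization, $n$ the problem size, $p$ the parallelization factor, and latency a fixed startup cost:
$$ T(p) = \text{latency} + (1-f)\,n + \frac{f\,n}{p} $$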
def exec_time(f, p, n=10, latency=1):
    # Suppose that a fraction f of the total work is amenable to optimization
    # We run a problem size n with parallelization factor p
    return latency + (1-f)*n + f*n/p
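As a quick sanity check (hypothetical spot checks, not part of the original notebook), evaluating the model at a few values of $p$ shows the diminishing returns that the plots below visualize:
# Hypothetical spot checks of the model above with the defaults n=10, latency=1
for p in [1, 10, 100, 1000]:
    print(p, exec_time(0.99, p))
# The time approaches latency + (1-f)*n = 1.1 rather than zero, because the
# serial fraction and the latency are not reduced by adding processes.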
%matplotlib inline
import matplotlib.pyplot as plt
import pandas
import numpy as np
plt.style.use('seaborn')
ps = np.geomspace(1, 1000)
plt.loglog(ps, exec_time(.99, ps, latency=0))
plt.loglog(ps, exec_time(1, ps, latency=0))
plt.title('Strong scaling')
plt.xlabel('p')
plt.ylabel('time');
Strong scaling: fixed total problem size
Cost = time * p
def exec_cost(f, p, **kwargs):
    return exec_time(f, p, **kwargs) * p
plt.loglog(ps, exec_cost(.99, ps))
plt.title('Strong scaling')
plt.xlabel('p')
plt.ylabel('cost');
Efficiency
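For reference (a standard definition, not stated explicitly in the original), parallel efficiency is usually normalized as
$$ E(p) = \frac{T(1)}{p\,T(p)} = \frac{T(1)}{\text{Cost}(p)} $$
The plot below uses the unnormalized quantity $1/\text{Cost}(p)$, which differs only by the constant factor $T(1)$.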
plt.semilogx(ps, 1/exec_cost(.99, ps, latency=1))
plt.title('Strong scaling')
plt.xlabel('p')
plt.ylabel('efficiency')
plt.ylim(bottom=0);
Speedup
$$ S(p) = \frac{T(1)}{T(p)} $$
plt.plot(ps, exec_time(.99, 1, latency=1) / exec_time(.99, ps, latency=1))
plt.title('Strong scaling')
plt.xlabel('p')
plt.ylabel('speedup')
plt.ylim(bottom=0);
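A standard consequence of Amdahl’s Law (not spelled out in the original) follows directly from the model with latency $= 0$:
$$ S(p) = \frac{T(1)}{T(p)} = \frac{1}{(1-f) + f/p} \xrightarrow{\;p \to \infty\;} \frac{1}{1-f} $$
so for $f = 0.99$ the speedup can never exceed 100, no matter how many processes are used.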
Stunt 1: Report speedup, not absolute performance!
Efficiency-Time spectrum (my preference)
People care about two observable properties:
- Time until job completes
- Cost in core-hours or dollars to do job
Most HPC applications have access to large machines, so users don’t really care how many processes they use for any given job.
plt.plot(exec_time(.99, ps), 1/exec_cost(.99, ps), 'o-')
plt.title('Strong scaling')
plt.xlabel('time')
plt.ylabel('efficiency')
plt.ylim(bottom=0);
plt.xlim(left=0);
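One way to read this curve (a hypothetical addition, not in the original) is to label a few points with the process count $p$ that produced them; run this after the plotting cell above so the annotations land on the same axes.
# Hypothetical: annotate selected points on the efficiency-time curve with p
for p in [1, 10, 100, 1000]:
    t = exec_time(.99, p)
    e = 1/exec_cost(.99, p)
    plt.annotate(f'p={p}', (t, e))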
Principles
- No “soft” log scale
- Both axes have tangible units
- Bigger is better on the $y$ axis
Weak Scaling: Fixed work per processor
We’ve kept the problem size $n$ fixed thus far, but parallel computers are also used to solve large problems. If we instead keep the amount of work per processor fixed, we have weak (Gustafson) scaling.
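To see what the model predicts under weak scaling (a derivation from the exec_time model above, not stated explicitly in the source), substitute $n = w\,p$ for a fixed amount of work $w$ per process:
$$ T(p) = \text{latency} + (1-f)\,w\,p + f\,w $$
The serial term still grows linearly with $p$, so efficiency degrades even at fixed work per process.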
ns = 10*ps
plt.semilogx(ps, ns/exec_cost(.99, ps, n=ns, latency=1), 'o-')
ns = 100*ps
plt.semilogx(ps, ns/exec_cost(.99, ps, n=ns, latency=1), 's-')
plt.title('Weak scaling')
plt.xlabel('procs')
plt.ylabel('efficiency')
plt.ylim(bottom=0);
for w in np.geomspace(0.1, 1e3, 20):
    ns = w*ps
    plt.semilogx(exec_time(.99, ps, n=ns, latency=1),
                 ns/exec_cost(.99, ps, n=ns, latency=1), 'o-')
plt.title('Weak scaling')
plt.xlabel('time')
plt.ylabel('efficiency')
plt.ylim(bottom=0);
Fuhrer et al (2018): Near-global climate simulation at 1 km resolution
I replotted these data for my talk at the Latsis Symposium last month.
Further resources
- Hager: Fooling the masses
  - Learn by counter-examples
- Hoefler and Belli: Scientific Benchmarking of Parallel Computing Systems
  - Recommended best practices, especially for dealing with performance variability
Please read/watch something from this list and be prepared to share on Monday.