Introduction to Performance Modeling

Why model performance?

Models give us a conceptual and roughly quantitative framework for answering the following types of questions.

  • Why is an implementation exhibiting its observed performance?
  • How will performance change if we:
    • optimize this component?
    • buy new hardware? (Which new hardware?)
    • run a different configuration?
  • While conceptualizing a new algorithm, what performance can we expect, and what will the bottlenecks be?

Models are a guide for performance, but not an absolute.

Terms

| Symbol | Meaning |
|--------|---------|
| $n$ | Input parameter related to problem size |
| $W$ | Amount of work to solve problem $n$ |
| $T$ | Execution time |
| $R$ | Rate at which work is done |
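
These quantities are related by $R = W/T$, so estimating execution time comes down to estimating how much work an algorithm does and the rate at which the hardware can do it.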

STREAM Triad

for (int i=0; i<n; i++)
    a[i] = b[i] + scalar*c[i];

$n$ is the array size and $$W = 3 \cdot \texttt{sizeof(double)} \cdot n$$ is the number of bytes transferred. The rate $R = W/T$ is measured in bytes per second (or MB/s, etc.).
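
To connect the model to a measurement, here is a minimal sketch using numpy (the value of $n$ and the timing approach are illustrative, and numpy materializes temporaries for the right-hand side, so real traffic is somewhat higher than the $3 \cdot \texttt{sizeof(double)} \cdot n$ bytes the model counts):

import time
import numpy as np

n = 50_000_000               # large enough that the arrays far exceed cache
scalar = 3.0
b = np.random.rand(n)
c = np.random.rand(n)
a = np.empty_like(b)

t0 = time.perf_counter()
a[:] = b + scalar * c        # the triad
T = time.perf_counter() - t0

W = 3 * 8 * n                # model: three arrays of 8-byte doubles
print(f"R = {W / T / 1e9:.2f} GB/s")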

Dense matrix multiplication

Consider the operation $C \gets C + A B$, where $A,B,C$ are $n\times n$ matrices:

for (int i=0; i<n; i++)
    for (int j=0; j<n; j++)
        for (int k=0; k<n; k++)
            c[i*n+j] += a[i*n+k] * b[k*n+j];
  • Can you identify two expressions for the total amount of work $W(n)$ and the associated units?
  • Can you think of a context in which one is better than the other and vice-versa?

Estimating time

To estimate time, we need to know how fast hardware executes flops and moves bytes.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas
import numpy as np
plt.style.use('seaborn')

hardware = pandas.read_csv('data-intel.csv', index_col="Name")
hardware
| Name | Year | GFLOPs-SP | GFLOPs-DP | Cores | Mem-GBps | TDP (W) | Freq (MHz) |
|------|------|-----------|-----------|-------|----------|---------|------------|
| Xeon X5482 | 2007 | 102 | 51 | 4 | 26 | 150 | 3200 |
| Xeon X5492 | 2008 | 108 | 54 | 4 | 26 | 150 | 3400 |
| Xeon W5590 | 2009 | 106 | 53 | 4 | 32 | 130 | 3300 |
| Xeon X5680 | 2010 | 160 | 80 | 6 | 32 | 130 | 3300 |
| Xeon X5690 | 2011 | 166 | 83 | 6 | 32 | 130 | 3470 |
| Xeon E5-2690 | 2012 | 372 | 186 | 8 | 51 | 135 | 2900 |
| Xeon E5-2697 v2 | 2013 | 518 | 259 | 12 | 60 | 130 | 2700 |
| Xeon E5-2699 v3 | 2014 | 1324 | 662 | 18 | 68 | 145 | 2300 |
| Xeon E5-2699 v3 | 2015 | 1324 | 662 | 18 | 68 | 145 | 2300 |
| Xeon E5-2699 v4 | 2016 | 1548 | 774 | 22 | 77 | 145 | 2200 |
| Xeon Platinum 8180 | 2017 | 4480 | 2240 | 28 | 120 | 205 | 2500 |
| Xeon Platinum 9282 | 2018 | 9320 | 4660 | 56 | 175 | 400 | 2600 |
ax = hardware.plot(x='GFLOPs-DP', y='Mem-GBps', marker='o')
ax.set_xlim(left=0)
ax.set_ylim(bottom=0);

[Figure: memory bandwidth (Mem-GBps) versus double-precision flop rate (GFLOPs-DP) for the processors in the table]

So for the newest processor in the table, the Xeon Platinum 9282, we have rates $R_f = 4660 \cdot 10^9$ flops/second and $R_m = 175 \cdot 10^9$ bytes/second. Now we need to characterize some algorithms.
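
A useful yardstick when we do is the machine balance, the intensity at which the compute and memory limits coincide:

$$B = \frac{R_f}{R_m} = \frac{4660 \cdot 10^9}{175 \cdot 10^9} \approx 26.6 \text{ flops/byte}.$$

Algorithms that perform fewer than $B$ flops per byte are limited by memory bandwidth on this machine; the rest are limited by flops.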

algs = pandas.read_csv('algs.csv', index_col='Name')
algs['intensity'] = algs['flops'] / algs['bytes']
algs = algs.sort_values('intensity')
algs
| Name | bytes | flops | intensity |
|------|-------|-------|-----------|
| Triad | 24 | 2 | 0.083333 |
| SpMV | 12 | 2 | 0.166667 |
| Stencil27-cache | 24 | 54 | 2.250000 |
| MatFree-FEM | 2376 | 15228 | 6.409091 |
| Stencil27-ideal | 8 | 54 | 6.750000 |
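
The exec_time model below charges memory traffic and flops independently and takes whichever is slower,

$$T = \max\left(\frac{W_{\text{bytes}}}{R_m},\ \frac{W_{\text{flops}}}{R_f}\right),$$

which is exactly the time estimate underlying the roofline picture we arrive at below.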
def exec_time(machine, alg, n):
    bytes = n * alg.bytes    # total memory traffic (bytes)
    flops = n * alg.flops    # total floating point operations
    T_mem = bytes / (machine['Mem-GBps'] * 1e9)     # time if limited by bandwidth
    T_flops = flops / (machine['GFLOPs-DP'] * 1e9)  # time if limited by compute
    return max(T_mem, T_flops)  # the slower resource determines the time
    
exec_time(hardware.loc['Xeon Platinum 9282'], algs.loc['SpMV'], 1e8)
0.006857142857142857
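
We can check this by hand: SpMV moves $12$ bytes per unit of $n$, so $n = 10^8$ gives $12 \cdot 10^8$ bytes and $T_{\text{mem}} = 12 \cdot 10^8 / (175 \cdot 10^9) \approx 6.9$ ms, while the flop time $2 \cdot 10^8 / (4660 \cdot 10^9) \approx 43$ microseconds is negligible, so memory traffic determines the result.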
for _, machine in hardware.iterrows():
    for _, alg in algs.iterrows():
        ns = np.geomspace(1e4, 1e9, 10)
        times = np.array([exec_time(machine, alg, n) for n in ns])
        flops = np.array([alg.flops * n for n in ns])
        rates = flops/times
        plt.loglog(ns, rates, 'o-')
plt.xlabel('n')
plt.ylabel('rate');

[Figure: modeled rate versus problem size $n$ for each machine/algorithm pair]

It looks like performance does not depend on problem size.

Well, yeah, we chose a model in which flops and bytes were both proportional to $n$, and our machine model has no sense of cache hierarchy or latency, so time is also proportional to $n$. We can divide through by $n$ to get a more illuminating plot.

for _, machine in hardware.iterrows():
    times = np.array([exec_time(machine, alg, 1) 
                      for _, alg in algs.iterrows()])
    rates = algs.flops/times
    intensities = algs.intensity
    plt.loglog(intensities, rates, 'o-', label=machine.name)
plt.xlabel('intensity')
plt.ylabel('rate')
plt.legend();

[Figure: modeled rate versus arithmetic intensity, one curve per machine]

We see the characteristic “roofline” shape for the older processors, while the newer models are memory-bandwidth limited for all of these algorithms.
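
In these intensity–rate coordinates the model traces out

$$R(I) = \min\left(R_f,\ I \cdot R_m\right),$$

so each curve climbs in proportion to intensity with slope $R_m$ until it reaches the flat compute ceiling $R_f$ at the machine balance $I = R_f/R_m$. For the Xeon Platinum 9282 that crossover sits near $26.6$ flops/byte, above the highest intensity in our table ($6.75$ for Stencil27-ideal), which is why its curve never flattens here.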
