Intro to Architecture

Cores, caches, and memory

A von Neumann Architecture

A contemporary architecture

Core 2

My laptop

We can get this kind of information for our machine using hwloc, which provides a library as well as the command-line tool lstopo.

!lstopo --output-format svg > lstopo-local.svg

lstopo

A double-socket compute node with two GPUs

2x Xeon Ivy-Bridge-EP E5-2680v2 + 2x NVIDIA GPUs (from 2013, with hwloc v1.11). GPUs are reported as CUDA devices and X11 display :1.0: (from the hwloc gallery)

Block diagrams

A block diagram from a vendor can include additional information about how cores are physically connected.

Ring bus (Xeon E5-2600 family)

Intel Xeon E5-2600

Mesh bus (Xeon Scalable family)

Intel Xeon Scalable

Multi-socket configurations

4-socket ring

https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview

Multiple nodes go into racks or cabinets

Blue Gene/P Racks

OLCF Summit

Terminology

Core (virtual and physical): has a single program counter (logically sequential processing of instructions)
Memory channel: e.g., DDR4-2400: transfers 64 bits (8 bytes) at a rate of 2400 MHz = 15.36 GB/s
Socket or CPU: contains multiple cores in a single piece* of silicon
Non-Uniform Memory Access (NUMA): different channels may be different distances from a core
Compute node: one or more sockets, plus memory, network card, etc.

How expensive is it to access memory?

What does that mean? How would we measure?

McKenney (2013): Laws of Physics

Interactive

Variation by vendor

How your program accesses memory

double a[1000];

void foo() {
    for (int i=0; i<1000; i++)
        a[i] = 1.234 * i;
}

The compiler turns the loop body into instructions, which we can examine using Godbolt.

pxor xmm0, xmm0                  ; zero the xmm0 register
cvtsi2sd xmm0, eax               ; convert the integer i to double
mulsd xmm0, xmm1                 ; multiply by 1.234 (held in xmm1)
movsd QWORD PTR a[0+rax*8], xmm0 ; store to memory address a[i]

Only one instruction here accesses memory, and the performance will be affected greatly by where that memory resides (which level of cache, where in DRAM).

Most architectures today have 64-byte cache lines: all transfers from main memory (DRAM) to and from cache operate in units of 64 bytes.

Let’s compare three code samples

for (int i=0; i<N; i++)
    a[i] = b[i];

for (int i=0; i<N; i++)
    a[i] = b[(i*8) % N];

for (int i=0; i<N; i++)
    a[i] = b[random() % N];

What happens when you request a cache line?

Operating system effects

Most systems today use virtual addressing, so every address in your program needs to be translated to a physical address before looking for it (in cache or memory). Fortunately, there is hardware to assist with this: the Translation Lookaside Buffer (TLB).

Virtual memory and the page table

Further resources

Last updated on Aug 30, 2019

Edit this page