IEEE-style research project: Design, implementation and experimental evaluation of cache-aware matrix multiplication in C, targeting the x86-64 architecture with AVX2/FMA SIMD vectorization.
Authors: Dila Öykü Eyüboğlu · Burcu Kösedağı
Affiliation: Computer Engineering, Istanbul Commerce University
Platform: Intel Core i5-10210U · Linux WSL2 · GCC 13.3 · AVX2 + FMA
- 40.25× speedup over the naive baseline
- 455× reduction in LLC cache misses
- D1 miss rate reduced from 98.4% to 3.9%
- Fully verified with Valgrind & Cachegrind
| Matrix Size | Naïve Time | Optimized Time | SIMD Time | Speedup (Opt) | Speedup (SIMD) |
|---|---|---|---|---|---|
| N = 256 | 0.027 s | 0.005 s | 0.008 s | 5.15× | 3.33× |
| N = 512 | 0.266 s | 0.054 s | 0.078 s | 4.94× | 3.40× |
| N = 1024 | 8.007 s | 0.434 s | 0.591 s | 18.46× | 13.54× |
| N = 2048 | 142.5 s | 3.54 s | 5.18 s | 40.25× | 27.51× |
- Overview
- Implementations
- Project Structure
- Requirements
- Build & Run
- Output Format
- Cache Profiling
- Performance Counters
- Correctness Verification
- Optimization Techniques
- Architecture Notes
- Reproducibility
Matrix multiplication (C = A × B) is a foundational computational kernel in machine learning, scientific simulation, and graphics rendering. Despite its O(N³) arithmetic complexity, real-world performance is dictated by memory access patterns, not raw floating-point throughput.
This project demonstrates the memory wall effect quantitatively:
- Naïve (N=2048): 0.12 GFLOPS, a 90% degradation from N=256
- Optimized (N=2048): 4.85 GFLOPS, stable across all sizes
- SIMD (N=2048): 3.32 GFLOPS, stable across all sizes
The naïve implementation accesses matrix B column-wise (stride-N), causing virtually every access to miss the L1 cache. Loop interchange, tiling, and vectorization eliminate this bottleneck.
Standard i → j → k loop order. B is accessed with stride N, which is cache-hostile.

```c
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i*N+j] += A[i*N+k] * B[k*N+j];  // stride-N miss on B
```

Problem: for N=1024, each B access spans 8,192 bytes between rows → an L1 miss on every iteration.
Three combined techniques:

① Loop Interchange (i → k → j)

```c
for (i = 0; i < N; i++)
    for (k = 0; k < N; k++) {
        double a_ik = A[i*N+k];           // hoisted to register
        for (j = 0; j < N; j++)
            C[i*N+j] += a_ik * B[k*N+j];  // stride-1 on B and C
    }
```

② Cache Blocking (BLOCK_SIZE = 64)
```c
#define BLOCK_SIZE 64
for (ii = 0; ii < N; ii += BLOCK_SIZE)
    for (kk = 0; kk < N; kk += BLOCK_SIZE)
        for (jj = 0; jj < N; jj += BLOCK_SIZE)
            // inner 64×64 tile fits in the 32 KB L1-D cache
```

A single 64×64 tile of doubles = 64 × 64 × 8 B = 32,768 bytes, exactly the capacity of one 32 KB L1-D cache. Three tiles (A, B, C) = 96 KB, which fits in L2 (256 KB).
③ Manual Loop Unrolling (×4)

```c
for (; j + 3 < j_max; j += 4) {
    C[i*N+j]   += a_ik * B[k*N+j];
    C[i*N+j+1] += a_ik * B[k*N+j+1];
    C[i*N+j+2] += a_ik * B[k*N+j+2];
    C[i*N+j+3] += a_ik * B[k*N+j+3];
}
for (; j < j_max; j++)                // scalar remainder
    C[i*N+j] += a_ik * B[k*N+j];
```

The SIMD variant replaces the scalar inner loop with 256-bit FMA intrinsics, processing 4 doubles per instruction:
```c
#ifdef __AVX2__
#include <immintrin.h>
__m256d va = _mm256_set1_pd(a_ik);            // broadcast scalar to 4 lanes
for (; j + 3 < j_max; j += 4) {
    __m256d vb = _mm256_loadu_pd(&B[k*N+j]);
    __m256d vc = _mm256_loadu_pd(&C[i*N+j]);
    vc = _mm256_fmadd_pd(va, vb, vc);         // vc += va * vb (4 doubles per FMA)
    _mm256_storeu_pd(&C[i*N+j], vc);
}
#endif
```

GCC-emitted assembly (verified via gcc -S):
```asm
vbroadcastsd %xmm2, %ymm1            ; broadcast a_ik to all 4 lanes
vmovupd      (%r8,%rcx,8), %ymm0     ; load 4 B values
vfmadd213pd  (%rdi), %ymm1, %ymm0    ; fused multiply-add (4 doubles)
vmovupd      %ymm0, (%rdi)           ; store 4 C values
```

```
cache_optimization_project/
│
├── include/
│   ├── matrix.h                # Matrix struct + allocation API
│   ├── multiply.h              # Function declarations (naive / opt / simd)
│   ├── timer.h                 # High-resolution timer (POSIX + Win32)
│   └── verify.h                # Element-wise correctness check
│
├── src/
│   ├── main.c                  # Benchmark driver + CSV export
│   ├── matrix.c                # create_matrix(), free_matrix(), fill_matrix()
│   ├── multiply_naive.c        # Baseline i→j→k
│   ├── multiply_opt.c          # Blocked + unrolled (i→k→j, BLOCK_SIZE=64)
│   ├── multiply_simd.c         # AVX2 FMA vectorized
│   ├── timer.c                 # clock_gettime / QueryPerformanceCounter
│   └── verify.c                # compare_matrices() with ε = 1×10⁻⁹
│
├── figures/
│   ├── speedup_plot.png        # Speedup vs. N
│   ├── gflops_plot.png         # Throughput vs. N
│   ├── runtime_plot.jpg        # Execution time (log scale)
│   ├── cache_miss_plot.jpg     # D1 & LLd miss rates (Cachegrind)
│   ├── memory_hierarchy.jpg    # x86-64 cache hierarchy diagram
│   └── system_architecture.jpg # Benchmark pipeline diagram
│
├── results/
│   ├── benchmark_results_aggregated.csv  # Median of 5 runs per config
│   ├── benchmark_results_all_runs.csv    # Raw per-run data
│   └── cachegrind_report.txt             # Valgrind cache profiling output
│
├── paper/
│   └── Cache_Optimized_Matrix_Multiplication.pdf
│
├── scripts/
│   ├── run_cachegrind.sh       # Valgrind Cachegrind profiling (Linux)
│   └── run_perf.sh             # Linux perf hardware counters
│
├── makefile
└── README.md
```
| Tool | Version | Purpose |
|---|---|---|
| GCC | ≥ 9.0 | Compilation with AVX2 (-mavx2 -mfma) |
| make | any | Build system |
| valgrind + cg_annotate | any | Cache profiling (make cachegrind) |
| perf | any | Hardware performance counters (make perf) |
```bash
sudo apt update
sudo apt install gcc make valgrind linux-tools-generic
```

Note: AVX2 support is required; the valgrind and perf targets are Linux-only.

Verify CPU support:

```bash
lscpu | grep -E "avx2|fma"
# Expected: avx2 and fma in the Flags section
```

Build:

```bash
make
```

Compiler flags: -O3 -Wall -Wextra -std=c11 -march=native -mavx2 -mfma

Scalar build (no SIMD):

```bash
make nosimd
```

Run:

```bash
make run
# or directly:
./matrix_app
```

Benchmarks N ∈ {256, 512, 1024, 2048}, 5 runs each. Results are written to benchmark_results.csv.

Inspect the generated AVX2 assembly:

```bash
gcc -O3 -mavx2 -mfma -I./include -S src/multiply_simd.c -o multiply_simd.s
grep -A5 "vfmadd" multiply_simd.s
```

Clean:

```bash
make clean
```

Sample output (N = 1024):

```
AVX2 ENABLED
Cache-Optimized Matrix Multiplication Benchmark
N = 1024
C[0][0]      naive=2048.00  opt=2048.00  simd=2048.00
Correctness  opt=PASS  simd=PASS
Time(s)      naive=8.007  opt=0.434  simd=0.591
Speedup      opt=18.46x  simd=13.54x
GFLOPS       naive=0.268  opt=4.951  simd=3.631
```
| Column | Description |
|---|---|
| N | Matrix size |
| Correct_Opt | Correctness of optimized (PASS/FAIL) |
| Correct_SIMD | Correctness of SIMD (PASS/FAIL) |
| NaiveTime | Median wall-clock time (s) |
| OptTime | Median wall-clock time (s) |
| SIMDTime | Median wall-clock time (s) |
| Speedup_Opt | T_naive / T_opt |
| Speedup_SIMD | T_naive / T_simd |
| NaiveGFLOPS | 2N³ / (NaiveTime × 10⁹) |
| OptGFLOPS | 2N³ / (OptTime × 10⁹) |
| SIMDGFLOPS | 2N³ / (SIMDTime × 10⁹) |
```bash
make cachegrind
# or manually:
valgrind --tool=cachegrind --cache-sim=yes --branch-sim=yes \
         --cachegrind-out-file=cachegrind.out ./matrix_app
cg_annotate --show=Ir,Dr,D1mr,DLmr,Dw,D1mw,DLmw \
            cachegrind.out > cachegrind_report.txt
```

| Implementation | D1 Read Misses | LLd Read Misses | D1 Miss Rate | LLd Miss Rate |
|---|---|---|---|---|
| Naïve | ≈1.08×10⁹ | ≈1.07×10⁹ | ≈98.4% | ≈93.1% |
| Optimized | ≈1.56×10⁸ | ≈2.36×10⁶ | ≈3.9% | ≈0.19% |
| SIMD | ≈1.56×10⁸ | ≈2.36×10⁶ | ≈3.8% | ≈0.18% |

455× reduction in LLd read misses · 25× reduction in D1 miss rate
```bash
make perf
# or:
bash scripts/run_perf.sh
```

Reports hardware events (real cache misses, CPI, pipeline stalls) via Linux perf stat. Output is saved to perf_report.txt.
All implementations are verified against the naïve baseline with element-wise tolerance ε = 1×10⁻⁹:

∀ i, j : |C_opt[i][j] − C_naive[i][j]| ≤ 1×10⁻⁹

Test case: A[i][j] = 1.0, B[i][j] = 2.0 → expected C[i][j] = 2N for all i, j.
All three implementations pass for all N across all 5 runs.
Memory safety: verified with valgrind --tool=memcheck, which reports zero leaks and zero invalid accesses across all configurations.
In the i → j → k loop order, B[k][j] is accessed column-wise:
- For N=1024: stride = 1024 × 8 bytes = 8,192 bytes between consecutive accesses
- The 32 KB L1 cache holds only 512 64-byte lines, and the stride-N walk uses one double per line, so a full column of B (1024 lines) cannot stay resident → no reuse of B in L1
- Result: D1 miss rate ≈ 98.4%; the CPU stalls on every inner-loop iteration
| Technique | Effect |
|---|---|
| Loop interchange (i→k→j) | All three matrices traversed stride-1; a_ik hoisted to a register |
| Cache blocking (64×64) | Working set (3 tiles = 96 KB) fits in L2; misses drop from O(N³) to O(N²/B) |
| Loop unrolling (×4) | 4 independent MACs per iteration; improves instruction-level parallelism |
| AVX2 FMA | 4 doubles per instruction; fused multiply-add halves the arithmetic instruction count |
```
CPU Registers (YMM)           ~1 cycle        ← a_ik hoisted here
L1-D Cache    32 KB           4–5 cycles      ← one 64×64 tile (32 KB) fits here
L2 Cache      256 KB          ~12 cycles      ← three tiles (96 KB) fit here
L3 Cache      6 MiB           30–40 cycles
DRAM          16 GB           200–300 cyc     ← naïve accesses land here
```
| Parameter | Value |
|---|---|
| CPU | Intel Core i5-10210U (Comet Lake), 4C/8T, 1.6–4.2 GHz |
| L1-D | 128 KB (4 × 32 KB, 8-way, 64 B lines) |
| L2 | 1 MiB (4 × 256 KB) |
| L3 | 6 MiB shared |
| OS | Linux WSL2 (Ubuntu 24.04), kernel 6.6.87.2 |
| Compiler | GCC 13.3.0, -O3 -march=native -mavx2 -mfma |
| Matrix sizes | N ∈ {256, 512, 1024, 2048}, square, double precision |
| Runs | 5 independent; median reported |
| Timer | clock_gettime(CLOCK_MONOTONIC) |
The full benchmark suite, Cachegrind profiling, and assembly inspection can each be triggered with a single command:

```bash
# Full benchmark
make clean && make && make run

# Valgrind cache profiling
make cachegrind

# AVX2 assembly inspection
gcc -O3 -mavx2 -mfma -I./include -S src/multiply_simd.c -o multiply_simd.s

# Scalar build (no SIMD) for comparison
make nosimd && make run
```

All results are written to results/benchmark_results_aggregated.csv for reproducible post-processing.
Selected key references (full list in paper):
- Drepper, U., What Every Programmer Should Know About Memory, Red Hat, 2007
- Lam, Rothberg, Wolf, The Cache Performance and Optimizations of Blocked Algorithms, ASPLOS IV, 1991
- Williams, Waterman, Patterson, Roofline: An Insightful Visual Performance Model, CACM, 2009
- Goto & van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM TOMS, 2008
- Intel Corporation, Intel® 64 and IA-32 Architectures Optimization Reference Manual, 2023

For the full experimental analysis, see paper/Cache_Optimized_Matrix_Multiplication.pdf.