
Cache-Optimized Matrix Multiplication

IEEE-style research project – Design, implementation, and experimental evaluation of cache-aware matrix multiplication in C, targeting the x86-64 architecture with AVX2/FMA SIMD vectorization.

Authors: Dila Öykü Eyüboğlu · Burcu Kösedağı
Affiliation: Computer Engineering, Istanbul Commerce University
Platform: Intel Core i5-10210U · Linux WSL2 · GCC 13.3 · AVX2 + FMA


Key Results

  • 40.25× speedup over the naive baseline
  • 455× reduction in LLC cache misses
  • 98.4% → 3.9% D1 miss rate reduction
  • Fully verified with Valgrind & Cachegrind

| Matrix Size | Naïve Time | Optimized Time | SIMD Time | Speedup (Opt) | Speedup (SIMD) |
|---|---|---|---|---|---|
| N = 256  | 0.027 s | 0.005 s | 0.008 s | 5.15×  | 3.33×  |
| N = 512  | 0.266 s | 0.054 s | 0.078 s | 4.94×  | 3.40×  |
| N = 1024 | 8.007 s | 0.434 s | 0.591 s | 18.46× | 13.54× |
| N = 2048 | 142.5 s | 3.54 s  | 5.18 s  | 40.25× | 27.51× |

Overview

Matrix multiplication (C = A × B) is a foundational computational kernel in machine learning, scientific simulation, and graphics rendering. Despite its O(N³) arithmetic complexity, real-world performance is dictated by memory access patterns, not raw floating-point throughput.

This project demonstrates the memory wall effect quantitatively:

Naïve (N=2048):       0.12 GFLOPS   ← 90% degradation from N=256
Optimized (N=2048):   4.85 GFLOPS   ← stable across all sizes
SIMD (N=2048):        3.32 GFLOPS   ← stable across all sizes

The naïve implementation accesses matrix B column-wise (stride-N), causing virtually every access to miss the L1 cache. Loop interchange, tiling, and vectorization eliminate this bottleneck.
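The effect can be reproduced with a minimal traversal demo (hypothetical helper names, not part of this repository): both functions visit the same N×N elements and compute the same sum, but the column-wise walk lands on a new cache line at every access.

```c
/* Sum all elements of a row-major N x N matrix two ways.
 * Same arithmetic, same result -- only the access pattern,
 * and therefore the cache miss rate, differs. */
double sum_rowwise(const double *M, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += M[i * n + j];      /* stride-1: next element, same cache line */
    return s;
}

double sum_colwise(const double *M, int n) {
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += M[i * n + j];      /* stride-N: a different cache line each access */
    return s;
}
```

Timing these two on a large matrix shows the same gap the benchmark tables quantify for the full multiply.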


Implementations

1. Naïve (multiply_naive.c)

Standard i → j → k loop order. B is accessed with stride N, which is cache-hostile.

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i*N+j] += A[i*N+k] * B[k*N+j];  // stride-N miss on B

Problem: For N=1024, each B access spans 8,192 bytes between rows → an L1 miss on every iteration.


2. Optimized (multiply_opt.c)

Three combined techniques:

① Loop Interchange (i → k → j)

for (i = 0; i < N; i++)
    for (k = 0; k < N; k++) {
        double a_ik = A[i*N+k];   // hoisted to register
        for (j = 0; j < N; j++)
            C[i*N+j] += a_ik * B[k*N+j];  // stride-1 on B and C
    }

② Cache Blocking (BLOCK_SIZE = 64)

#define BLOCK_SIZE 64
for (ii = 0; ii < N; ii += BLOCK_SIZE)
  for (kk = 0; kk < N; kk += BLOCK_SIZE)
    for (jj = 0; jj < N; jj += BLOCK_SIZE)
      // inner 64×64 tile fits in the 32 KB L1-D cache

A single 64×64 tile of doubles = 32,768 bytes = exactly the 32 KB L1-D cache. The three active tiles (A, B, C) total 96 KB, which fits comfortably in the 256 KB L2.
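Putting the interchange and blocking together yields a kernel along these lines (a sketch with a hypothetical `multiply_blocked`, simplified from what multiply_opt.c does; C is assumed zero-initialized by the caller):

```c
#include <stddef.h>

#define BLOCK_SIZE 64   /* one 64x64 tile of doubles = 32 KB */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked i->k->j multiply: the three outer loops walk 64x64 tiles,
 * the three inner loops do a stride-1 mini-multiply inside each tile.
 * min_sz() handles N not divisible by BLOCK_SIZE. */
void multiply_blocked(const double *A, const double *B, double *C, size_t N) {
    for (size_t ii = 0; ii < N; ii += BLOCK_SIZE)
        for (size_t kk = 0; kk < N; kk += BLOCK_SIZE)
            for (size_t jj = 0; jj < N; jj += BLOCK_SIZE)
                for (size_t i = ii; i < min_sz(ii + BLOCK_SIZE, N); i++)
                    for (size_t k = kk; k < min_sz(kk + BLOCK_SIZE, N); k++) {
                        double a_ik = A[i * N + k];      /* hoisted to register */
                        for (size_t j = jj; j < min_sz(jj + BLOCK_SIZE, N); j++)
                            C[i * N + j] += a_ik * B[k * N + j];  /* stride-1 */
                    }
}
```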

③ Manual Loop Unrolling (×4)

for (; j + 3 < j_max; j += 4) {
    C[i*N+j]   += a_ik * B[k*N+j];
    C[i*N+j+1] += a_ik * B[k*N+j+1];
    C[i*N+j+2] += a_ik * B[k*N+j+2];
    C[i*N+j+3] += a_ik * B[k*N+j+3];
}
for (; j < j_max; j++)   // scalar remainder
    C[i*N+j] += a_ik * B[k*N+j];

3. SIMD AVX2 (multiply_simd.c)

Replaces the scalar inner loop with 256-bit FMA intrinsics, processing 4 doubles per instruction:

#ifdef __AVX2__
#include <immintrin.h>
__m256d va = _mm256_set1_pd(a_ik);         // broadcast scalar to 4 lanes
for (; j + 3 < j_max; j += 4) {
    __m256d vb = _mm256_loadu_pd(&B[k*N+j]);
    __m256d vc = _mm256_loadu_pd(&C[i*N+j]);
    vc = _mm256_fmadd_pd(va, vb, vc);       // vc += va * vb (4 doubles per FMA)
    _mm256_storeu_pd(&C[i*N+j], vc);
}
#endif

GCC-emitted assembly (verified via gcc -S):

vbroadcastsd  %xmm2, %ymm1       ; broadcast a_ik to all 4 lanes
vmovupd       (%r8,%rcx,8), %ymm0 ; load 4 B values
vfmadd213pd   (%rdi), %ymm1, %ymm0 ; fused multiply-add (4 doubles)
vmovupd       %ymm0, (%rdi)        ; store 4 C values
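The intrinsics above can be wrapped into a portable helper that falls back to scalar code when AVX2 is unavailable, so the same translation unit compiles everywhere (a sketch with a hypothetical `axpy_row`, not the exact multiply_simd.c code):

```c
#include <stddef.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* C[i][j..] += a_ik * B[k][j..] for one row segment.
 * AVX2 path handles 4 doubles per FMA; the scalar tail
 * doubles as the full fallback when __AVX2__ is undefined. */
void axpy_row(double a_ik, const double *b_row, double *c_row, size_t len) {
    size_t j = 0;
#ifdef __AVX2__
    __m256d va = _mm256_set1_pd(a_ik);          /* broadcast scalar to 4 lanes */
    for (; j + 3 < len; j += 4) {
        __m256d vb = _mm256_loadu_pd(&b_row[j]);
        __m256d vc = _mm256_loadu_pd(&c_row[j]);
        vc = _mm256_fmadd_pd(va, vb, vc);       /* vc += va * vb */
        _mm256_storeu_pd(&c_row[j], vc);
    }
#endif
    for (; j < len; j++)                        /* scalar remainder / fallback */
        c_row[j] += a_ik * b_row[j];
}
```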

Project Structure

cache_optimization_project/
│
├── include/
│   ├── matrix.h          # Matrix struct + allocation API
│   ├── multiply.h        # Function declarations (naive / opt / simd)
│   ├── timer.h           # High-resolution timer (POSIX + Win32)
│   └── verify.h          # Element-wise correctness check
│
├── src/
│   ├── main.c            # Benchmark driver + CSV export
│   ├── matrix.c          # create_matrix(), free_matrix(), fill_matrix()
│   ├── multiply_naive.c  # Baseline i→j→k
│   ├── multiply_opt.c    # Blocked + unrolled (i→k→j, BLOCK_SIZE=64)
│   ├── multiply_simd.c   # AVX2 FMA vectorized
│   ├── timer.c           # clock_gettime / QueryPerformanceCounter
│   └── verify.c          # compare_matrices() with ε = 1×10⁻⁹
│
├── figures/
│   ├── speedup_plot.png          # Speedup vs. N
│   ├── gflops_plot.png           # Throughput vs. N
│   ├── runtime_plot.jpg          # Execution time (log scale)
│   ├── cache_miss_plot.jpg       # D1 & LLd miss rates (Cachegrind)
│   ├── memory_hierarchy.jpg      # x86-64 cache hierarchy diagram
│   └── system_architecture.jpg   # Benchmark pipeline diagram
│
├── results/
│   ├── benchmark_results_aggregated.csv  # Median of 5 runs per config
│   ├── benchmark_results_all_runs.csv    # Raw per-run data
│   └── cachegrind_report.txt             # Valgrind cache profiling output
│
├── paper/
│   └── Cache_Optimized_Matrix_Multiplication.pdf
│
├── scripts/
│   ├── run_cachegrind.sh   # Valgrind Cachegrind profiling (Linux)
│   └── run_perf.sh         # Linux perf hardware counters
│
├── makefile
└── README.md

Requirements

| Tool | Version | Purpose |
|---|---|---|
| GCC | ≥ 9.0 | Compilation with AVX2 (-mavx2 -mfma) |
| make | any | Build system |
| valgrind + cg_annotate | any | Cache profiling (make cachegrind) |
| perf | any | Hardware performance counters (make perf) |

Install on Linux (Ubuntu/Debian)

sudo apt update
sudo apt install gcc make valgrind linux-tools-generic

Windows (MinGW/MSYS2)

AVX2 is supported; the valgrind and perf targets are Linux-only.

Verify AVX2 support

lscpu | grep -E "avx2|fma"
# Expected: avx2 fma in Flags section

Build & Run

Build with AVX2 SIMD (default)

make

Compiler flags: -O3 -Wall -Wextra -std=c11 -march=native -mavx2 -mfma

Build without SIMD (scalar fallback)

make nosimd

Run full benchmark suite

make run
# or directly:
./matrix_app

Benchmarks N ∈ {256, 512, 1024, 2048}, 5 runs each. Results are written to benchmark_results.csv.

Inspect AVX2 assembly

gcc -O3 -mavx2 -mfma -I./include -S src/multiply_simd.c -o multiply_simd.s
grep -A5 "vfmadd" multiply_simd.s

Clean build artifacts

make clean

Output Format

Terminal Output (example at N=1024)

AVX2 ENABLED
Cache-Optimized Matrix Multiplication Benchmark

N = 1024
C[0][0]     naive=2048.00   opt=2048.00   simd=2048.00
Correctness opt=PASS         simd=PASS
Time(s)     naive=8.007      opt=0.434     simd=0.591
Speedup     opt=18.46x       simd=13.54x
GFLOPS      naive=0.268      opt=4.951     simd=3.631

CSV Columns (benchmark_results.csv)

| Column | Description |
|---|---|
| N | Matrix size |
| Correct_Opt | Correctness of optimized (PASS/FAIL) |
| Correct_SIMD | Correctness of SIMD (PASS/FAIL) |
| NaiveTime | Median wall-clock time (s) |
| OptTime | Median wall-clock time (s) |
| SIMDTime | Median wall-clock time (s) |
| Speedup_Opt | T_naive / T_opt |
| Speedup_SIMD | T_naive / T_simd |
| NaiveGFLOPS | 2N³ / (T × 10⁹) |
| OptGFLOPS | 2N³ / (T × 10⁹) |
| SIMDGFLOPS | 2N³ / (T × 10⁹) |
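The GFLOPS columns follow from counting one multiply and one add per inner-loop iteration, i.e. 2N³ floating-point operations per run; as a sketch:

```c
/* GFLOPS for an N x N x N multiply: 2*N^3 floating-point ops
 * (one multiply + one add per innermost iteration) divided by
 * wall-clock seconds, scaled to giga-ops. */
double gflops(double n, double seconds) {
    return (2.0 * n * n * n) / (seconds * 1e9);
}
```

For example, the N = 1024 naïve row (8.007 s) gives 2·1024³ / (8.007 × 10⁹) ≈ 0.268 GFLOPS, matching the terminal output above.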

Cache Profiling

Run Valgrind Cachegrind

make cachegrind
# or manually:
valgrind --tool=cachegrind --cache-sim=yes --branch-sim=yes \
         --cachegrind-out-file=cachegrind.out ./matrix_app

cg_annotate --show=Ir,Dr,D1mr,DLmr,Dw,D1mw,DLmw \
            cachegrind.out > cachegrind_report.txt

Results at N=1024 (Valgrind Cachegrind)

| Implementation | D1 Read Misses | LLd Read Misses | D1 Miss Rate | LLd Miss Rate |
|---|---|---|---|---|
| Naïve | ≈1.08×10⁹ | ≈1.07×10⁹ | ≈98.4% | ≈93.1% |
| Optimized | ≈1.56×10⁸ | ≈2.36×10⁶ | ≈3.9% | ≈0.19% |
| SIMD | ≈1.56×10⁸ | ≈2.36×10⁶ | ≈3.8% | ≈0.18% |

455× reduction in LLd read misses · 25× reduction in D1 miss rate


Performance Counters

make perf
# or:
bash scripts/run_perf.sh

Reports hardware events (real cache misses, CPI, pipeline stalls) via Linux perf stat. Output saved to perf_report.txt.


Correctness Verification

All implementations are verified against the naïve baseline with element-wise tolerance ε = 1×10⁻⁹:

∀ i, j :  |C_opt[i][j] − C_naive[i][j]| ≤ 1×10⁻⁹

Test case: A[i][j] = 1.0, B[i][j] = 2.0 → expected C[i][j] = 2N for all i, j.
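A minimal version of that check looks like the following (hypothetical signature; verify.c's actual API may differ):

```c
#include <math.h>
#include <stddef.h>

#define EPSILON 1e-9

/* Element-wise comparison against the naive reference with the
 * tolerance stated above; returns 1 on PASS, 0 on FAIL. */
int compare_matrices(const double *ref, const double *test, size_t n) {
    for (size_t idx = 0; idx < n * n; idx++)
        if (fabs(ref[idx] - test[idx]) > EPSILON)
            return 0;   /* FAIL */
    return 1;           /* PASS */
}
```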

All three implementations pass for all N across all 5 runs.

Memory safety: verified with valgrind --tool=memcheck: zero leaks, zero invalid accesses across all configurations.


Optimization Techniques

Why naรฏve is slow

In the i → j → k loop order, B[k][j] is accessed column-wise:

  • For N=1024: stride = 1024 × 8 bytes = 8,192 bytes between consecutive accesses
  • The 32 KB L1 cache holds only 512 64-byte lines, while one column walk of B touches 1,024 distinct lines → no reuse of B in L1
  • Result: D1 miss rate ≈ 98.4%; the CPU stalls on every inner-loop iteration
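The reuse argument reduces to counting distinct 64-byte lines touched; a hypothetical helper makes the arithmetic explicit:

```c
#include <stddef.h>

/* Number of distinct cache lines touched by n_accesses consecutive
 * accesses at a fixed byte stride. When the stride is at least one
 * line (the column walk of B: 8192 >= 64), every access lands on
 * its own line; otherwise accesses share lines. */
size_t lines_touched(size_t n_accesses, size_t stride_bytes, size_t line_bytes) {
    if (stride_bytes >= line_bytes)
        return n_accesses;                                   /* one line per access */
    return (n_accesses * stride_bytes + line_bytes - 1) / line_bytes;
}
```

For N = 1024 the column walk touches 1,024 lines, twice the 512-line L1 capacity, while a row walk of the same length touches only 128.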

Why the optimized version is fast

| Technique | Effect |
|---|---|
| Loop interchange (i→k→j) | All three matrices traversed stride-1; a_ik hoisted to a register |
| Cache blocking (64×64) | Working set (3 tiles = 96 KB) fits in L2; misses drop from O(N³) to O(N³/B) |
| Loop unrolling (×4) | 4 independent MACs per iteration; improves instruction-level parallelism |
| AVX2 FMA | 4 doubles per instruction; fused multiply-add halves the arithmetic instruction count |

Memory hierarchy fit (Intel Core i5-10210U)

CPU Registers  (YMM) │ 1 cycle      │ ← a_ik hoisted here
L1-D Cache  32 KB    │ 4–5 cycles   │ ← one 64×64 tile (32 KB) fits here
L2 Cache   256 KB    │ ~12 cycles   │ ← three tiles (96 KB) fit here
L3 Cache     6 MiB   │ 30–40 cycles │
DRAM        16 GB    │ 200–300 cyc  │ ← naïve accesses land here

Experimental Setup

| Parameter | Value |
|---|---|
| CPU | Intel Core i5-10210U (Comet Lake), 4C/8T, 1.6–4.2 GHz |
| L1-D | 128 KB (4 × 32 KB, 8-way, 64 B lines) |
| L2 | 1 MiB (4 × 256 KB) |
| L3 | 6 MiB shared |
| OS | Linux WSL2 (Ubuntu 24.04), kernel 6.6.87.2 |
| Compiler | GCC 13.3.0, -O3 -march=native -mavx2 -mfma |
| Matrix sizes | N ∈ {256, 512, 1024, 2048}, square, double precision |
| Runs | 5 independent; median reported |
| Timer | clock_gettime(CLOCK_MONOTONIC) |

Reproducibility

The full benchmark suite, Cachegrind profiling, and assembly inspection can each be triggered with a single command:

# Full benchmark
make clean && make && make run

# Valgrind cache profiling
make cachegrind

# AVX2 assembly inspection
gcc -O3 -mavx2 -mfma -I./include -S src/multiply_simd.c -o multiply_simd.s

# Scalar build (no SIMD) for comparison
make nosimd && make run

All results are written to results/benchmark_results_aggregated.csv for reproducible post-processing.


References

Selected key references (full list in paper):

  1. Drepper, U. – What Every Programmer Should Know About Memory, Red Hat, 2007
  2. Lam, Rothberg, Wolf – The Cache Performance and Optimizations of Blocked Algorithms, ASPLOS IV, 1991
  3. Williams, Waterman, Patterson – Roofline: An Insightful Visual Performance Model, CACM, 2009
  4. Goto & van de Geijn – Anatomy of High-Performance Matrix Multiplication, ACM TOMS, 2008
  5. Intel Corporation – Intel® 64 and IA-32 Architectures Optimization Reference Manual, 2023

For the full experimental analysis, see paper/Cache_Optimized_Matrix_Multiplication.pdf.

About

High-performance matrix multiplication in C with cache blocking, loop optimizations, and AVX2 SIMD acceleration.
