IEEE-style research project: Design, implementation and experimental evaluation of cache-aware matrix multiplication in C, targeting the x86-64 architecture with AVX2/FMA SIMD vectorization.
Authors: Dila Öykü Eyüboğlu · Burcu Kösedağı
Affiliation: Computer Engineering, Istanbul Commerce University
Platform: Intel Core i5-10210U · Linux WSL2 · GCC 13.3 · AVX2 + FMA
- 40.25× speedup over the naive baseline
- 455× reduction in LLC cache misses
- D1 miss rate reduced from 98.4% to 3.9%
- Fully verified with Valgrind & Cachegrind
| Matrix Size | Naïve Time | Optimized Time | SIMD Time | Speedup (Opt) | Speedup (SIMD) |
|---|---|---|---|---|---|
| N = 256 | 0.027 s | 0.005 s | 0.008 s | 5.15× | 3.33× |
| N = 512 | 0.266 s | 0.054 s | 0.078 s | 4.94× | 3.40× |
| N = 1024 | 8.007 s | 0.434 s | 0.591 s | 18.46× | 13.54× |
| N = 2048 | 142.5 s | 3.54 s | 5.18 s | 40.25× | 27.51× |
- Overview
- Implementations
- Project Structure
- Requirements
- Build & Run
- Output Format
- Cache Profiling
- Performance Counters
- Correctness Verification
- Optimization Techniques
- Architecture Notes
- Reproducibility
Matrix multiplication (C = A × B) is a foundational computational kernel in machine learning, scientific simulation, and graphics rendering. Despite its O(N³) arithmetic complexity, real-world performance is dictated by memory access patterns, not raw floating-point throughput.
This project demonstrates the memory wall effect quantitatively:
- Naïve (N=2048): 0.12 GFLOPS, a 90% degradation from N=256
- Optimized (N=2048): 4.85 GFLOPS, stable across all sizes
- SIMD (N=2048): 3.32 GFLOPS, stable across all sizes
The naïve implementation accesses matrix B column-wise (stride-N), causing virtually every access to miss the L1 cache. Loop interchange, tiling, and vectorization eliminate this bottleneck.
Standard i → j → k loop order. B is accessed with stride N, which is cache-hostile.

```c
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i*N+j] += A[i*N+k] * B[k*N+j];  // stride-N miss on B
```

Problem: for N=1024, each B access spans 8,192 bytes between rows → an L1 miss on every iteration.
Three combined techniques:

① Loop Interchange (i → k → j)

```c
for (i = 0; i < N; i++)
    for (k = 0; k < N; k++) {
        double a_ik = A[i*N+k];           // hoisted to register
        for (j = 0; j < N; j++)
            C[i*N+j] += a_ik * B[k*N+j];  // stride-1 on B and C
    }
```

② Cache Blocking (BLOCK_SIZE = 64)
```c
#define BLOCK_SIZE 64
for (ii = 0; ii < N; ii += BLOCK_SIZE)
    for (kk = 0; kk < N; kk += BLOCK_SIZE)
        for (jj = 0; jj < N; jj += BLOCK_SIZE)
            // inner 64×64 tile fits in the 32 KB L1-D cache
```

A single 64×64 tile of doubles = 64 × 64 × 8 B = 32,768 bytes, exactly the capacity of one 32 KB L1-D cache. Three tiles (A, B, C) = 96 KB, which fits in L2 (256 KB).
③ Manual Loop Unrolling (×4)

```c
for (; j + 3 < j_max; j += 4) {
    C[i*N+j]   += a_ik * B[k*N+j];
    C[i*N+j+1] += a_ik * B[k*N+j+1];
    C[i*N+j+2] += a_ik * B[k*N+j+2];
    C[i*N+j+3] += a_ik * B[k*N+j+3];
}
for (; j < j_max; j++)                // scalar remainder
    C[i*N+j] += a_ik * B[k*N+j];
```

The SIMD variant replaces the scalar inner loop with 256-bit FMA intrinsics, processing 4 doubles per instruction:
```c
#ifdef __AVX2__
#include <immintrin.h>
__m256d va = _mm256_set1_pd(a_ik);            // broadcast scalar to 4 lanes
for (; j + 3 < j_max; j += 4) {
    __m256d vb = _mm256_loadu_pd(&B[k*N+j]);
    __m256d vc = _mm256_loadu_pd(&C[i*N+j]);
    vc = _mm256_fmadd_pd(va, vb, vc);         // vc += va * vb (4 doubles per FMA)
    _mm256_storeu_pd(&C[i*N+j], vc);
}
#endif
```

GCC-emitted assembly (verified via gcc -S):
```asm
vbroadcastsd %xmm2, %ymm1            ; broadcast a_ik to all 4 lanes
vmovupd      (%r8,%rcx,8), %ymm0     ; load 4 B values
vfmadd213pd  (%rdi), %ymm1, %ymm0    ; fused multiply-add (4 doubles)
vmovupd      %ymm0, (%rdi)           ; store 4 C values
```

```
cache_optimization_project/
│
├── include/
│   ├── matrix.h                # Matrix struct + allocation API
│   ├── multiply.h              # Function declarations (naive / opt / simd)
│   ├── timer.h                 # High-resolution timer (POSIX + Win32)
│   └── verify.h                # Element-wise correctness check
│
├── src/
│   ├── main.c                  # Benchmark driver + CSV export
│   ├── matrix.c                # create_matrix(), free_matrix(), fill_matrix()
│   ├── multiply_naive.c        # Baseline i→j→k
│   ├── multiply_opt.c          # Blocked + unrolled (i→k→j, BLOCK_SIZE=64)
│   ├── multiply_simd.c         # AVX2 FMA vectorized
│   ├── timer.c                 # clock_gettime / QueryPerformanceCounter
│   └── verify.c                # compare_matrices() with ε = 1×10⁻⁹
│
├── figures/
│   ├── speedup_plot.png        # Speedup vs. N
│   ├── gflops_plot.png         # Throughput vs. N
│   ├── runtime_plot.jpg        # Execution time (log scale)
│   ├── cache_miss_plot.jpg     # D1 & LLd miss rates (Cachegrind)
│   ├── memory_hierarchy.jpg    # x86-64 cache hierarchy diagram
│   └── system_architecture.jpg # Benchmark pipeline diagram
│
├── results/
│   ├── benchmark_results_aggregated.csv  # Median of 5 runs per config
│   ├── benchmark_results_all_runs.csv    # Raw per-run data
│   └── cachegrind_report.txt             # Valgrind cache profiling output
│
├── paper/
│   └── Cache_Optimized_Matrix_Multiplication.pdf
│
├── scripts/
│   ├── run_cachegrind.sh       # Valgrind Cachegrind profiling (Linux)
│   └── run_perf.sh             # Linux perf hardware counters
│
├── makefile
└── README.md
```
| Tool | Version | Purpose |
|---|---|---|
| GCC | ≥ 9.0 | Compilation with AVX2 (-mavx2 -mfma) |
| make | any | Build system |
| valgrind + cg_annotate | any | Cache profiling (make cachegrind) |
| perf | any | Hardware performance counters (make perf) |
```bash
sudo apt update
sudo apt install gcc make valgrind linux-tools-generic
```

Note: AVX2 support is required; the valgrind and perf targets are Linux-only.

Verify CPU support:

```bash
lscpu | grep -E "avx2|fma"
# Expected: avx2 and fma in the Flags section
```

Build:

```bash
make
```

Compiler flags: -O3 -Wall -Wextra -std=c11 -march=native -mavx2 -mfma

Scalar build (no SIMD):

```bash
make nosimd
```

Run:

```bash
make run
# or directly:
./matrix_app
```

Benchmarks N ∈ {256, 512, 1024, 2048}, 5 runs each. Results are written to benchmark_results.csv.

Inspect the generated AVX2 assembly:

```bash
gcc -O3 -mavx2 -mfma -I./include -S src/multiply_simd.c -o multiply_simd.s
grep -A5 "vfmadd" multiply_simd.s
```

Clean:

```bash
make clean
```

Sample output (N = 1024):

```
AVX2 ENABLED
Cache-Optimized Matrix Multiplication Benchmark
N = 1024
C[0][0]      naive=2048.00  opt=2048.00  simd=2048.00
Correctness  opt=PASS  simd=PASS
Time(s)      naive=8.007  opt=0.434  simd=0.591
Speedup      opt=18.46x  simd=13.54x
GFLOPS       naive=0.268  opt=4.951  simd=3.631
```
| Column | Description |
|---|---|
| N | Matrix size |
| Correct_Opt | Correctness of optimized (PASS/FAIL) |
| Correct_SIMD | Correctness of SIMD (PASS/FAIL) |
| NaiveTime | Median wall-clock time (s) |
| OptTime | Median wall-clock time (s) |
| SIMDTime | Median wall-clock time (s) |
| Speedup_Opt | T_naive / T_opt |
| Speedup_SIMD | T_naive / T_simd |
| NaiveGFLOPS | 2N³ / (NaiveTime × 10⁹) |
| OptGFLOPS | 2N³ / (OptTime × 10⁹) |
| SIMDGFLOPS | 2N³ / (SIMDTime × 10⁹) |
```bash
make cachegrind
# or manually:
valgrind --tool=cachegrind --cache-sim=yes --branch-sim=yes \
         --cachegrind-out-file=cachegrind.out ./matrix_app
cg_annotate --show=Ir,Dr,D1mr,DLmr,Dw,D1mw,DLmw \
            cachegrind.out > cachegrind_report.txt
```

| Implementation | D1 Read Misses | LLd Read Misses | D1 Miss Rate | LLd Miss Rate |
|---|---|---|---|---|
| Naïve | ≈1.08×10⁹ | ≈1.07×10⁹ | ≈98.4% | ≈93.1% |
| Optimized | ≈1.56×10⁸ | ≈2.36×10⁶ | ≈3.9% | ≈0.19% |
| SIMD | ≈1.56×10⁸ | ≈2.36×10⁶ | ≈3.8% | ≈0.18% |

455× reduction in LLd read misses · 25× reduction in D1 miss rate
```bash
make perf
# or:
bash scripts/run_perf.sh
```

Reports hardware events (real cache misses, CPI, pipeline stalls) via Linux perf stat. Output is saved to perf_report.txt.
All implementations are verified against the naïve baseline with element-wise tolerance ε = 1×10⁻⁹:

∀ i, j : |C_opt[i][j] − C_naive[i][j]| ≤ 1×10⁻⁹

Test case: A[i][j] = 1.0, B[i][j] = 2.0 → expected C[i][j] = 2N for all i, j.
All three implementations pass for all N across all 5 runs.
Memory safety: verified with valgrind --tool=memcheck, which reports zero leaks and zero invalid accesses across all configurations.
In the i → j → k loop order, B[k][j] is accessed column-wise:
- For N=1024: stride = 1024 × 8 bytes = 8,192 bytes between consecutive accesses
- The 32 KB L1 cache holds only 512 64-byte lines, and the stride-N walk uses one double per line, so a full column of B (1024 lines) cannot stay resident → no reuse of B in L1
- Result: D1 miss rate ≈ 98.4%; the CPU stalls on every inner-loop iteration
| Technique | Effect |
|---|---|
| Loop interchange (i→k→j) | All three matrices traversed stride-1; a_ik hoisted to a register |
| Cache blocking (64×64) | Working set (3 tiles = 96 KB) fits in L2; misses drop from O(N³) to O(N²/B) |
| Loop unrolling (×4) | 4 independent MACs per iteration; improves instruction-level parallelism |
| AVX2 FMA | 4 doubles per instruction; fused multiply-add halves the arithmetic instruction count |
```
CPU Registers (YMM)           ~1 cycle        ← a_ik hoisted here
L1-D Cache    32 KB           4–5 cycles      ← one 64×64 tile (32 KB) fits here
L2 Cache      256 KB          ~12 cycles      ← three tiles (96 KB) fit here
L3 Cache      6 MiB           30–40 cycles
DRAM          16 GB           200–300 cyc     ← naïve accesses land here
```
| Parameter | Value |
|---|---|
| CPU | Intel Core i5-10210U (Comet Lake), 4C/8T, 1.6–4.2 GHz |
| L1-D | 128 KB (4 × 32 KB, 8-way, 64 B lines) |
| L2 | 1 MiB (4 × 256 KB) |
| L3 | 6 MiB shared |
| OS | Linux WSL2 (Ubuntu 24.04), kernel 6.6.87.2 |
| Compiler | GCC 13.3.0, -O3 -march=native -mavx2 -mfma |
| Matrix sizes | N ∈ {256, 512, 1024, 2048}, square, double precision |
| Runs | 5 independent; median reported |
| Timer | clock_gettime(CLOCK_MONOTONIC) |
The full benchmark suite, Cachegrind profiling, and assembly inspection can each be triggered with a single command:

```bash
# Full benchmark
make clean && make && make run

# Valgrind cache profiling
make cachegrind

# AVX2 assembly inspection
gcc -O3 -mavx2 -mfma -I./include -S src/multiply_simd.c -o multiply_simd.s

# Scalar build (no SIMD) for comparison
make nosimd && make run
```

All results are written to results/benchmark_results_aggregated.csv for reproducible post-processing.
Selected key references (full list in paper):
- Drepper, U., What Every Programmer Should Know About Memory, Red Hat, 2007
- Lam, Rothberg, Wolf, The Cache Performance and Optimizations of Blocked Algorithms, ASPLOS IV, 1991
- Williams, Waterman, Patterson, Roofline: An Insightful Visual Performance Model, CACM, 2009
- Goto & van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM TOMS, 2008
- Intel Corporation, Intel® 64 and IA-32 Architectures Optimization Reference Manual, 2023

For the full experimental analysis, see paper/Cache_Optimized_Matrix_Multiplication.pdf.