This repository contains practical, educational examples of General-Purpose computing on Graphics Processing Units (GPGPU) using the F# programming language. It serves as a hands-on guide to leveraging the Brahma.FSharp library for writing parallel code that executes on OpenCL-compatible devices like GPUs.
The primary goal is to demonstrate how to accelerate common computational problems by offloading them from the CPU to the GPU, showcasing both the performance potential and the implementation patterns in F#.
A few examples showing how to use GPGPU from F# code with Brahma.FSharp.
This project currently includes two classic GPGPU examples:
- Image Convolution: Applies various filters (such as blur, sharpen, and edge detection) to images. This operation is inherently parallel, as each output pixel can be computed independently of its neighbors, making it an ideal candidate for GPU acceleration. (Located in src/ImageProcessing/.)
- Matrix Multiplication: Implements the multiplication of two large matrices on the GPU. This is a fundamental operation in many scientific and engineering domains and perfectly illustrates data-parallel computing. (Located in src/MatrixMultiplication/.) Inspired by Cedric Nugteren's OpenCL SGEMM tutorial.

  Implemented kernels (K0–K4), each building on the previous with progressive optimizations:
| Kernel | Description |
|---|---|
| K0 | Naive: each thread computes one output element, adding each pairwise product directly to the global memory cell of the result matrix |
| K1 | Local accumulator: each thread computes one output element using a mutable local register before writing to global memory once |
| K2 | Local memory tiling: tiles of both input matrices are loaded into local memory for reuse; each thread computes one output element |
| K3 | Increased work per thread: each thread computes WPT output elements from tiles in local memory |
| K4 | 2D register blocking: each thread computes a TTS × TTS tile of the output for maximal data reuse |
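To build intuition for the K0 → K1 → K2 progression, here is a plain C analogue of the first three strategies. This is not the repository's Brahma.FSharp code, just an illustrative CPU sketch: on the GPU each (i, j) pair would be its own thread, and the tile (size `T`, an illustrative constant here) would live in local/shared memory.

```c
#include <assert.h>
#include <stddef.h>

// K0-style: every partial product is added straight into the output cell,
// i.e. one read-modify-write of "global" memory per multiply-add.
void matmul_naive(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            c[i * n + j] = 0.0f;
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
        }
}

// K1-style: accumulate in a private register, write the output cell once.
void matmul_acc(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;   // private register, cheap to update
            for (size_t k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc; // single write to "global" memory
        }
}

// K2-style: walk the k-dimension in tiles of T so each loaded tile is
// reused; on a GPU the tile would be staged in local memory by the group.
#define T 2
void matmul_tiled(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;
    for (size_t kk = 0; kk < n; kk += T)          // one tile at a time
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                float acc = 0.0f;
                for (size_t k = kk; k < kk + T && k < n; k++)
                    acc += a[i * n + k] * b[k * n + j];
                c[i * n + j] += acc;              // add this tile's partial sum
            }
}
```

All three compute the same result; they differ only in how often and where intermediate values are stored, which is exactly what the K0–K4 kernels vary on the GPU.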
Both examples are designed to be simple to understand while demonstrating core concepts like kernel definition, memory management, and execution on a compute device.
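For the convolution example, the core computation can likewise be sketched in plain C (the repository implements it in F# for the GPU). Each output pixel is an independent weighted sum over its neighborhood, so on the GPU every pixel can be handled by its own thread; the zero-padding border policy below is an assumption for illustration, not necessarily what the repository does.

```c
#include <assert.h>

// Minimal single-channel 2D convolution with a 3x3 filter.
// Out-of-bounds neighbors are treated as zero (zero padding).
void convolve3x3(const float *src, float *dst,
                 int width, int height, const float k[9]) {
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int sx = x + dx, sy = y + dy;
                    if (sx >= 0 && sx < width && sy >= 0 && sy < height)
                        sum += src[sy * width + sx] * k[(dy + 1) * 3 + (dx + 1)];
                }
            dst[y * width + x] = sum; // each pixel is independent => parallel
        }
}
```

Different filters are just different 9-element weight arrays: a box blur uses all 1/9, an identity filter has a single 1 in the center, and so on.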
The project is organized for clarity and ease of navigation:
- src/: Contains all source code.
  - ImageProcessing/: The image convolution example and related logic.
  - MatrixMultiplication/: The matrix multiplication implementation.
- tests/: Unit tests for the examples, ensuring correctness.
  - ImageProcessing.Tests/
  - MatrixMultiplication.Tests/
- benchmarks/: Performance benchmarks.
  - MatrixMultiplication.Benchmarks/: The matrix multiplication benchmarks.
- .github/workflows/: GitHub Actions CI/CD pipelines for automated building and testing.
Follow these instructions to get the project up and running on your local machine for development and experimentation.
Before you begin, ensure you have the following installed:
- .NET 9.0 SDK or higher.
- Option A (Recommended for GPU acceleration): An OpenCL-compatible device (e.g., a discrete or integrated GPU) with the corresponding vendor driver installed: NVIDIA drivers for NVIDIA GPUs, ROCm or AMD drivers for AMD GPUs, or the Intel OpenCL runtime for Intel GPUs/CPUs.
- Option B (CPU fallback - great for testing/learning): If you don't have a GPU or want to experiment on CPU first, install POCL (Portable Computing Language). POCL is an open-source OpenCL implementation that runs on CPUs, allowing you to run all examples without dedicated graphics hardware.
  - On Ubuntu/Debian:

    ```shell
    sudo apt install pocl-opencl-icd
    ```

  - Check the official POCL installation guide for other installation options.
1. Clone the repository:

   ```shell
   git clone https://github.com/gsvgit/ImageProcessing.git
   cd ImageProcessing
   ```

2. Build the project. This command compiles the code and restores any necessary NuGet packages:

   ```shell
   dotnet build -c Release
   ```
The benchmarks/MatrixMultiplication.Benchmarks/ project uses BenchmarkDotNet to measure GPU kernel execution times for all 5 matrix multiplication kernels (K0–K4) across matrix sizes 256–2048 and various work-group configurations.
| Class | Extra params | Kernel |
|---|---|---|
| K0Benchmark | — | multiplyKernel0 |
| K1Benchmark | — | multiplyKernel1 |
| K2Benchmark | — | multiplyKernel2 |
| K3Benchmark | WPT: 1, 2, 4, 8 | multiplyKernel3 with workPerThread |
| K4Benchmark | TTS: 1, 2, 4, 8 | multiplyKernel4 with threadTileSize |
Common parameters across all classes:
- `N` — matrix size: 256, 512, 1024, 2048
- `LWS` — local work size: 8, 16, 32, 64, 128, 256 (device-dependent; some values may be invalid)
- Measurement: posts the kernel command (asynchronously via `MailboxProcessor`), then synchronizes with `CreateToHostMsg` on a 1-element buffer, measuring wall-clock GPU execution time
- Data transfer excluded: buffers are allocated and filled with random data in `[GlobalSetup]`, outside the timed portion
- Cleanup: `CreateFreeMsg` on all `ClArray` buffers in `[GlobalCleanup]`
- Invalid configs: fail in `[GlobalSetup]` with a descriptive message; BenchmarkDotNet marks them as NA and continues
```shell
# Full run, all kernels (default device):
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks

# Quick smoke test (ShortRun = 3 warmup + 3 actual iterations):
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --job short --filter *K0Benchmark*

# Selective kernels:
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --filter *K3Benchmark*

# Specific OpenCL device:
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --device nvidia
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --device intel
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --device cpu
```

BenchmarkDotNet passes the remaining CLI arguments (such as `--filter`, `--job`, `--stopOnFirstError`) through to its own parser. Results are exported as CSV, Markdown, and HTML to `BenchmarkDotNet.Artifacts/results/`.