This repository contains practical, educational examples of General-Purpose computing on Graphics Processing Units (GPGPU) using the F# programming language. It serves as a hands-on guide to leveraging the Brahma.FSharp library for writing parallel code that executes on OpenCL-compatible devices like GPUs.
The primary goal is to demonstrate how to accelerate common computational problems by offloading them from the CPU to the GPU, showcasing both the performance potential and the implementation patterns in F#.
A few examples showing how to use GPGPU from F# code with Brahma.FSharp.
This project currently includes two classic GPGPU examples:
- Image Convolution: Applies various filters (such as blur, sharpen, and edge detection) to images. This operation is inherently parallel, as each output pixel can be computed independently of its neighbors, making it an ideal candidate for GPU acceleration. (Located in src/ImageProcessing/.)
- Matrix Multiplication: Implements the multiplication of two large matrices on the GPU. This is a fundamental operation in many scientific and engineering domains and perfectly illustrates data-parallel computing. (Located in src/MatrixMultiplication/.) Inspired by Cedric Nugteren's OpenCL SGEMM tutorial.

  Implemented kernels (K0–K4), each building on the previous with progressive optimizations:
| Kernel | Description |
|---|---|
| K0 | Naive: each thread computes one output element, adding each pairwise product directly to the global memory cell of the result matrix |
| K1 | Local accumulator: each thread computes one output element using a mutable local register before writing to global memory once |
| K2 | Local memory tiling: tiles of both input matrices are loaded into local memory for reuse; each thread computes one output element |
| K3 | Increased work per thread: each thread computes WPT output elements from tiles in local memory |
| K4 | 2D register blocking: each thread computes a TTS × TTS tile of the output for maximal data reuse |
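To build intuition for the K0 → K1 → K2 progression, here is a plain C analogue of the first three strategies. This is not the repository's Brahma.FSharp code, just an illustrative CPU sketch: on the GPU each (i, j) pair would be its own thread, and the tile (size `T`, an illustrative constant here) would live in local/shared memory.

```c
#include <assert.h>
#include <stddef.h>

// K0-style: every partial product is added straight into the output cell,
// i.e. one read-modify-write of "global" memory per multiply-add.
void matmul_naive(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            c[i * n + j] = 0.0f;
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
        }
}

// K1-style: accumulate in a private register, write the output cell once.
void matmul_acc(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;   // private register, cheap to update
            for (size_t k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc; // single write to "global" memory
        }
}

// K2-style: walk the k-dimension in tiles of T so each loaded tile is
// reused; on a GPU the tile would be staged in local memory by the group.
#define T 2
void matmul_tiled(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;
    for (size_t kk = 0; kk < n; kk += T)          // one tile at a time
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                float acc = 0.0f;
                for (size_t k = kk; k < kk + T && k < n; k++)
                    acc += a[i * n + k] * b[k * n + j];
                c[i * n + j] += acc;              // add this tile's partial sum
            }
}
```

All three compute the same result; they differ only in how often and where intermediate values are stored, which is exactly what the K0–K4 kernels vary on the GPU.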
Both examples are designed to be simple to understand while demonstrating core concepts like kernel definition, memory management, and execution on a compute device.
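For the convolution example, the core computation can likewise be sketched in plain C (the repository implements it in F# for the GPU). Each output pixel is an independent weighted sum over its neighborhood, so on the GPU every pixel can be handled by its own thread; the zero-padding border policy below is an assumption for illustration, not necessarily what the repository does.

```c
#include <assert.h>

// Minimal single-channel 2D convolution with a 3x3 filter.
// Out-of-bounds neighbors are treated as zero (zero padding).
void convolve3x3(const float *src, float *dst,
                 int width, int height, const float k[9]) {
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int sx = x + dx, sy = y + dy;
                    if (sx >= 0 && sx < width && sy >= 0 && sy < height)
                        sum += src[sy * width + sx] * k[(dy + 1) * 3 + (dx + 1)];
                }
            dst[y * width + x] = sum; // each pixel is independent => parallel
        }
}
```

Different filters are just different 9-element weight arrays: a box blur uses all 1/9, an identity filter has a single 1 in the center, and so on.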
The project is organized for clarity and ease of navigation:
- src/: Contains all source code.
  - ImageProcessing/: The image convolution example and related logic.
  - MatrixMultiplication/: The matrix multiplication implementation.
- tests/: Unit tests for the examples, ensuring correctness.
  - ImageProcessing.Tests/
  - MatrixMultiplication.Tests/
- benchmarks/: Performance benchmarks.
  - MatrixMultiplication.Benchmarks/: The matrix multiplication benchmarks.
- .github/workflows/: GitHub Actions CI/CD pipelines for automated building and testing.
Follow these instructions to get the project up and running on your local machine for development and experimentation.
Before you begin, ensure you have the following installed:
- .NET 9.0 SDK or higher.
- Option A (Recommended for GPU acceleration): An OpenCL-compatible device (e.g., a discrete or integrated GPU) with the corresponding vendor driver installed: NVIDIA drivers for NVIDIA GPUs, ROCm or AMD drivers for AMD GPUs, or the Intel OpenCL runtime for Intel GPUs/CPUs.
- Option B (CPU fallback - great for testing/learning): If you don't have a GPU or want to experiment on CPU first, install POCL (Portable Computing Language). POCL is an open-source OpenCL implementation that runs on CPUs, allowing you to run all examples without dedicated graphics hardware.
  - On Ubuntu/Debian:

    ```shell
    sudo apt install pocl-opencl-icd
    ```

  - Check the official POCL installation guide for other installation options.
1. Clone the repository:

   ```shell
   git clone https://github.com/gsvgit/ImageProcessing.git
   cd ImageProcessing
   ```

2. Build the project. This command compiles the code and restores any necessary NuGet packages:

   ```shell
   dotnet build -c Release
   ```
The benchmarks/MatrixMultiplication.Benchmarks/ project uses BenchmarkDotNet to measure GPU kernel execution times for all 5 matrix multiplication kernels (K0–K4) across matrix sizes 256–2048 and various work-group configurations.
| Class | Extra params | Kernel |
|---|---|---|
| K0Benchmark | — | multiplyKernel0 |
| K1Benchmark | — | multiplyKernel1 |
| K2Benchmark | — | multiplyKernel2 |
| K3Benchmark | WPT: 1, 2, 4, 8 | multiplyKernel3 with workPerThread |
| K4Benchmark | TTS: 1, 2, 4, 8 | multiplyKernel4 with threadTileSize |
Common parameters across all classes:
- `N` — matrix size: 256, 512, 1024, 2048
- `LWS` — local work size: 8, 16, 32, 64, 128, 256 (device-dependent; some values may be invalid)
- Measurement: posts the kernel command (asynchronously via `MailboxProcessor`), then synchronizes with `CreateToHostMsg` on a 1-element buffer, measuring wall-clock GPU execution time
- Data transfer excluded: buffers are allocated and filled with random data in `[GlobalSetup]`, outside the timed portion
- Cleanup: `CreateFreeMsg` on all `ClArray` buffers in `[GlobalCleanup]`
- Invalid configs: fail in `[GlobalSetup]` with a descriptive message; BenchmarkDotNet marks them as NA and continues
```shell
# Full run, all kernels (default device):
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks

# Quick smoke test (ShortRun = 3 warmup + 3 actual iterations):
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --job short --filter *K0Benchmark*

# Selective kernels:
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --filter *K3Benchmark*

# Specific OpenCL device:
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --device nvidia
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --device intel
dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --device cpu
```

BenchmarkDotNet passes the remaining CLI arguments (such as `--filter`, `--job`, `--stopOnFirstError`) through to its own parser. Results are exported as CSV, Markdown, and HTML to `BenchmarkDotNet.Artifacts/results/`.