gsvgit/ImageProcessing

Brahma.FSharp GPGPU Examples: Image Processing & Matrix Multiplication

This repository contains practical, educational examples of General-Purpose computing on Graphics Processing Units (GPGPU) using the F# programming language. It serves as a hands-on guide to leveraging the Brahma.FSharp library for writing parallel code that executes on OpenCL-compatible devices like GPUs.

The primary goal is to demonstrate how to accelerate common computational problems by offloading them from the CPU to the GPU, showcasing both the performance potential and the implementation patterns in F#.

In short: a few examples of how to use GPGPU from F# code with Brahma.FSharp.

✨ Features

This project currently includes two classic GPGPU examples:

  1. Image Convolution: Applies various filters (such as blur, sharpen, and edge detection) to images. The operation is inherently parallel: each output pixel depends only on a small neighborhood of input pixels and can be computed independently of every other output pixel, which makes it an ideal candidate for GPU acceleration. Located in src/ImageProcessing/.

  2. Matrix Multiplication: Implements the multiplication of two large matrices on the GPU. This is a fundamental operation in many scientific and engineering domains and a standard illustration of data-parallel computing. Located in src/MatrixMultiplication/. Inspired by Cedric Nugteren's OpenCL SGEMM tutorial.

    Implemented kernels (K0–K4), each building on the previous one with progressive optimizations:

    | Kernel | Description |
    |--------|-------------|
    | K0 | Naive: each thread computes one output element, adding each pairwise product directly to the global memory cell of the result matrix |
    | K1 | Local accumulator: each thread computes one output element using a mutable local register before writing to global memory once |
    | K2 | Local memory tiling: tiles of both input matrices are loaded into local memory for reuse, each thread computes one output element |
    | K3 | Increased work per thread: each thread computes WPT output elements from tiles in local memory |
    | K4 | 2D register blocking: each thread computes a TTS × TTS tile of the output for maximal data reuse |
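The first three steps in the kernel descriptions above can be modeled on the CPU in plain F#. The sketch below is only an illustration of the ideas, not the Brahma.FSharp kernels from src/MatrixMultiplication/: the function names `multiplyDirect`, `multiplyAccumulate`, and `multiplyTiled` are hypothetical, the matrices are assumed to be square and stored row-major in flat arrays, and ordinary arrays stand in for local memory.

```fsharp
// CPU models of the ideas behind K0, K1 and K2 (hypothetical helpers, not
// the kernels from this repository). Matrices are n x n, row-major.

// K0 idea: every partial product is added directly to the result cell.
let multiplyDirect (n: int) (a: float32[]) (b: float32[]) : float32[] =
    let c = Array.zeroCreate<float32> (n * n)
    for row in 0 .. n - 1 do
        for col in 0 .. n - 1 do
            for k in 0 .. n - 1 do
                c.[row * n + col] <- c.[row * n + col] + a.[row * n + k] * b.[k * n + col]
    c

// K1 idea: accumulate in a local variable and write the result cell once.
let multiplyAccumulate (n: int) (a: float32[]) (b: float32[]) : float32[] =
    let c = Array.zeroCreate<float32> (n * n)
    for row in 0 .. n - 1 do
        for col in 0 .. n - 1 do
            let mutable acc = 0.0f
            for k in 0 .. n - 1 do
                acc <- acc + a.[row * n + k] * b.[k * n + col]
            c.[row * n + col] <- acc
    c

// K2 idea: copy one tile of A and one tile of B into small scratch arrays
// (stand-ins for local memory), then reuse them for a whole output tile.
let multiplyTiled (n: int) (tile: int) (a: float32[]) (b: float32[]) : float32[] =
    if tile <= 0 || n % tile <> 0 then
        invalidArg "tile" "tile must be positive and divide n"
    let tiles = n / tile
    let c = Array.zeroCreate<float32> (n * n)
    let aTile = Array2D.zeroCreate<float32> tile tile
    let bTile = Array2D.zeroCreate<float32> tile tile
    for tileRow in 0 .. tiles - 1 do
        for tileCol in 0 .. tiles - 1 do
            // One accumulator per output element of the current tile.
            let acc = Array2D.zeroCreate<float32> tile tile
            for t in 0 .. tiles - 1 do
                // Load phase: fill both scratch tiles.
                for i in 0 .. tile - 1 do
                    for j in 0 .. tile - 1 do
                        aTile.[i, j] <- a.[(tileRow * tile + i) * n + t * tile + j]
                        bTile.[i, j] <- b.[(t * tile + i) * n + tileCol * tile + j]
                // Compute phase: every loaded value is reused `tile` times.
                for i in 0 .. tile - 1 do
                    for j in 0 .. tile - 1 do
                        for k in 0 .. tile - 1 do
                            acc.[i, j] <- acc.[i, j] + aTile.[i, k] * bTile.[k, j]
            // Write the finished output tile back once.
            for i in 0 .. tile - 1 do
                for j in 0 .. tile - 1 do
                    c.[(tileRow * tile + i) * n + tileCol * tile + j] <- acc.[i, j]
    c
```

All three functions add the partial products in the same order, so they return identical arrays. On a GPU the difference lies in how often global memory is touched, which this CPU model only mimics through the explicit load and write-back phases.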

Both examples are designed to be simple to understand while demonstrating core concepts like kernel definition, memory management, and execution on a compute device.
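To make the image convolution example concrete, here is a minimal CPU reference in plain F#. It is a sketch under stated assumptions rather than the code in src/ImageProcessing/: the name `convolve` is hypothetical, the image is assumed to be a single-channel `float[,]`, borders are handled by clamping coordinates to the nearest edge, and the filter is applied without flipping (which makes no difference for symmetric filters such as a box blur).

```fsharp
// CPU reference model of 2D image filtering (hypothetical helper, not code
// from this repository). The filter is applied without flipping.
let convolve (image: float[,]) (kernel: float[,]) : float[,] =
    let h = Array2D.length1 image
    let w = Array2D.length2 image
    let kh = Array2D.length1 kernel
    let kw = Array2D.length2 kernel
    let clamp lo hi v = max lo (min hi v)
    // Array2D.init evaluates the function once per output pixel. Each call
    // reads only the input image, so the calls are independent of each
    // other; on a GPU each one would map to a separate work item.
    Array2D.init h w (fun y x ->
        let mutable acc = 0.0
        for i in 0 .. kh - 1 do
            for j in 0 .. kw - 1 do
                // Clamp so that border pixels reuse the nearest edge value.
                let yy = clamp 0 (h - 1) (y + i - kh / 2)
                let xx = clamp 0 (w - 1) (x + j - kw / 2)
                acc <- acc + image.[yy, xx] * kernel.[i, j]
        acc)

// Usage: a 3 x 3 box blur applied to a small gradient image.
let boxBlur = Array2D.create 3 3 (1.0 / 9.0)
let gradient = Array2D.init 4 5 (fun y x -> float (y * 5 + x))
let blurred = convolve gradient boxBlur
printfn "%A" blurred
```

A CPU reference of this kind is also a convenient oracle for unit tests: the GPU result can be compared against it element by element, within a small tolerance.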


📁 Repository Structure

The project is organized for clarity and ease of navigation:

  • src/: Contains all source code.
    • ImageProcessing/: The image convolution example and related logic.
    • MatrixMultiplication/: The matrix multiplication implementation.
  • tests/: Unit tests for the examples, ensuring correctness.
    • ImageProcessing.Tests/
    • MatrixMultiplication.Tests/
  • benchmarks/: Performance benchmarks.
    • MatrixMultiplication.Benchmarks/: The matrix multiplication benchmarks.
  • .github/workflows/: GitHub Actions CI/CD pipelines for automated building and testing.

🚀 Getting Started

Follow these instructions to get the project up and running on your local machine for development and experimentation.

Prerequisites

Before you begin, ensure you have the following installed:

  • .NET 9.0 SDK or higher.
  • Option A (Recommended for GPU acceleration): An OpenCL-compatible device (e.g., a discrete or integrated GPU) with the corresponding vendor driver installed, e.g., NVIDIA drivers for NVIDIA GPUs, ROCm or AMD drivers for AMD GPUs, or the Intel OpenCL runtime for Intel GPUs/CPUs.
  • Option B (CPU fallback - great for testing/learning): If you don't have a GPU or want to experiment on CPU first, install POCL (Portable Computing Language). POCL is an open-source OpenCL implementation that runs on CPUs, allowing you to run all examples without dedicated graphics hardware.

Installation & Build

  1. Clone the repository:

    git clone https://github.com/gsvgit/ImageProcessing.git
    cd ImageProcessing
  2. Build the project: This command restores the required NuGet packages and compiles the code.

    dotnet build -c Release

📊 Matrix Multiplication Benchmarks

The benchmarks/MatrixMultiplication.Benchmarks/ project uses BenchmarkDotNet to measure GPU kernel execution times for all 5 matrix multiplication kernels (K0–K4) across matrix sizes 256–2048 and various work-group configurations.

Benchmark classes

| Class | Extra params | Kernel |
|-------|--------------|--------|
| K0Benchmark | (none) | multiplyKernel0 |
| K1Benchmark | (none) | multiplyKernel1 |
| K2Benchmark | (none) | multiplyKernel2 |
| K3Benchmark | WPT: 1, 2, 4, 8 | multiplyKernel3 with workPerThread |
| K4Benchmark | TTS: 1, 2, 4, 8 | multiplyKernel4 with threadTileSize |

Common parameters across all classes:

  • N — matrix size: 256, 512, 1024, 2048
  • LWS — local work size: 8, 16, 32, 64, 128, 256 (device-dependent, some values may be invalid)
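The size of the resulting parameter space follows from the values listed above. The sketch below counts the cases, assuming each class contains one benchmark method and every combination of its parameters is run. The `isPlausible` check is purely an assumption made for illustration and is not taken from this repository: it treats LWS as the side of a square 2D work-group and takes the device limit as an argument.

```fsharp
// Parameter values as listed in this README.
let sizes = [ 256; 512; 1024; 2048 ]             // N
let localWorkSizes = [ 8; 16; 32; 64; 128; 256 ] // LWS
let extraValues = [ 1; 2; 4; 8 ]                 // WPT for K3, TTS for K4

// K0, K1 and K2 have no extra parameter: one case per (N, LWS) pair.
let baseCases = List.allPairs sizes localWorkSizes

// K3 and K4 each add one extra parameter.
let extendedCases =
    [ for (n, lws) in baseCases do
        for extra in extraValues -> (n, lws, extra) ]

let totalCases = 3 * List.length baseCases + 2 * List.length extendedCases

// ASSUMPTION (not stated in this README): LWS is the side of a square 2D
// work-group, so N must be divisible by LWS and LWS * LWS must not exceed
// the device's maximum work-group size. The real kernels may impose
// different or additional constraints.
let isPlausible (maxWorkGroupSize: int) (n: int, lws: int) =
    n % lws = 0 && lws * lws <= maxWorkGroupSize

// 1024 is an example limit only; the real value is device-dependent.
let plausibleCases = baseCases |> List.filter (isPlausible 1024)

printfn "%d cases in total; %d of %d base cases pass the assumed check"
    totalCases (List.length plausibleCases) (List.length baseCases)
```

Under these assumptions a full run covers several hundred cases, which is why the filtered and short-job invocations are useful during development.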

Design

  • Measurement: posts the kernel command (asynchronously, via MailboxProcessor), then synchronizes with CreateToHostMsg on a 1-element buffer, so the timed region covers the wall-clock time until the GPU has finished executing the kernel
  • Data transfer excluded: buffers are allocated and filled with random data in [GlobalSetup], outside the timed portion
  • Cleanup: CreateFreeMsg on all ClArray buffers in [GlobalCleanup]
  • Invalid configs: fail in [GlobalSetup] with a descriptive message → BenchmarkDotNet marks them as NA and continues
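The "post asynchronously, then synchronize" measurement pattern described above can be sketched with the standard F# `MailboxProcessor` alone. This is a simplified stand-in, not the Brahma.FSharp command queue: the `QueueMsg` type and the `startQueue` and `timeMs` functions are hypothetical, and an ordinary function plays the role of the kernel.

```fsharp
open System.Diagnostics

// Hypothetical message type standing in for a device command queue.
type QueueMsg =
    | Run of (unit -> unit)           // fire-and-forget work, like a kernel launch
    | Sync of AsyncReplyChannel<unit> // answered once all earlier messages are done

// Starts an agent that processes messages strictly in order.
let startQueue () =
    MailboxProcessor<QueueMsg>.Start(fun inbox ->
        let rec loop () =
            async {
                let! msg = inbox.Receive()
                match msg with
                | Run work -> work ()
                | Sync reply -> reply.Reply()
                return! loop ()
            }
        loop ())

// Times one unit of work: Post returns immediately, PostAndReply blocks
// until the agent has processed everything queued before the Sync message.
let timeMs (queue: MailboxProcessor<QueueMsg>) (work: unit -> unit) : float =
    let sw = Stopwatch.StartNew()
    queue.Post(Run work)
    queue.PostAndReply(fun channel -> Sync channel)
    sw.Stop()
    sw.Elapsed.TotalMilliseconds

// Usage: a 50 ms sleep plays the role of the kernel.
let demoQueue = startQueue ()
let demoMs = timeMs demoQueue (fun () -> System.Threading.Thread.Sleep 50)
printfn "elapsed: %.1f ms" demoMs
```

Without the blocking synchronization step the stopwatch would stop right after the message is posted and would measure only the cost of enqueueing the command.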

How to run

    # Full run of all kernels (default device):
    dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks

    # Quick smoke test (ShortRun = 3 warmup + 3 actual iterations);
    # the filter is quoted so that the shell does not expand the wildcard:
    dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --job short --filter "*K0Benchmark*"

    # Selected kernels only:
    dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --filter "*K3Benchmark*"

    # Specific OpenCL device:
    dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --device nvidia
    dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --device intel
    dotnet run -c Release --project benchmarks/MatrixMultiplication.Benchmarks -- --device cpu

BenchmarkDotNet passes remaining CLI arguments (like --filter, --job, --stopOnFirstError) through to its own parser. Results are exported as CSV, Markdown, and HTML to BenchmarkDotNet.Artifacts/results/.
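One way such an argument split could work is sketched below. The helper `splitDeviceArg` is hypothetical and is not taken from the benchmark project: it assumes the device is given as `--device <name>` and that everything else should be forwarded unchanged.

```fsharp
// Hypothetical helper (not from this repository): extracts "--device <name>"
// and returns the remaining arguments in their original order.
let splitDeviceArg (args: string[]) : string option * string[] =
    match Array.tryFindIndex (fun arg -> arg = "--device") args with
    | Some i when i + 1 < args.Length ->
        let rest = Array.append (Array.take i args) (Array.skip (i + 2) args)
        Some(args.[i + 1]), rest
    | _ ->
        // No device given (or no value after the flag): forward everything.
        None, args

// Usage: the device name is consumed, the rest would go to BenchmarkDotNet.
let device, forwarded =
    splitDeviceArg [| "--device"; "nvidia"; "--filter"; "*K3Benchmark*" |]
printfn "device = %A, forwarded = %A" device forwarded
```

In a real entry point the forwarded array would then be handed to BenchmarkDotNet, for example through `BenchmarkSwitcher.FromAssembly(...).Run(forwarded)`. How the project actually wires this up is not shown in this README.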
