CUDA Parallel Algorithms Library

A CUDA parallel algorithms library built on a four-layer architecture (Layers 0 through 3), designed for education, extensibility, and production use.

Architecture

Four-Layer Design

┌─────────────────────────────────────────────────────────────┐
│  Layer 3: High-Level API (STL-style)                        │
│  cuda::reduce(), cuda::sort()                               │
└─────────────────────────────────────────────────────────────┘
                              ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 2: Algorithm Wrappers                                │
│  cuda::algo::reduce_sum(), memory management                │
└─────────────────────────────────────────────────────────────┘
                              ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 1: Device Kernels                                    │
│  Pure __global__ kernels, no memory allocation              │
└─────────────────────────────────────────────────────────────┘
                              ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 0: Memory Foundation                                 │
│  Buffer<T>, unique_ptr<T>, MemoryPool, Allocator concepts   │
└─────────────────────────────────────────────────────────────┘

Directory Structure

include/cuda/
├── memory/               # Layer 0: Memory Foundation
│   ├── buffer.h         # cuda::memory::Buffer<T>
│   ├── unique_ptr.h     # cuda::memory::unique_ptr<T>
│   ├── memory_pool.h    # MemoryPool for allocation
│   └── allocator.h      # Allocator concepts
├── device/              # Layer 1: Device Kernels
│   ├── reduce_kernels.h
│   ├── scan_kernels.h
│   └── device_utils.h   # CUDA_CHECK, warp_reduce
├── algo/                 # Layer 2: Algorithm Wrappers
│   ├── reduce.h
│   ├── scan.h
│   └── sort.h
└── api/                  # Layer 3: High-Level API
    ├── device_vector.h   # STL-style device container
    ├── stream.h          # Stream and Event wrappers
    └── config.h          # Algorithm configuration objects

include/
├── image/               # Image processing
│   ├── types.h
│   ├── brightness.h
│   ├── gaussian_blur.h
│   ├── sobel_edge.h
│   └── morphology.h
├── parallel/            # Parallel primitives
│   ├── scan.h
│   ├── sort.h
│   └── histogram.h
├── matrix/              # Matrix operations
│   ├── add.h
│   ├── mult.h
│   └── ops.h
└── convolution/         # Convolution
    └── conv2d.h

src/
├── memory/               # Layer 0 implementations
├── cuda/
│   ├── device/           # Layer 1 implementations
│   └── algo/             # Layer 2 implementations
├── image/
├── parallel/
├── matrix/
└── convolution/

Layer Responsibilities

| Layer   | Namespace      | Purpose                          | Dependencies   |
|---------|----------------|----------------------------------|----------------|
| Layer 0 | `cuda::memory` | Memory allocation, RAII, pooling | CUDA runtime   |
| Layer 1 | `cuda::device` | Pure device kernels              | Layer 0        |
| Layer 2 | `cuda::algo`   | Memory management, algorithms    | Layers 0, 1    |
| Layer 3 | `cuda::api`    | STL-style containers             | Layers 0, 1, 2 |
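
The split between Layers 1 and 2 in the table can be illustrated with a host-only sketch. Everything in it is an assumption made for illustration: the `layer1_sketch` and `layer2_sketch` namespaces and both function names are hypothetical, and a plain CPU loop stands in for a `__global__` kernel so the example compiles without the CUDA toolkit. It is not the library's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for a Layer 1 kernel: it only touches memory it is
// handed and never allocates. A real kernel would be a __global__ function.
namespace layer1_sketch {
inline void chunk_max(const int* in, std::size_t n, int* out, std::size_t chunks) {
    for (std::size_t c = 0; c < chunks; ++c) {          // one iteration per "block"
        int best = std::numeric_limits<int>::lowest();  // identity element for max
        for (std::size_t i = c; i < n; i += chunks) {   // strided walk over the input
            best = std::max(best, in[i]);
        }
        out[c] = best;
    }
}
}  // namespace layer1_sketch

// Hypothetical stand-in for a Layer 2 wrapper: it owns the scratch memory,
// launches the kernel, and hands a plain value back to the caller.
namespace layer2_sketch {
inline int reduce_max(const int* in, std::size_t n) {
    constexpr std::size_t chunks = 4;
    std::vector<int> scratch(chunks);  // the wrapper, not the kernel, allocates
    layer1_sketch::chunk_max(in, n, scratch.data(), chunks);
    return *std::max_element(scratch.begin(), scratch.end());
}
}  // namespace layer2_sketch
```

Keeping allocation out of the kernel layer means the same kernel can be reused with pooled, pre-allocated, or caller-owned memory.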

Quick Start

Build

git clone https://github.com/pplmx/nova.git
cd nova
make build

Run Demo

make run

Run Tests

make test          # Run all tests
make test-unit     # Run algorithm tests

Usage Examples

Layer 0: Memory Foundation

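#include <vector>

#include "cuda/memory/buffer.h"
#include "cuda/memory/memory_pool.h"

// Host-side source data to upload
std::vector<int> host_data(1024, 1);

// RAII memory management: buf owns its allocation for its whole lifetime
cuda::memory::Buffer<int> buf(1024);
buf.copy_from(host_data.data(), 1024);

// Memory pool: allocations are served from pooled blocks (block_size = 1 << 20)
cuda::memory::MemoryPool pool({.block_size = 1 << 20});
auto buf2 = pool.allocate(1024);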
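
For readers new to RAII, the ownership pattern that a type like `Buffer<T>` typically follows can be sketched on the host. `HostBuffer` is a hypothetical name, and `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree` so the sketch compiles without the CUDA toolkit. The real `Buffer<T>` may differ in interface and behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <stdexcept>
#include <type_traits>
#include <utility>

// Hypothetical move-only owning buffer (host memory stands in for device memory).
template <typename T>
class HostBuffer {
    static_assert(std::is_trivially_copyable_v<T>, "raw byte copies require trivial types");

public:
    explicit HostBuffer(std::size_t count)
        : ptr_(static_cast<T*>(std::malloc(count * sizeof(T)))), count_(count) {
        if (count != 0 && ptr_ == nullptr) throw std::bad_alloc{};
    }
    ~HostBuffer() { std::free(ptr_); }  // released automatically at scope exit

    // Copying is disabled so exactly one object owns the allocation.
    HostBuffer(const HostBuffer&) = delete;
    HostBuffer& operator=(const HostBuffer&) = delete;

    // Moving transfers ownership and leaves the source empty.
    HostBuffer(HostBuffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          count_(std::exchange(other.count_, 0)) {}
    HostBuffer& operator=(HostBuffer&& other) noexcept {
        if (this != &other) {
            std::free(ptr_);
            ptr_ = std::exchange(other.ptr_, nullptr);
            count_ = std::exchange(other.count_, 0);
        }
        return *this;
    }

    // Copies count elements from src into the buffer.
    void copy_from(const T* src, std::size_t count) {
        if (count > count_) throw std::length_error("copy_from: source larger than buffer");
        if (count != 0) std::memcpy(ptr_, src, count * sizeof(T));
    }

    T* data() noexcept { return ptr_; }
    const T* data() const noexcept { return ptr_; }
    std::size_t size() const noexcept { return count_; }

private:
    T* ptr_ = nullptr;
    std::size_t count_ = 0;
};
```

Deleting the copy operations makes accidental double frees a compile-time error, which is the main reason owning GPU buffers are usually move-only.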

Layer 2: Algorithm API

#include "cuda/algo/reduce.h"

// d_input is a device pointer to N ints that have already been uploaded
int sum = cuda::algo::reduce_sum(d_input, N);
int max_value = cuda::algo::reduce_max(d_input, N);
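
As background on what a sum reduction computes, here is a host-side emulation of the pairwise (tree) pattern that GPU reductions commonly use. This is a sketch under assumptions: `tree_reduce_sum` is a hypothetical name, and the source does not say which reduction strategy `reduce_sum` actually implements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates a tree reduction: in each round, "thread" i adds the element
// stride positions away, and the stride halves until one value remains.
inline int tree_reduce_sum(std::vector<int> data) {
    if (data.empty()) return 0;

    // Pad to the next power of two with 0, the identity element for addition.
    std::size_t n = 1;
    while (n < data.size()) n <<= 1;
    data.resize(n, 0);

    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {  // on a GPU these run in parallel
            data[i] += data[i + stride];
        }
    }
    return data[0];
}
```

The inner loop has no dependencies between iterations within a round, which is what lets a GPU run each round in parallel and finish in a logarithmic number of rounds.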

Layer 3: High-Level API

#include "cuda/algo/reduce.h"
#include "cuda/api/device_vector.h"
#include "cuda/api/stream.h"
#include "cuda/api/config.h"

// DeviceVector - STL-style container (input is host data holding N ints)
cuda::api::DeviceVector<int> d_vec(N);
d_vec.copy_from(input);
int sum = cuda::algo::reduce_sum(d_vec.data(), d_vec.size());

// Stream - RAII async operations
cuda::api::Stream stream;
stream.synchronize();  // block until all work queued on the stream has finished

// Config - algorithm configuration
auto config = cuda::api::ReduceConfig::optimized_config();
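
A configuration object with named presets, in the spirit of `ReduceConfig::optimized_config()`, might look like the sketch below. The struct name, both fields, the `default_config()` preset, the `grid_size()` helper, and the numeric values are all hypothetical. The source does not document what `ReduceConfig` contains.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical configuration object for a reduction launch.
struct ReduceConfigSketch {
    unsigned block_size = 256;      // threads per block (assumed field)
    unsigned items_per_thread = 1;  // elements each thread accumulates (assumed field)

    // Named presets keep tuning values in one place. The numbers are arbitrary.
    static ReduceConfigSketch default_config() { return {}; }
    static ReduceConfigSketch optimized_config() { return {512, 4}; }

    // Number of blocks needed to cover n elements, using ceiling division.
    std::size_t grid_size(std::size_t n) const {
        const std::size_t per_block = std::size_t{block_size} * items_per_thread;
        return (n + per_block - 1) / per_block;
    }
};
```

With these assumed values, `optimized_config()` covers 2048 elements per block, so an input of 2049 elements needs two blocks.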

Modules

| Module       | Components                                               | Description             |
|--------------|----------------------------------------------------------|-------------------------|
| cuda::memory | Buffer, unique_ptr, MemoryPool, Allocator                | Memory management       |
| cuda::device | device_utils, reduce_kernels                             | Pure CUDA kernels       |
| cuda::algo   | reduce wrappers, device_buffer                           | Algorithm orchestration |
| cuda::api    | DeviceVector, Stream, Event, Config                      | High-level API          |
| image        | types, brightness, gaussian_blur, sobel_edge, morphology | Image processing        |
| parallel     | scan, sort, histogram                                    | Parallel primitives     |
| matrix       | add, mult, ops                                           | Matrix operations       |
| convolution  | conv2d                                                   | 2D convolution          |
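
To make the scan primitive listed under `parallel` concrete, the sketch below emulates a work-efficient (Blelloch-style) exclusive prefix sum on the host. It is illustrative only: `blelloch_exclusive_scan` is a hypothetical name, and the source does not state whether the library's scan is inclusive or exclusive, or which algorithm it uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = a[0] + ... + a[i-1], with out[0] = 0.
inline std::vector<int> blelloch_exclusive_scan(std::vector<int> a) {
    const std::size_t original_size = a.size();
    if (original_size == 0) return a;

    // Pad to the next power of two with zeros so the tree is complete.
    std::size_t n = 1;
    while (n < original_size) n <<= 1;
    a.resize(n, 0);

    // Up-sweep: build partial sums in place, bottom-up.
    for (std::size_t d = 1; d < n; d <<= 1) {
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            a[i] += a[i - d];
        }
    }

    // Down-sweep: clear the root, then push prefix sums back down the tree.
    a[n - 1] = 0;
    for (std::size_t d = n / 2; d >= 1; d >>= 1) {
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            const int left = a[i - d];
            a[i - d] = a[i];
            a[i] += left;
        }
    }

    a.resize(original_size);  // drop the padding
    return a;
}
```

For the input `{1, 2, 3, 4}` the result is `{0, 1, 3, 6}`. A host reference like this is also a convenient oracle when testing a GPU scan.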

Testing

81 tests across 13 test suites, all passing:

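| Test Suite       | Tests |
|------------------|-------|
| ReduceTest       | 11    |
| ScanTest         | 10    |
| SortTest         | 7     |
| OddEvenSortTest  | 3     |
| MatrixMultTest   | 7     |
| MatrixOpsTest    | 16    |
| ImageBufferTest  | 5     |
| GaussianBlurTest | 7     |
| SobelTest        | 7     |
| BrightnessTest   | 10    |
| TestPatternsTest | 14    |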

Development

Makefile Targets

| Target       | Description                 |
|--------------|-----------------------------|
| `make build` | Configure and build project |
| `make run`   | Run benchmark demo          |
| `make test`  | Run all tests (81 tests)    |
| `make clean` | Clean build artifacts       |

Requirements

CUDA Toolkit 12+
CMake 3.25+
C++20 compatible compiler
CUDA-capable GPU

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

See CONTRIBUTING.md.
