A production-ready CUDA parallel algorithms library with a four-layer architecture (Layers 0 to 3), supporting education, extensibility, and production use cases.
┌─────────────────────────────────────────────────────────────┐
│  Layer 3: High-Level API (STL-style)                        │
│  cuda::reduce(), cuda::sort()                               │
└─────────────────────────────────────────────────────────────┘
                               ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 2: Algorithm Wrappers                                │
│  cuda::algo::reduce_sum(), memory management                │
└─────────────────────────────────────────────────────────────┘
                               ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 1: Device Kernels                                    │
│  Pure __global__ kernels, no memory allocation              │
└─────────────────────────────────────────────────────────────┘
                               ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 0: Memory Foundation                                 │
│  Buffer<T>, unique_ptr<T>, MemoryPool, Allocator concepts   │
└─────────────────────────────────────────────────────────────┘
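To make the layering concrete, here is a minimal sketch of how a single reduction could pass through Layers 1 to 3. Everything in the `sketch` namespace is hypothetical and written only for illustration; it is not the library's code, and the real kernels and wrappers may be organized differently.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

namespace sketch {  // hypothetical names, not the library's API

constexpr int kThreads = 256;  // block size shared by kernel and launcher

inline void check(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Layer 1 stand-in: a pure kernel. It works only on the pointers it is
// given and never allocates global memory.
__global__ void reduce_sum_kernel(const int* in, int* out, std::size_t n) {
    __shared__ int partial[kThreads];
    const unsigned tid = threadIdx.x;

    // Grid-stride loop: each thread accumulates a private partial sum.
    int local = 0;
    const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + tid;
         i < n; i += stride) {
        local += in[i];
    }
    partial[tid] = local;
    __syncthreads();

    // Tree reduction in shared memory (block size is a power of two).
    for (unsigned s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, partial[0]);  // one atomic per block
}

// Layer 2 stand-in: owns the temporary result storage and the launch.
int reduce_sum(const int* d_in, std::size_t n) {
    int* d_out = nullptr;
    check(cudaMalloc(&d_out, sizeof(int)));
    check(cudaMemset(d_out, 0, sizeof(int)));
    if (n > 0) {
        const std::size_t wanted = (n + kThreads - 1) / kThreads;
        const int blocks = static_cast<int>(std::min<std::size_t>(wanted, 1024));
        reduce_sum_kernel<<<blocks, kThreads>>>(d_in, d_out, n);
        check(cudaGetLastError());
    }
    int result = 0;
    check(cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost));
    check(cudaFree(d_out));
    return result;
}

// Layer 3 stand-in: accepts host data and hides all device details.
int reduce(const std::vector<int>& host) {
    if (host.empty()) return 0;
    int* d_in = nullptr;
    check(cudaMalloc(&d_in, host.size() * sizeof(int)));
    check(cudaMemcpy(d_in, host.data(), host.size() * sizeof(int),
                     cudaMemcpyHostToDevice));
    const int result = reduce_sum(d_in, host.size());
    check(cudaFree(d_in));
    return result;
}

}  // namespace sketch
```

In this sketch the raw `cudaMalloc`/`cudaFree` pairs leak if `check` throws, which is the kind of problem RAII types such as those in Layer 0 are typically meant to prevent.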
include/cuda/
├── memory/ # Layer 0: Memory Foundation
│ ├── buffer.h # cuda::memory::Buffer<T>
│ ├── unique_ptr.h # cuda::memory::unique_ptr<T>
│ ├── memory_pool.h # MemoryPool for allocation
│ └── allocator.h # Allocator concepts
├── device/ # Layer 1: Device Kernels
│ ├── reduce_kernels.h
│ ├── scan_kernels.h
│ └── device_utils.h # CUDA_CHECK, warp_reduce
├── algo/ # Layer 2: Algorithm Wrappers
│ ├── reduce.h
│ ├── scan.h
│ └── sort.h
└── api/ # Layer 3: High-Level API
  ├── device_vector.h # STL-style device container
  ├── stream.h # Stream and Event wrappers
  └── config.h # Algorithm configuration objects
include/
├── image/ # Image processing
│ ├── types.h
│ ├── brightness.h
│ ├── gaussian_blur.h
│ ├── sobel_edge.h
│ └── morphology.h
├── parallel/ # Parallel primitives
│ ├── scan.h
│ ├── sort.h
│ └── histogram.h
├── matrix/ # Matrix operations
│ ├── add.h
│ ├── mult.h
│ └── ops.h
└── convolution/ # Convolution
  └── conv2d.h
src/
├── memory/ # Layer 0 implementations
├── cuda/
│ ├── device/ # Layer 1 implementations
│ └── algo/ # Layer 2 implementations
├── image/
├── parallel/
├── matrix/
└── convolution/
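The tree lists `device_utils.h` as the home of `CUDA_CHECK` and `warp_reduce`, but their definitions are not shown in this README. As an illustration of what such helpers commonly look like, the sketch below defines a hypothetical `SKETCH_CUDA_CHECK` macro and a `sketch_warp_reduce_sum` function built on `__shfl_down_sync`. The names and exact behavior are assumptions, not the library's API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical error-check macro: print the failing location and abort.
#define SKETCH_CUDA_CHECK(call)                                         \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::abort();                                               \
        }                                                               \
    } while (0)

// Sums `val` across the 32 lanes of a full warp using register shuffles.
// After the loop, lane 0 holds the total; other lanes hold partial sums.
__device__ inline int sketch_warp_reduce_sum(int val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;
}

// Demo kernel: a single warp in which each lane contributes its lane index.
__global__ void warp_sum_kernel(int* out) {
    const int total = sketch_warp_reduce_sum(static_cast<int>(threadIdx.x));
    if (threadIdx.x == 0) *out = total;
}

// Host helper: launches one warp and returns the value written by lane 0.
int warp_sum_demo() {
    int* d_out = nullptr;
    SKETCH_CUDA_CHECK(cudaMalloc(&d_out, sizeof(int)));
    warp_sum_kernel<<<1, 32>>>(d_out);
    SKETCH_CUDA_CHECK(cudaGetLastError());
    int h_out = 0;
    SKETCH_CUDA_CHECK(cudaMemcpy(&h_out, d_out, sizeof(int),
                                 cudaMemcpyDeviceToHost));
    SKETCH_CUDA_CHECK(cudaFree(d_out));
    return h_out;
}
```

With 32 lanes contributing 0 through 31, `warp_sum_demo()` should return 496. The full mask `0xffffffffu` assumes all 32 lanes are active, which holds for the `<<<1, 32>>>` launch used here.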
| Layer | Namespace | Purpose | Dependencies |
|---|---|---|---|
| Layer 0 | `cuda::memory` | Memory allocation, RAII, pooling | CUDA runtime |
| Layer 1 | `cuda::device` | Pure device kernels | Layer 0 |
| Layer 2 | `cuda::algo` | Memory management, algorithms | Layers 0, 1 |
| Layer 3 | `cuda::api` | STL-style containers | Layers 0, 1, 2 |
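The table names pooling as a Layer 0 responsibility without saying how it is done. One common approach is a bump allocator over a single device block, sketched below under that assumption. `BumpPool` and its members are hypothetical and say nothing about how the library's `MemoryPool` is actually implemented.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>

namespace sketch {  // hypothetical names, not the library's API

// Reserves one device block up front and hands out aligned slices of it.
// Individual slices are never freed; reset() recycles the whole block.
class BumpPool {
public:
    explicit BumpPool(std::size_t block_size) : capacity_(block_size) {
        if (cudaMalloc(&base_, block_size) != cudaSuccess) {
            throw std::runtime_error("BumpPool: cudaMalloc failed");
        }
    }
    ~BumpPool() { cudaFree(base_); }

    BumpPool(const BumpPool&) = delete;
    BumpPool& operator=(const BumpPool&) = delete;

    // Returns a device pointer to `bytes` bytes, or nullptr if the block
    // does not have enough room left.
    void* allocate(std::size_t bytes) {
        const std::size_t start = (offset_ + kAlign - 1) / kAlign * kAlign;
        if (start > capacity_ || bytes > capacity_ - start) return nullptr;
        offset_ = start + bytes;
        return static_cast<char*>(base_) + start;
    }

    void reset() { offset_ = 0; }                  // invalidates all slices
    std::size_t used() const { return offset_; }   // bytes consumed so far

private:
    static constexpr std::size_t kAlign = 256;  // slice alignment in bytes
    void* base_ = nullptr;
    std::size_t capacity_ = 0;
    std::size_t offset_ = 0;
};

}  // namespace sketch
```

The point of such a design is that repeated allocations cost a few integer operations instead of a `cudaMalloc` call each; the trade-off is that memory is only reclaimed in bulk.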
git clone https://github.com/pplmx/nova.git
cd nova
make build
make run
make test # Run all tests
make test-unit # Run algorithm tests

#include "cuda/memory/buffer.h"
// RAII memory management
cuda::memory::Buffer<int> buf(1024);
buf.copy_from(host_data.data(), 1024);
// Memory pool for efficiency
cuda::memory::MemoryPool pool({.block_size = 1 << 20});
auto buf2 = pool.allocate(1024);

#include "cuda/algo/reduce.h"
// Use layered API
int sum = cuda::algo::reduce_sum(d_input, N);
int max = cuda::algo::reduce_max(d_input, N);

#include "cuda/api/device_vector.h"
#include "cuda/api/stream.h"
#include "cuda/api/config.h"
// DeviceVector - STL-style container
cuda::api::DeviceVector<int> d_vec(N);
d_vec.copy_from(input);
int sum = cuda::algo::reduce_sum(d_vec.data(), d_vec.size());
// Stream - RAII async operations
cuda::api::Stream stream;
stream.synchronize();
// Config - algorithm configuration
auto config = cuda::api::ReduceConfig::optimized_config();

| Module | Files | Description |
|---|---|---|
| cuda::memory | Buffer, unique_ptr, MemoryPool, Allocator | Memory management |
| cuda::device | device_utils, reduce_kernels | Pure CUDA kernels |
| cuda::algo | reduce wrappers, device_buffer | Algorithm orchestration |
| cuda::api | DeviceVector, Stream, Event, Config | High-level API |
| image | types, brightness, gaussian_blur, sobel_edge, morphology | Image processing |
| parallel | scan, sort, histogram | Parallel primitives |
| matrix | add, mult, ops | Matrix operations |
| convolution | conv2d | 2D convolution |
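The usage snippets above construct a `Buffer` and a `Stream` and rely on them to release their resources. As a rough illustration of the RAII pattern behind such types, the sketch below wraps `cudaMalloc`/`cudaFree` and `cudaStreamCreate`/`cudaStreamDestroy`. The classes live in a hypothetical `sketch` namespace; apart from `copy_from`, `data`, `size`, and `synchronize`, which appear in the snippets, the member names (`copy_to`, `get`) and all behavior are assumptions.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical names, not the library's API

inline void check(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Move-only owner of `count` elements of device memory.
template <typename T>
class Buffer {
public:
    explicit Buffer(std::size_t count) : size_(count) {
        check(cudaMalloc(&ptr_, count * sizeof(T)));
    }
    ~Buffer() { cudaFree(ptr_); }  // cudaFree(nullptr) is a no-op

    Buffer(const Buffer&) = delete;
    Buffer& operator=(const Buffer&) = delete;
    Buffer(Buffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          size_(std::exchange(other.size_, 0)) {}
    Buffer& operator=(Buffer&& other) noexcept {
        if (this != &other) {
            cudaFree(ptr_);
            ptr_ = std::exchange(other.ptr_, nullptr);
            size_ = std::exchange(other.size_, 0);
        }
        return *this;
    }

    void copy_from(const T* host, std::size_t count) {
        if (count > size_) throw std::out_of_range("copy_from: count too large");
        check(cudaMemcpy(ptr_, host, count * sizeof(T), cudaMemcpyHostToDevice));
    }
    void copy_to(T* host, std::size_t count) const {
        if (count > size_) throw std::out_of_range("copy_to: count too large");
        check(cudaMemcpy(host, ptr_, count * sizeof(T), cudaMemcpyDeviceToHost));
    }

    T* data() { return ptr_; }
    std::size_t size() const { return size_; }

private:
    T* ptr_ = nullptr;
    std::size_t size_ = 0;
};

// Non-copyable owner of a CUDA stream.
class Stream {
public:
    Stream() { check(cudaStreamCreate(&stream_)); }
    ~Stream() { if (stream_ != nullptr) cudaStreamDestroy(stream_); }
    Stream(const Stream&) = delete;
    Stream& operator=(const Stream&) = delete;

    cudaStream_t get() const { return stream_; }
    void synchronize() { check(cudaStreamSynchronize(stream_)); }

private:
    cudaStream_t stream_ = nullptr;
};

// Uploads `in`, then downloads it again; both resources are released
// automatically when the function returns or throws.
std::vector<int> round_trip(const std::vector<int>& in) {
    Buffer<int> buf(in.size());
    buf.copy_from(in.data(), in.size());
    Stream stream;
    stream.synchronize();
    std::vector<int> out(in.size());
    buf.copy_to(out.data(), out.size());
    return out;
}

}  // namespace sketch
```

Deleting the copy operations keeps exactly one owner per allocation, so the destructor can free unconditionally without reference counting.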
81 tests across 13 test suites, all passing:
| Test Suite | Tests |
|---|---|
| ReduceTest | 11 |
| ScanTest | 10 |
| SortTest | 7 |
| OddEvenSortTest | 3 |
| MatrixMultTest | 7 |
| MatrixOpsTest | 16 |
| ImageBufferTest | 5 |
| GaussianBlurTest | 7 |
| SobelTest | 7 |
| BrightnessTest | 10 |
| TestPatternsTest | 14 |
| Target | Description |
|---|---|
| `make build` | Configure and build project |
| `make run` | Run benchmark demo |
| `make test` | Run all tests (81 tests) |
| `make clean` | Clean build artifacts |
- CUDA Toolkit 12+
- CMake 3.25+
- C++20 compatible compiler
- CUDA-capable GPU
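Some of these requirements can be checked from a small program before building. The sketch below is one possible way to do that with standard CUDA runtime calls; `EnvReport`, `query_environment`, and `meets_requirements` are hypothetical helpers written for this example and are not part of the library. It does not check the CMake version.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical report type for this sketch.
struct EnvReport {
    int runtime_major = 0;            // e.g. 12 for CUDA 12.x
    int runtime_minor = 0;
    int device_count = 0;             // 0 when no usable GPU is visible
    int cc_major = 0;                 // compute capability of device 0
    int cc_minor = 0;
    long cpp_standard = __cplusplus;  // 202002L or higher means C++20
};

// Collects what the CUDA runtime reports; failed queries leave zeros.
EnvReport query_environment() {
    EnvReport r;
    int version = 0;
    if (cudaRuntimeGetVersion(&version) == cudaSuccess) {
        // The runtime encodes its version as 1000 * major + 10 * minor.
        r.runtime_major = version / 1000;
        r.runtime_minor = (version % 1000) / 10;
    }
    int count = 0;
    if (cudaGetDeviceCount(&count) == cudaSuccess) r.device_count = count;
    if (r.device_count > 0) {
        cudaDeviceProp prop{};
        if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
            r.cc_major = prop.major;
            r.cc_minor = prop.minor;
        }
    }
    return r;
}

// True when the toolkit, GPU, and language-standard requirements hold.
bool meets_requirements(const EnvReport& r) {
    return r.runtime_major >= 12 && r.device_count > 0 &&
           r.cpp_standard >= 202002L;
}

void print_report(const EnvReport& r) {
    std::printf("CUDA runtime: %d.%d\n", r.runtime_major, r.runtime_minor);
    std::printf("Devices: %d (device 0 compute capability %d.%d)\n",
                r.device_count, r.cc_major, r.cc_minor);
    std::printf("__cplusplus: %ld\n", r.cpp_standard);
    std::printf("Requirements met: %s\n", meets_requirements(r) ? "yes" : "no");
}
```

Calling `print_report(query_environment())` from `main` prints a short summary. The C++ standard check reflects the flags this file was compiled with, so it only passes when the compiler is invoked in C++20 mode.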
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
See CONTRIBUTING.md.