
SwiftMetalNumerics

A fast, modern library for GPU-accelerated linear algebra, DSP, and basic neural network layers on Apple platforms.

graph LR
    CPU[CPU<br>Accelerate / LAPACK] <-->|Zero-Copy| RAM[(Apple Silicon<br>Unified Memory)]
    GPU[GPU<br>Metal Performance Shaders] <-->|Zero-Copy| RAM

Built specifically for Apple Silicon, SwiftMetalNumerics takes advantage of unified memory to perform zero-copy operations. It uses Metal Performance Shaders (MPS) for maximum GPU performance and automatically falls back to CPU-optimized Accelerate / LAPACK paths when needed.

Why use this?

  • Apple Silicon Optimized: Native zero-copy data binding. CPU and GPU share the same physical memory pages, meaning no expensive data transfers before doing math.
  • Async & Safe: Built from the ground up for Swift Concurrency. Heavy math operations are wrapped in simple async/await calls and pushed to background threads so your UI never freezes. Everything is Sendable.
  • Core ML Friendly: Seamlessly convert between our matrices and MLMultiArray without copying the underlying memory.
  • Batteries Included: Supports FP32 and FP16 data types (FP16 requires compatible GPU support for MPSGraph).

What's Inside

🧮 Linear Algebra

Operation      Details
Basic Math     Matrix Multiply (A × B), Hadamard (.*), Trace, Transpose
Element-wise   sqrt, exp, log, sin, min, max, mean
Advanced Ops   Determinant, Inverse, Solve (Ax = b), SVD, Cholesky

📡 DSP & Neural Networks

Feature                  Details
FFT (Signal)             Forward/Inverse 1D FFT for real & complex signals (power-of-two length)
STFT                     STFT over real signals → magnitude or complex-interleaved spectrogram
FFT (Tensor)             Forward/Inverse 2D FFT on rank-2 tensors (power-of-two dims)
Convolution (Matrix)     2D single-channel image convolution (NumericMatrix.convolved)
Layers (Matrix)          LinearLayer on [batch × features] matrices
Layers (Tensor / NCHW)   Conv2DLayer, MaxPooling2DLayer, AveragePooling2DLayer on 4D NCHW tensors
Activations              ReLU, Sigmoid, Softmax

Benchmarks

All benchmarks were run on an iPhone 15 Pro (iPhone16,1 / A17 Pro).

🚀 Highlight: DSP & FFT (14× GPU Speedup)

The GPU FFT implementation delivers roughly a 14× speedup over the CPU (vDSP) path on large 1D transforms.

Operation   Workload                           GPU Time (MPSGraph)     CPU Time (vDSP)
1D FFT      1,048,576 samples (2^20)           0.0188 s                0.2646 s
2D FFT      2048×2048 real                     11.65 s (Forward)       —
STFT        10M samples, frame 2048, hop 256   92.50 s (~39k frames)   —

Matrix Multiplication (GEMM)

GPU (Metal) vs CPU (Accelerate) for n × n matrices.

Engineering Note: For sizes under 8192, CPU (Accelerate) outperforms the GPU. This is expected on Apple Silicon: Accelerate uses the undocumented AMX matrix coprocessor, which has near-zero dispatch overhead compared with Metal. GPU paths shine in batched or complex ML pipelines where data stays resident on the GPU.

Size (n×n)   CPU FP32 (s)   GPU FP32 (s)   GPU FP16 (s)
1024         0.0064         0.0153         0.2338
2048         0.0313         0.0361         0.9391
4096         0.1936         0.2109         3.8201

Neural Inference

End-to-end pipeline: Conv2D → ReLU → MaxPool

Precision   Wall Time (s)   Workload / Notes
FP32        2.2100          ~4.37 GFLOPS (convolution-dominated estimate)
FP16        2.0962          Native FP16 (no FP32 casts)

FP32 → FP16 speedup: 1.05×

Advanced Linear Algebra (LAPACK)

These operations use Accelerate/LAPACK CPU paths natively. GFLOPS estimates use classical flop-count models.

Operation         Time (s)   Est. GFLOPS   Notes
Symmetric Eigen   0.8651     6.62          λ₀…λ₂: 2, 2, 2
SVD               1.0538     4.08          Thin SVD (LAPACK)
QR                0.5376     2.66          sgeqrf / sorgqr
Cholesky          0.2545     1.41          SPD tridiagonal band

Requirements

  • Swift 6.2
  • macOS 14 / iOS 17 / tvOS 17 or newer
  • Xcode / SPM

Installation

You can integrate SwiftMetalNumerics into your project using Swift Package Manager.

Add the following to your Package.swift file's dependencies:

dependencies: [
  .package(url: "https://github.com/acemoglu/SwiftMetalNumerics.git", branch: "main")
]
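
Then add the library to your app target (assuming the product name is SwiftMetalNumerics, matching the module import used below — check Package.swift if it differs):

.target(
    name: "MyApp",
    dependencies: [
        .product(name: "SwiftMetalNumerics", package: "SwiftMetalNumerics")
    ]
)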

Or, in Xcode, go to File > Add Package Dependencies... and enter the repository URL.


Usage examples

Almost all numerical operations are async throws and should be called from an async context (e.g. Task { ... } or an async function). Import the module once:

import SwiftMetalNumerics
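
From synchronous code (a button handler, a test setup), wrap calls in a Task; errors propagate as the task's failure. A minimal sketch using the matrix API shown below:

Task {
    // Heavy math runs off the main thread; see "Matrices" below for the API.
    let A = try NumericMatrix(rows: 2, columns: 2, elements: [1, 2, 3, 4])
    let B = try NumericMatrix.identity(size: 2)
    let C = try await A.multiplied(by: B)
    _ = C
}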

GPU availability

Optional: check Metal availability before assuming GPU-backed paths (useful for timing, UI decisions, and tests).

if SwiftMetalNumericsCapabilities.isMetalGPUAvailable {
    // GPU paths are preferred; the library still falls back to CPU where implemented
}

Matrices: create, fill, and read data

Row-major storage: element index for row i, column j is i * columns + j.

// From explicit elements (row-major [Float])
let A = try NumericMatrix(rows: 2, columns: 3, elements: [1, 2, 3, 4, 5, 6])

// Zero-filled
let Z = try NumericMatrix(rows: 4, columns: 4)

// Identity and diagonal helpers
let I = try NumericMatrix.identity(size: 5)
let D = try NumericMatrix.diagonal(values: [1, 2, 3])

// Read-only / mutable access to scalars without extra copies of the backing store
let sum = A.withFloatBuffer { buf in
    buf.reduce(0, +)
}

var B = A
try B.withMutableFloatBuffer { ptr in
    if let base = ptr.baseAddress { base[0] = 99 }
}
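
As a quick check of the row-major formula above: row 1, column 2 of A lives at index 1 * 3 + 2 = 5, so it reads back as 6.

let a12 = A.withFloatBuffer { buf in buf[1 * 3 + 2] }   // 6.0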

Matrix multiply (GEMM), transpose, and Hadamard

A * B is matrix multiplication when both operands are NumericMatrix. Element-wise product uses hadamard or .*.

let M = try NumericMatrix(rows: 2, columns: 2, elements: [1, 2, 3, 4])
let N = try NumericMatrix(rows: 2, columns: 2, elements: [5, 6, 7, 8])

let product = try await M.multiplied(by: N)
// same as: try await M * N

let Mt = try await M.transposed()
let H = try await M.hadamard(N)
// or: try await M .* N

// Scalar scaling: matrix * scalar or scalar * matrix (not matrix × matrix)
let scaled = try await M * 0.5
let scaled2 = try await 2.0 * M

// CPU-only GEMM (e.g. benchmarks vs GPU):
let cpu = try await M.multipliedUsingAccelerateCPU(by: N)

Element-wise arithmetic and math functions

Shapes must match for +, -, and Hadamard.

let S = try await M.adding(N)
let D = try await M.subtracting(N)
// Operators: try await M + N, try await M - N

let E = try await M.scaled(by: 0.25)
let sq = try await M.sqrt()
let e = try await M.exp()
let l = try await M.log()
let s = try await M.sin()

Reductions and trace

let mx = try await M.max()
let mn = try await M.min()
let mu = try await M.mean()
let t = try await I.trace()   // square matrices only

Linear systems, inverse, determinant, SVD, Cholesky

These use LAPACK (Accelerate) on copies; they run off the caller’s executor via Task.detached.

// Solve A x = b for square A
let A = try NumericMatrix(rows: 3, columns: 3, elements: [
    2, 0, 0,
    0, 2, 0,
    0, 0, 2
])
let x = try await A.solve(b: [1, 2, 3])

let inv = try await A.inversed()
let det = try await A.determinant()
let (U, S, Vt) = try await A.svd()
let L = try await A.cholesky()   // symmetric positive-definite A only
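
To sanity-check the factorization, rebuild A from the SVD factors. This sketch assumes S is returned as a plain [Float] of singular values (thin SVD), which the call above does not spell out:

// A ≈ U · diag(S) · Vᵀ — exact here, since A = 2·I
let Sigma = try NumericMatrix.diagonal(values: S)
let US = try await U.multiplied(by: Sigma)
let reconstructed = try await US.multiplied(by: Vt)
_ = reconstructed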

2D convolution (single-channel matrix)

Kernel height and width must be odd, and the kernel must fit inside the single-channel image. The GPU path uses MPSGraph; the CPU fallback uses Accelerate (vDSP_imgfir), subject to its documented constraints. For stride/padding control in models, use Conv2DLayer (NCHW) below.

let height = 32
let width = 32
let image = try NumericMatrix(
    rows: height,
    columns: width,
    elements: Array(repeating: 0.1, count: height * width)
)

let kernel = try NumericMatrix(rows: 3, columns: 3, elements: [
    0, 0.25, 0,
    0.25, 0, 0.25,
    0, 0.25, 0
])

let out = try await image.convolved(with: kernel)

FFT: real, complex, inverse (1D signal)

Length must be a power of two and at least 2.

// Real input → complex spectrum
let n = 512
let real = (0..<n).map { Float($0) }
var signal = Signal1D(realSamples: real)
let spectrum = try await signal.forwardFFT()

// Inverse spectrum → time-domain signal
let reconstructed = try await spectrum.inverseFFT()

// General complex input
let complexSamples: [ComplexFloat] = (0..<n).map {
    ComplexFloat(real: Float($0), imaginary: 0)
}
signal = Signal1D(samples: complexSamples)
let spec2 = try await signal.forwardFFT()

// CPU-only FFT (benchmarks / forcing vDSP):
let cpuSpectrum = try await Signal1D.forwardFFTReal1DUsingAccelerate(real)

STFT: magnitude or complex-interleaved spectrogram

frameSize must be a power of two and at least 2. hopSize must be positive.

let sr = 48_000
let seconds: Float = 1.0
let count = Int(Float(sr) * seconds)
let samples = (0..<count).map { i in
    // 440 Hz sine
    sin(2 * .pi * 440 * Float(i) / Float(sr))
}

// Magnitude spectrogram: shape [numFrames, frameSize]
let mag = try await STFT.compute(
    signal: samples,
    frameSize: 1024,
    hopSize: 256,
    window: .hann,
    outputKind: .magnitude
)

// Complex interleaved: shape [numFrames, frameSize, 2] where [...,0]=real [...,1]=imag
let complex = try await STFT.compute(
    signal: samples,
    frameSize: 1024,
    hopSize: 256,
    window: .hann,
    outputKind: .complexInterleaved
)
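
The frame count follows the usual non-padded sliding-window arithmetic, numFrames = (count - frameSize) / hopSize + 1. That formula is an assumption, but it matches the ~39k frames reported for the 10M-sample STFT benchmark above.

// (48_000 - 1024) / 256 + 1 = 184 frames for the 1 s signal above
let expectedFrames = (samples.count - 1024) / 256 + 1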

Tensors (FP32): NumericTensor basics

NumericTensor is an N-dimensional dense array with explicit shape and row-major layout helpers. Many DSP/neural ops consume/produce tensors.

// Rank-1
let v = try NumericTensor(contiguousRowMajorShape: [8], elements: (0..<8).map { Float($0) })

// Rank-4 NCHW: [batch, channels, height, width]
let n = 4, c = 3, h = 32, w = 32
let x = try NumericTensor(
    contiguousRowMajorShape: [n, c, h, w],
    elements: Array(repeating: 0.01, count: n * c * h * w)
)

// Read back scalars (requires contiguous row-major)
let first10 = try x.makeContiguousRowMajor().withFloatBuffer { buf in
    Array(buf.prefix(10))
}
_ = first10

Tensor FFT: forward/inverse 2D FFT (rank-2)

Input must be a rank-2 real tensor with both dimensions powers of two. The output spectrum is complex-interleaved along the last axis.

let H = 64, W = 64
let img = try NumericTensor(
    contiguousRowMajorShape: [H, W],
    elements: Array(repeating: 1.0, count: H * W)
)

// Forward: [H, W, 2] (real/imag)
let spec = try await img.forwardFFT2D()

// Inverse: returns a real rank-2 tensor [H, W] (imaginary part discarded)
let back = try await spec.inverseFFT2D()
_ = back

Half precision (FP16): matrices & tensors

FP16 graph execution requires a GPU that satisfies MPSGraph FP16 support (e.g. Apple GPU family 7+ or compatible).

import Metal

let device = MTLCreateSystemDefaultDevice()!
guard MPSGraphFloat16Support.isSupported(on: device) else {
    fatalError("FP16 ops not supported on this GPU")
}

// Matrix FP32 → FP16 (validated against default device) → multiply → back to FP32
let X = try NumericMatrix(rows: 2, columns: 2, elements: [1, 2, 3, 4])
let X16 = try await X.convertedToFloat16()
let Y16 = try NumericMatrix16(rows: 2, columns: 2, elements: [1, 0, 0, 1])
let P16 = try await X16.multiplied(by: Y16)
let back32 = try P16.toFloat32Matrix()
_ = back32

Neural building blocks

Linear layer (matrix)

Input rows = batch size, columns = inFeatures. Use dataType: .float16 only when FP16 support passes.

// Random Xavier init (FP32)
let fc = try LinearLayer(inFeatures: 10, outFeatures: 4, bias: true, dataType: .float32)
let batch = try NumericMatrix(rows: 8, columns: 10, elements: Array(repeating: 0.1, count: 8 * 10))
let logits = try await fc.forward(batch)

// Fixed weights [inFeatures × outFeatures], bias [1 × outFeatures]
let W = try NumericMatrix(rows: 10, columns: 4, elements: Array(repeating: 0.01, count: 40))
let b = try NumericMatrix(rows: 1, columns: 4, elements: [0, 0, 0, 1])
let fc2 = try LinearLayer(weights: W, bias: b, inFeatures: 10, outFeatures: 4)
let logits2 = try await fc2.forward(batch)
_ = (logits, logits2)

Conv2D (tensor, NCHW)

Conv2DLayer consumes NCHW tensors shaped [N, inChannels, H, W]; weights are [outChannels, inChannels, kH, kW]. Kernels may be rectangular, and you can set (strideY, strideX) and (paddingHeight, paddingWidth), both applied symmetrically. Use NeuralSpatialMath (see the shape helpers below) when you need explicit per-edge shape math.

// FP32 NCHW convolution
let conv = try Conv2DLayer(
    inChannels: 3,
    outChannels: 16,
    kernelHeight: 3,
    kernelWidth: 5,
    stride: (1, 2),
    padding: (1, 2),
    bias: true,
    dataType: .float32
)

let x = try NumericTensor(
    contiguousRowMajorShape: [4, 3, 32, 32],
    elements: Array(repeating: 0.02, count: 4 * 3 * 32 * 32)
)

let y = try await conv.forward(x)
_ = y

Pooling (tensor, NCHW)

Pooling layers also operate on NCHW [N, C, H, W].

let maxPool = try MaxPooling2DLayer(kernelSize: 2, stride: 2, padding: 0)
let avgPool = try AveragePooling2DLayer(kernelSize: (3, 3), stride: (2, 2), padding: (1, 1))

let y1 = try await maxPool.forward(x)
let y2 = try await avgPool.forward(x)
_ = (y1, y2)

Activations (matrix)

let relu = ActivationLayer(type: .relu)
let sig = ActivationLayer(type: .sigmoid)
let sm = ActivationLayer(type: .softmax(axis: 1))

let h = try await relu.forward(logits)
let p = try await sm.forward(h)
_ = (sig, p)

Shape helpers (NCHW)

If you want to compute expected output shapes before running conv/pool:

let outShape = NeuralSpatialMath.conv2DOutputShapeNCHW(
    batch: 4,
    height: 32,
    width: 32,
    outChannels: 16,
    kernelHeight: 3,
    kernelWidth: 5,
    strideHeight: 1,
    strideWidth: 2,
    paddingTop: 1,
    paddingBottom: 1,
    paddingLeft: 2,
    paddingRight: 2
)
_ = outShape
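
Assuming the helper implements the standard shape rule, floor((in + padA + padB - kernel) / stride) + 1 per spatial axis, the call above works out to:

// height: (32 + 1 + 1 - 3) / 1 + 1 = 32
// width:  (32 + 2 + 2 - 5) / 2 + 1 = 16
// → expected NCHW output shape [4, 16, 32, 16]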

Core ML: MLMultiArray <-> NumericMatrix

Zero-copy out of NumericMatrix (keeps storage alive via deallocator):

import CoreML

let W = try NumericMatrix(rows: 3, columns: 3, elements: Array(repeating: 1, count: 9))
let arr = try W.asMLMultiArray()
// Use `arr` while the owning matrix is retained by the deallocator; avoid mutating `W` concurrently.

// Copy in from a rank-2 Float32 MLMultiArray
if let roundtrip = NumericMatrix(mlMultiArray: arr) {
    _ = roundtrip
}

Benchmark: CPU vs GPU GEMM sweep

Produces a structured report plus a formatted monospace table.

let report = try await GEMMBenchmarkRunner.run(
    sampleCount: 10,
    sizes: [256, 512, 1024],
    includeGpu: nil // default: uses SwiftMetalNumericsCapabilities.isMetalGPUAvailable
)
print(report.formattedTable)

// Or:
// try await GEMMBenchmarkRunner.runPrintingToStandardOut()

Low-level buffers (advanced)

Use these when you own raw memory and need to satisfy the makeBuffer(bytesNoCopy:) rules (pointer aligned to the page size, length a multiple of the page size).

import Metal

let device = MTLCreateSystemDefaultDevice()!
let (ptr, bytes) = try BufferManager.allocatePageAlignedByteBuffer(minimumByteCount: 4096)
defer { BufferManager.freeAligned(ptr) }

// Fill `ptr` with your payload, then:
let buffer = try BufferManager.makeSharedBufferNoCopy(
    device: device,
    bytes: ptr,
    length: bytes,
    deallocator: { p, _ in
        BufferManager.freeAligned(p)
    }
)
_ = buffer

Bridging an existing contiguous buffer (zero-copy only if already page-aligned and length is page-multiple):

var data = [Float](repeating: 1, count: 1024)
_ = try data.withUnsafeMutableBufferPointer { buf in
    try BufferManager.withBridgedSharedBuffer(device: device, contiguous: buf) { mtlBuffer in
        mtlBuffer.length
    }
}
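
To verify those zero-copy preconditions yourself, compare against the system page size (a sketch using the POSIX page size; per the note above, bridging is zero-copy only when both conditions hold):

import Darwin

let pageSize = Int(getpagesize())
data.withUnsafeBytes { raw in
    // Base address must sit on a page boundary, length must be a page multiple.
    let pageAligned = Int(bitPattern: raw.baseAddress) % pageSize == 0
    let pageMultiple = raw.count % pageSize == 0
    print("page-aligned:", pageAligned, "length page-multiple:", pageMultiple)
}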

License

This project is licensed under the MIT License. See LICENSE.
