A fast, modern library for GPU-accelerated linear algebra, DSP, and basic neural network layers on Apple platforms.
```mermaid
graph LR
    CPU[CPU<br>Accelerate / LAPACK] <-->|Zero-Copy| RAM[(Apple Silicon<br>Unified Memory)]
    GPU[GPU<br>Metal Performance Shaders] <-->|Zero-Copy| RAM
```
Built specifically for Apple Silicon, SwiftMetalNumerics takes advantage of unified memory to perform zero-copy operations. It uses Metal Performance Shaders (MPS) for maximum GPU performance and automatically falls back to CPU-optimized Accelerate / LAPACK paths when needed.
- Why use this?
- What's Inside
- Benchmarks
- Requirements
- Installation
- Usage examples
  - GPU availability
  - Matrices: create, fill, and read data
  - Matrix multiply, transpose, and Hadamard
  - Element-wise arithmetic and math functions
  - Reductions and trace
  - Linear systems, inverse, determinant, SVD, Cholesky
  - 2D convolution (single-channel matrix)
  - FFT: real, complex, inverse (1D signal)
  - STFT: magnitude or complex-interleaved spectrogram
  - Tensors (FP32): NumericTensor basics
  - Tensor FFT: forward/inverse 2D FFT (rank-2)
  - Half precision (FP16): matrices & tensors
  - Neural building blocks
  - Shape helpers (NCHW)
  - Core ML: MLMultiArray <-> NumericMatrix
  - Benchmark: CPU vs GPU GEMM sweep
  - Low-level buffers (advanced)
- License
- Apple Silicon Optimized: Native zero-copy data binding. CPU and GPU share the same physical memory pages, meaning no expensive data transfers before doing math.
- Async & Safe: Built from the ground up for Swift Concurrency. Heavy math operations are wrapped in simple `async`/`await` calls and pushed to background threads so your UI never freezes. Everything is `Sendable`.
- Core ML Friendly: Seamlessly convert between our matrices and `MLMultiArray` without copying the underlying memory.
- Batteries Included: Supports FP32 and FP16 data types (FP16 requires compatible GPU support for MPSGraph).
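The async workflow the first bullet describes looks like this in miniature — a sketch built from the matrix APIs shown later in this README:

```swift
import SwiftMetalNumerics

// All heavy ops are `async throws`; call them from a Task or any async context.
Task {
    do {
        let a = try NumericMatrix(rows: 2, columns: 2, elements: [1, 2, 3, 4])
        let b = try NumericMatrix.identity(size: 2)
        // Runs off the main thread; the UI stays responsive.
        let c = try await a.multiplied(by: b)
        print(c)
    } catch {
        print("Math failed: \(error)")
    }
}
```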
| Operation | Details |
|---|---|
| Basic Math | Matrix Multiply (A × B), Hadamard (.*), Trace, Transpose |
| Element-wise | sqrt, exp, log, sin, min, max, mean |
| Advanced Ops | Determinant, Inverse, Solve (Ax = b), SVD, Cholesky |
| Feature | Details |
|---|---|
| FFT (Signal) | Forward/Inverse 1D FFT for real & complex signals (power-of-two length) |
| STFT | STFT over real signals → magnitude or complex-interleaved spectrogram |
| FFT (Tensor) | Forward/Inverse 2D FFT on rank-2 tensors (power-of-two dims) |
| Convolution (Matrix) | 2D single-channel image convolution (NumericMatrix.convolved) |
| Layers (Matrix) | LinearLayer on [batch × features] matrices |
| Layers (Tensor / NCHW) | Conv2DLayer, MaxPooling2DLayer, AveragePooling2DLayer on 4D NCHW tensors |
| Activations | ReLU, Sigmoid, Softmax |
All benchmarks were run on an iPhone 15 Pro (iPhone16,1 / A17 Pro).
The GPU implementation of FFT demonstrates massive performance gains over CPU (vDSP).
| Operation | Workload | GPU Time (MPSGraph) | CPU Time (vDSP) |
|---|---|---|---|
| 1D FFT | 1,048,576 samples (2^20) | 0.0188 s | 0.2646 s |
| 2D FFT | 2048×2048 real | 11.65 s (Forward) | - |
| STFT | 10M samples, frame 2048, hop 256 | 92.50 s (~39k frames) | - |
GPU (Metal) vs CPU (Accelerate) for n × n matrices.
Engineering Note: For sizes under 8192, CPU (Accelerate) outperforms GPU. This is expected on Apple Silicon, as Accelerate uses the undocumented AMX (Apple Matrix Coprocessor) which has near-zero dispatch overhead compared to Metal. GPU paths shine in batched or complex ML pipelines where memory stays strictly on the GPU.
| Size (n×n) | CPU FP32 (s) | GPU FP32 (s) | GPU FP16 (s) |
|---|---|---|---|
| 1024 | 0.0064 | 0.0153 | 0.2338 |
| 2048 | 0.0313 | 0.0361 | 0.9391 |
| 4096 | 0.1936 | 0.2109 | 3.8201 |
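One way to act on that crossover — hedged, since the exact threshold depends on chip and workload, and this assumes `NumericMatrix` exposes its `rows` — is to route small GEMMs to the CPU path and large ones to the GPU, using the two entry points documented in the usage examples below:

```swift
import SwiftMetalNumerics

/// Multiply two matrices, picking the backend by problem size.
/// `cpuGemmThreshold` is an illustrative cutoff, not a library constant.
func adaptiveMultiply(
    _ a: NumericMatrix,
    _ b: NumericMatrix,
    cpuGemmThreshold: Int = 8192
) async throws -> NumericMatrix {
    if a.rows < cpuGemmThreshold {
        // Accelerate (AMX-backed) wins at small sizes: near-zero dispatch overhead.
        return try await a.multipliedUsingAccelerateCPU(by: b)
    }
    // Large or batched work amortizes the Metal dispatch cost.
    return try await a.multiplied(by: b)
}
```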
End-to-end pipeline: Conv2D → ReLU → MaxPool
| Precision | Wall Time (s) | Workload / Notes |
|---|---|---|
| FP32 | 2.2100 | ~4.37 GFLOPS (conv-dom. est.) |
| FP16 | 2.0962 | Native FP16 (no FP32 casts) |
Speedup FP32 to FP16: 1.05×
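The benchmarked pipeline can be expressed by chaining the layers documented later in this README. A sketch — shapes and init values are illustrative, and applying `ActivationLayer` to a tensor is an assumption (the usage section shows it on matrices):

```swift
import SwiftMetalNumerics

Task {
    // NCHW input: [batch, channels, height, width]
    let x = try NumericTensor(
        contiguousRowMajorShape: [1, 3, 64, 64],
        elements: Array(repeating: 0.5, count: 1 * 3 * 64 * 64)
    )
    let conv = try Conv2DLayer(
        inChannels: 3, outChannels: 16,
        kernelHeight: 3, kernelWidth: 3,
        stride: (1, 1), padding: (1, 1),
        bias: true, dataType: .float32
    )
    let relu = ActivationLayer(type: .relu)
    let pool = try MaxPooling2DLayer(kernelSize: 2, stride: 2, padding: 0)

    // Conv2D → ReLU → MaxPool
    let h = try await conv.forward(x)
    let a = try await relu.forward(h) // assumed tensor overload
    let y = try await pool.forward(a)
    _ = y
}
```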
These operations use Accelerate/LAPACK CPU paths natively. Workload intensity uses classical flop models.
| Operation | Time (s) | Est. GFLOPS | Notes |
|---|---|---|---|
| Symmetric Eigen | 0.8651 | 6.62 | λ₀…λ₂: 2, 2, 2 |
| SVD | 1.0538 | 4.08 | Thin SVD (LAPACK) |
| QR | 0.5376 | 2.66 | sgeqrf / sorgqr |
| Cholesky | 0.2545 | 1.41 | SPD tridiagonal band |
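The Est. GFLOPS column divides a classical flop model by wall time. For example, Cholesky of an n × n matrix costs about n³/3 flops and GEMM about 2n³, so the estimate is flops / time / 10⁹. A plain-Swift sketch (the flop constants are the textbook models, not measured values):

```swift
import Foundation

/// Classical flop-count models for dense n×n operations.
func choleskyFlops(n: Double) -> Double { n * n * n / 3 }
func gemmFlops(n: Double) -> Double { 2 * n * n * n }

/// Convert a flop count and wall time into GFLOPS.
func gflops(flops: Double, seconds: Double) -> Double {
    flops / seconds / 1e9
}

// Example: the 2048×2048 CPU FP32 GEMM above (0.0313 s)
let g = gflops(flops: gemmFlops(n: 2048), seconds: 0.0313)
// ≈ 549 GFLOPS
```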
- Swift 6.2
- macOS 14 / iOS 17 / tvOS 17 or newer
- Xcode / SPM
You can integrate SwiftMetalNumerics into your project using Swift Package Manager.
Add the following to your Package.swift file's dependencies:
dependencies: [
.package(url: "https://github.com/acemoglu/SwiftMetalNumerics.git", branch: "main")
]

Or, in Xcode, go to File > Add Package Dependencies... and enter the repository URL.
Almost all numerical operations are async throws and should be called from an async context (e.g. Task { ... } or an async function). Import the module once:
import SwiftMetalNumerics

Optional: branch on Metal before assuming GPU-backed paths (timing, UI, tests).
if SwiftMetalNumericsCapabilities.isMetalGPUAvailable {
// GPU paths are preferred; library still falls back where implemented
}

Row-major storage: element index for row i, column j is i * columns + j.
// From explicit elements (row-major [Float])
let A = try NumericMatrix(rows: 2, columns: 3, elements: [1, 2, 3, 4, 5, 6])
// Zero-filled
let Z = try NumericMatrix(rows: 4, columns: 4)
// Identity and diagonal helpers
let I = try NumericMatrix.identity(size: 5)
let D = try NumericMatrix.diagonal(values: [1, 2, 3])
// Read-only / mutable access to scalars without extra copies of the backing store
let sum = A.withFloatBuffer { buf in
buf.reduce(0, +)
}
var B = A
try B.withMutableFloatBuffer { ptr in
if let base = ptr.baseAddress { base[0] = 99 }
}

A * B is matrix multiplication when both operands are NumericMatrix. Element-wise product uses hadamard or .*.
let M = try NumericMatrix(rows: 2, columns: 2, elements: [1, 2, 3, 4])
let N = try NumericMatrix(rows: 2, columns: 2, elements: [5, 6, 7, 8])
let product = try await M.multiplied(by: N)
// same as: try await M * N
let Mt = try await M.transposed()
let H = try await M.hadamard(N)
// or: try await M .* N
// Scalar scaling: matrix * scalar / scalar * matrix (not matrix×matrix)
let scaled = try await M * 0.5
let scaled2 = try await 2.0 * M
// CPU-only GEMM (e.g. benchmarks vs GPU):
let cpu = try await M.multipliedUsingAccelerateCPU(by: N)

Shapes must match for +, -, and Hadamard.
let S = try await M.adding(N)
let D = try await M.subtracting(N)
// Operators: try await M + N, try await M - N
let E = try await M.scaled(by: 0.25)
let root = try await M.sqrt()
let e = try await M.exp()
let l = try await M.log()
let s = try await M.sin()

let mx = try await M.max()
let mn = try await M.min()
let mu = try await M.mean()
let t = try await I.trace() // square matrices only

These use LAPACK (Accelerate) on copies; they run off the caller’s executor via Task.detached.
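For the diagonal system used in the example below (A = 2·I), the expected solution is simply b / 2 — a plain-Swift sanity check of that arithmetic:

```swift
// A = 2·I, so A x = b  ⇒  x = b / 2
let b: [Float] = [1, 2, 3]
let expected = b.map { $0 / 2 }
// expected == [0.5, 1.0, 1.5]
```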
// Solve A x = b for square A
let A = try NumericMatrix(rows: 3, columns: 3, elements: [
2, 0, 0,
0, 2, 0,
0, 0, 2
])
let x = try await A.solve(b: [1, 2, 3])
let inv = try await A.inversed()
let det = try await A.determinant()
let (U, S, Vt) = try await A.svd()
let L = try await A.cholesky() // symmetric positive-definite A only

Kernel height and width must be odd and fit inside the single-channel image. GPU path uses MPSGraph; CPU fallback uses Accelerate (vDSP_imgfir) with the documented constraints. For stride/padding control in models, use Conv2DLayer (NCHW) below.
let height = 32
let width = 32
let image = try NumericMatrix(
rows: height,
columns: width,
elements: Array(repeating: 0.1, count: height * width)
)
let kernel = try NumericMatrix(rows: 3, columns: 3, elements: [
0, 0.25, 0,
0.25, 0, 0.25,
0, 0.25, 0
])
let out = try await image.convolved(with: kernel)

Length must be a power of two and at least 2.
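Since forwardFFT rejects non-power-of-two lengths, a quick guard can help before building the signal. A plain-Swift check (hypothetical helper, not a library API):

```swift
/// True when n is a power of two and at least 2 (the FFT length constraint).
func isValidFFTLength(_ n: Int) -> Bool {
    n >= 2 && n & (n - 1) == 0
}

// isValidFFTLength(512)  → true
// isValidFFTLength(100)  → false
```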
// Real input → complex spectrum
let n = 512
let real = (0..<n).map { Float($0) }
var signal = Signal1D(realSamples: real)
let spectrum = try await signal.forwardFFT()
// Inverse spectrum → time-domain signal
let reconstructed = try await spectrum.inverseFFT()
// General complex input
let complexSamples: [ComplexFloat] = (0..<n).map {
ComplexFloat(real: Float($0), imaginary: 0)
}
signal = Signal1D(samples: complexSamples)
let spec2 = try await signal.forwardFFT()
// CPU-only FFT (benchmarks / forcing vDSP):
let cpuSpectrum = try await Signal1D.forwardFFTReal1DUsingAccelerate(real)

frameSize must be a power of two and at least 2. hopSize must be positive.
let sr = 48_000
let seconds: Float = 1.0
let count = Int(Float(sr) * seconds)
let samples = (0..<count).map { i in
// 440 Hz sine
sin(2 * .pi * 440 * Float(i) / Float(sr))
}
// Magnitude spectrogram: shape [numFrames, frameSize]
let mag = try await STFT.compute(
signal: samples,
frameSize: 1024,
hopSize: 256,
window: .hann,
outputKind: .magnitude
)
// Complex interleaved: shape [numFrames, frameSize, 2] where [...,0]=real [...,1]=imag
let complex = try await STFT.compute(
signal: samples,
frameSize: 1024,
hopSize: 256,
window: .hann,
outputKind: .complexInterleaved
)

NumericTensor is an N-dimensional dense array with explicit shape and row-major layout helpers. Many DSP/neural ops consume/produce tensors.
// Rank-1
let v = try NumericTensor(contiguousRowMajorShape: [8], elements: (0..<8).map { Float($0) })
// Rank-4 NCHW: [batch, channels, height, width]
let n = 4, c = 3, h = 32, w = 32
let x = try NumericTensor(
contiguousRowMajorShape: [n, c, h, w],
elements: Array(repeating: 0.01, count: n * c * h * w)
)
// Read back scalars (requires contiguous row-major)
let first10 = try x.makeContiguousRowMajor().withFloatBuffer { buf in
Array(buf.prefix(10))
}
_ = first10

Input must be a rank-2 real tensor with both dimensions a power of two. Output spectrum is complex-interleaved in the last axis.
let H = 64, W = 64
let img = try NumericTensor(
contiguousRowMajorShape: [H, W],
elements: Array(repeating: 1.0, count: H * W)
)
// Forward: [H, W, 2] (real/imag)
let spec = try await img.forwardFFT2D()
// Inverse: returns real rank-2 tensor [H, W] (imag discarded)
let back = try await spec.inverseFFT2D()
_ = back

FP16 graph execution requires a GPU that satisfies MPSGraph FP16 support (e.g. Apple GPU family 7+ or compatible).
import Metal
let device = MTLCreateSystemDefaultDevice()!
guard MPSGraphFloat16Support.isSupported(on: device) else {
fatalError("FP16 ops not supported on this GPU")
}
// Matrix FP32 → FP16 (validated against default device) → multiply → back to FP32
let X = try NumericMatrix(rows: 2, columns: 2, elements: [1, 2, 3, 4])
let X16 = try await X.convertedToFloat16()
let Y16 = try NumericMatrix16(rows: 2, columns: 2, elements: [1, 0, 0, 1])
let P16 = try await X16.multiplied(by: Y16)
let back32 = try P16.toFloat32Matrix()
_ = back32

Input rows = batch size, columns = inFeatures. Use dataType: .float16 only when FP16 support passes.
// Random Xavier init (FP32)
let fc = try LinearLayer(inFeatures: 10, outFeatures: 4, bias: true, dataType: .float32)
let batch = try NumericMatrix(rows: 8, columns: 10, elements: Array(repeating: 0.1, count: 8 * 10))
let logits = try await fc.forward(batch)
// Fixed weights [inFeatures × outFeatures], bias [1 × outFeatures]
let W = try NumericMatrix(rows: 10, columns: 4, elements: Array(repeating: 0.01, count: 40))
let b = try NumericMatrix(rows: 1, columns: 4, elements: [0, 0, 0, 1])
let fc2 = try LinearLayer(weights: W, bias: b, inFeatures: 10, outFeatures: 4)
let logits2 = try await fc2.forward(batch)
_ = (logits, logits2)

Conv2DLayer consumes NCHW tensors shaped [N, inChannels, H, W]. Weights use [outChannels, inChannels, kH, kW]. You can configure rectangular kernels, (strideY, strideX) and (paddingHeight, paddingWidth) (applied symmetrically); use NeuralSpatialMath if you want explicit per-edge shape math.
// FP32 NCHW convolution
let conv = try Conv2DLayer(
inChannels: 3,
outChannels: 16,
kernelHeight: 3,
kernelWidth: 5,
stride: (1, 2),
padding: (1, 2),
bias: true,
dataType: .float32
)
let x = try NumericTensor(
contiguousRowMajorShape: [4, 3, 32, 32],
elements: Array(repeating: 0.02, count: 4 * 3 * 32 * 32)
)
let y = try await conv.forward(x)
_ = y

Pooling layers also operate on NCHW [N, C, H, W].
let maxPool = try MaxPooling2DLayer(kernelSize: 2, stride: 2, padding: 0)
let avgPool = try AveragePooling2DLayer(kernelSize: (3, 3), stride: (2, 2), padding: (1, 1))
let y1 = try await maxPool.forward(x)
let y2 = try await avgPool.forward(x)
_ = (y1, y2)

let relu = ActivationLayer(type: .relu)
let sig = ActivationLayer(type: .sigmoid)
let sm = ActivationLayer(type: .softmax(axis: 1))
let h = try await relu.forward(logits)
let p = try await sm.forward(h)
_ = (sig, p)

If you want to compute expected output shapes before running conv/pool:
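For reference, the shape helper below implements the standard convolution output-size arithmetic: out = floor((in + padBefore + padAfter − kernel) / stride) + 1. A plain-Swift version of that formula (hypothetical function, not the library API):

```swift
/// Standard conv/pool output extent along one spatial axis.
func convOutputExtent(input: Int, kernel: Int, stride: Int,
                      padBefore: Int, padAfter: Int) -> Int {
    (input + padBefore + padAfter - kernel) / stride + 1
}

// Matches the Conv2DLayer example: H 32, kH 3, strideY 1, padY 1 → 32
// and W 32, kW 5, strideX 2, padX 2 → 16
let outH = convOutputExtent(input: 32, kernel: 3, stride: 1, padBefore: 1, padAfter: 1) // 32
let outW = convOutputExtent(input: 32, kernel: 5, stride: 2, padBefore: 2, padAfter: 2) // 16
```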
let outShape = NeuralSpatialMath.conv2DOutputShapeNCHW(
batch: 4,
height: 32,
width: 32,
outChannels: 16,
kernelHeight: 3,
kernelWidth: 5,
strideHeight: 1,
strideWidth: 2,
paddingTop: 1,
paddingBottom: 1,
paddingLeft: 2,
paddingRight: 2
)
_ = outShape

Zero-copy out of NumericMatrix (keeps storage alive via deallocator):
import CoreML
let W = try NumericMatrix(rows: 3, columns: 3, elements: Array(repeating: 1, count: 9))
let arr = try W.asMLMultiArray()
// Use `arr` while the owning matrix is retained by the deallocator; avoid mutating `W` concurrently.
// Copy **in** from a rank-2 Float32 MLMultiArray
if let roundtrip = NumericMatrix(mlMultiArray: arr) {
_ = roundtrip
}

Produces a structured report plus a formatted monospace table.
let report = try await GEMMBenchmarkRunner.run(
sampleCount: 10,
sizes: [256, 512, 1024],
includeGpu: nil // default: uses SwiftMetalNumericsCapabilities.isMetalGPUAvailable
)
print(report.formattedTable)
// Or:
// try await GEMMBenchmarkRunner.runPrintingToStandardOut()

Use when you own raw memory and need makeBuffer(bytesNoCopy:) rules (pointer aligned to page size, length a multiple of page size).
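The length rule — rounded up to a page multiple — is plain integer math. A hedged sketch (on a real system query the page size via sysconf/vm_page_size; 16384 here is just Apple Silicon's typical page size):

```swift
/// Round byteCount up to the next multiple of pageSize
/// (pageSize must be a power of two).
func roundUpToPageMultiple(_ byteCount: Int, pageSize: Int = 16384) -> Int {
    (byteCount + pageSize - 1) & ~(pageSize - 1)
}

// roundUpToPageMultiple(1)      → 16384
// roundUpToPageMultiple(16384)  → 16384
// roundUpToPageMultiple(16385)  → 32768
```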
import Metal
let device = MTLCreateSystemDefaultDevice()!
let (ptr, bytes) = try BufferManager.allocatePageAlignedByteBuffer(minimumByteCount: 4096)
defer { BufferManager.freeAligned(ptr) }
// Fill `ptr` with your payload, then:
let buffer = try BufferManager.makeSharedBufferNoCopy(
device: device,
bytes: ptr,
length: bytes,
deallocator: { p, _ in
BufferManager.freeAligned(p)
}
)
_ = buffer

Bridging an existing contiguous buffer (zero-copy only if already page-aligned and length is page-multiple):
var data = [Float](repeating: 1, count: 1024)
_ = try data.withUnsafeMutableBufferPointer { buf in
try BufferManager.withBridgedSharedBuffer(device: device, contiguous: buf) { mtlBuffer in
mtlBuffer.length
}
}

This project is licensed under the MIT License. See LICENSE.