Fast speech recognition with NVIDIA's Parakeet models in pure C++.
Built on axiom — a lightweight tensor library with automatic Metal GPU acceleration. No ONNX runtime, no Python runtime, no heavyweight dependencies. Just C++ and one tensor library that outruns PyTorch MPS.
~27ms encoder inference on Apple Silicon GPU for 10s audio (110M model) — 96x faster than CPU. FP16 support for ~2x memory reduction.
| Model | Class | Size | Type | Description |
|---|---|---|---|---|
| `tdt-ctc-110m` | `ParakeetTDTCTC` | 110M | Offline | English, dual CTC/TDT decoder heads |
| `tdt-600m` | `ParakeetTDT` | 600M | Offline | Multilingual, TDT decoder |
| `eou-120m` | `ParakeetEOU` | 120M | Streaming | English, RNNT with end-of-utterance detection |
| `nemotron-600m` | `ParakeetNemotron` | 600M | Streaming | Multilingual, configurable latency (80ms–1120ms) |
| `sortformer` | `Sortformer` | 117M | Streaming | Speaker diarization (up to 4 speakers) |
| `diarized` | `DiarizedTranscriber` | 110M+117M | Offline | ASR + diarization → speaker-attributed words |

All ASR models share the same audio pipeline: 16kHz mono WAV → 80-bin Mel spectrogram → FastConformer encoder.
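
To get a feel for the sequence lengths this pipeline produces, the sketch below estimates frame counts for a clip. The 10 ms Mel hop is an assumption (a common front-end setting, not something this README states); the 8x factor is the encoder's Conv2d subsampling. The function and constant names are hypothetical and are not part of the parakeet API.

```cpp
#include <cstddef>

// Assumed front-end parameters (illustrative; not taken from the parakeet source).
constexpr std::size_t kSampleRate = 16000;  // 16 kHz mono input
constexpr std::size_t kHopSamples = 160;    // assumed 10 ms hop between Mel frames
constexpr std::size_t kSubsampling = 8;     // 8x Conv2d subsampling in the encoder

// Approximate number of Mel frames for a clip of `num_samples` samples.
constexpr std::size_t mel_frames(std::size_t num_samples) {
    return num_samples / kHopSamples + 1;
}

// Approximate number of encoder output frames after subsampling (rounded up).
constexpr std::size_t encoder_frames(std::size_t num_samples) {
    return (mel_frames(num_samples) + kSubsampling - 1) / kSubsampling;
}
```

Under these assumptions, 10 s of audio (160,000 samples) becomes roughly 1,000 Mel frames and about 126 encoder frames, so each encoder frame covers about 80 ms of audio.
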

    #include <iostream>
    #include <parakeet/parakeet.hpp>

    int main() {
        parakeet::Transcriber t("model.safetensors", "vocab.txt");
        t.to_gpu();   // optional: Metal acceleration
        t.to_half();  // optional: FP16 inference (~2x memory reduction)
        auto result = t.transcribe("audio.wav");
        std::cout << result.text << std::endl;
    }

- Multiple decoders — CTC greedy, TDT greedy, CTC beam search, TDT beam search (switch at call site)
- Word timestamps — Per-word start/end times and confidence scores on all decoders
- Beam search + LM — CTC and TDT beam search with optional ARPA n-gram language model fusion
- Phrase boosting — Context biasing via token-level trie for domain-specific vocabulary
- Batch transcription — Multiple files in one batched encoder forward pass
- VAD preprocessing — Silero VAD strips silence before ASR; timestamps auto-remapped
- GPU acceleration — Metal via axiom's MPSGraph compiler (96x speedup on Apple Silicon)
- FP16 inference — Half-precision weights and compute (~2x memory reduction)
- Streaming — EOU and Nemotron models with chunked audio input
- Speaker diarization — Sortformer (up to 4 speakers), combinable with ASR for speaker-attributed words
- C API — Flat `extern "C"` FFI for Python, Swift, Go, Rust, and other languages
- Multi-format audio — WAV, FLAC, MP3, OGG with automatic resampling
See examples/ for code demonstrating each feature.
Prebuilt binaries are attached to each GitHub release for macOS arm64, macOS x86_64, and Linux x86_64. Download the tarball for your platform and extract:

    tar -xzf parakeet-v0.1.0-macos-arm64.tar.gz
    cd parakeet-v0.1.0-macos-arm64
    # On macOS, clear the Gatekeeper quarantine attribute first:
    xattr -dr com.apple.quarantine .
    ./bin/parakeet --help

The archive ships a self-contained `bin/parakeet` (and `bin/example-server`) plus `lib/libaxiom` with `@rpath`/`$ORIGIN` set so the binaries resolve their dependencies relative to the install directory; drop the directory anywhere. The C-API headers under `include/parakeet/` are included for embedders.

    git clone --recursive https://github.com/frikallo/parakeet.cpp
    cd parakeet.cpp
    make build
    make test

Requirements: C++20 (Clang 14+ or GCC 12+), CMake 3.20+, macOS 13+ for Metal GPU.

macOS: building requires the full Xcode install (not just Command Line Tools), because axiom compiles its Metal shaders with `xcrun metal` and `xcrun metallib`, which ship only with Xcode. If you just want to run parakeet, use the prebuilt tarball above; the `.metallib` is embedded into the shipped `libaxiom.dylib` and runs without any Xcode/CLT install on the user side.

    # Download from HuggingFace
    huggingface-cli download nvidia/parakeet-tdt_ctc-110m --include "*.nemo" --local-dir .
    # Convert to safetensors
    pip install safetensors torch
    python scripts/convert_nemo.py parakeet-tdt_ctc-110m.nemo -o model.safetensors

The converter supports all model types: `110m-tdt-ctc` (default), `600m-tdt`, `eou-120m`, `nemotron-600m`, `sortformer`.

    python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 600m-tdt

Silero VAD weights:

    python scripts/convert_silero_vad.py -o silero_vad_v5.safetensors

| Example | Description |
|---|---|
| basic | Simplest transcription (~20 lines) |
| timestamps | Word/token timestamps with confidence |
| beam-search | CTC/TDT beam search with optional ARPA LM |
| phrase-boost | Context biasing for domain vocabulary |
| batch | Batch transcription of multiple files |
| vad | Standalone VAD and ASR+VAD preprocessing |
| gpu | Metal GPU + FP16 with timing comparison |
| stream | EOU streaming transcription |
| nemotron | Nemotron streaming with latency modes |
| diarize | Sortformer speaker diarization |
| diarized-transcription | ASR + diarization combined |
| c-api | Pure C99 FFI usage |
| cli | Full CLI with all options |

After installing (`make install` or `cmake --install build`):

    find_package(Parakeet REQUIRED)
    target_link_libraries(myapp PRIVATE Parakeet::parakeet)

Alternatively, vendor the repository and add it as a subdirectory:

    add_subdirectory(third_party/parakeet.cpp)
    target_link_libraries(myapp PRIVATE Parakeet::parakeet)

Or compile directly with pkg-config:

    g++ -std=c++20 myapp.cpp $(pkg-config --cflags --libs parakeet) -o myapp

Built on a shared FastConformer encoder (Conv2d 8x subsampling → N Conformer blocks with relative positional attention):

| Model | Class | Decoder | Use case |
|---|---|---|---|
| CTC | `ParakeetCTC` | Greedy argmax or beam search (+LM) | Fast, English-only |
| RNNT | `ParakeetRNNT` | Autoregressive LSTM | Streaming capable |
| TDT | `ParakeetTDT` | LSTM + duration prediction, greedy or beam search (+LM) | Better accuracy than RNNT |
| TDT-CTC | `ParakeetTDTCTC` | Both TDT and CTC heads | Switch decoder at inference |

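
The greedy argmax path for CTC can be sketched in a few lines: take the best label per frame, collapse consecutive repeats, and drop blanks. This is a generic illustration of CTC greedy decoding rather than parakeet's own code; the function name, the nested-vector input layout, and the `blank_id` parameter are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Greedy CTC decoding over a [frames x vocab] matrix of per-frame scores.
// `blank_id` is the index of the CTC blank token (assumed; models differ).
std::vector<int> ctc_greedy_decode(const std::vector<std::vector<float>>& logits,
                                   int blank_id) {
    std::vector<int> tokens;
    int prev = blank_id;
    for (const auto& frame : logits) {
        if (frame.empty()) continue;
        // Argmax over the vocabulary for this frame.
        int best = 0;
        for (std::size_t v = 1; v < frame.size(); ++v) {
            if (frame[v] > frame[best]) best = static_cast<int>(v);
        }
        // Emit only when the label is not blank and differs from the previous frame.
        if (best != blank_id && best != prev) tokens.push_back(best);
        prev = best;
    }
    return tokens;
}
```

Because a blank frame resets the previous label, the per-frame sequence `1 1 0 1 2 2 0` (with blank `0`) decodes to `1 1 2`: the repeated `1`s on either side of the blank are kept as two separate tokens.
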
Built on a cache-aware streaming FastConformer encoder with causal convolutions and bounded-context attention:
| Model | Class | Decoder | Use case |
|---|---|---|---|
| EOU | `ParakeetEOU` | Streaming RNNT | End-of-utterance detection |
| Nemotron | `ParakeetNemotron` | Streaming TDT | Configurable latency streaming |

| Model | Class | Architecture | Use case |
|---|---|---|---|
| Sortformer | `Sortformer` | NEST encoder → Transformer → sigmoid | Speaker diarization (up to 4 speakers) |

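
A sigmoid output (rather than a softmax) lets several speakers be active in the same frame. The sketch below shows how one frame of per-speaker logits could be turned into activity flags; the 0.5 threshold, the `active_speakers` name, and the fixed-size array layout are assumptions for illustration, not the library's actual post-processing.

```cpp
#include <array>
#include <cmath>

constexpr int kMaxSpeakers = 4;  // Sortformer handles up to 4 speakers

// Convert one frame of per-speaker logits into independent activity flags.
std::array<bool, kMaxSpeakers> active_speakers(
    const std::array<float, kMaxSpeakers>& logits, float threshold = 0.5f) {
    std::array<bool, kMaxSpeakers> active{};
    for (int s = 0; s < kMaxSpeakers; ++s) {
        const float p = 1.0f / (1.0f + std::exp(-logits[s]));  // sigmoid
        active[s] = p >= threshold;  // each speaker is decided independently
    }
    return active;
}
```
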
Measured on Apple M3 16GB with simulated audio input (Tensor::randn). Times are per-encoder-forward-pass (Sortformer: full forward pass).
Encoder throughput — 10s audio:
| Model | Params | CPU (ms) | GPU (ms) | GPU Speedup |
|---|---|---|---|---|
| 110m (TDT-CTC) | 110M | 2,581 | 27 | 96x |
| tdt-600m | 600M | 10,779 | 520 | 21x |
| rnnt-600m | 600M | 10,648 | 1,468 | 7x |
| sortformer | 117M | 3,195 | 479 | 7x |
110m GPU scaling across audio lengths:
| Audio | CPU (ms) | GPU (ms) | RTF | Throughput |
|---|---|---|---|---|
| 1s | 262 | 24 | 0.024 | 41x |
| 5s | 1,222 | 26 | 0.005 | 190x |
| 10s | 2,581 | 27 | 0.003 | 370x |
| 30s | 10,061 | 32 | 0.001 | 935x |
| 60s | 26,559 | 72 | 0.001 | 833x |
GPU acceleration is powered by axiom's Metal graph compiler, which fuses the full encoder into optimized MPSGraph operations.

    make bench ARGS="--110m=models/model.safetensors --tdt-600m=models/tdt.safetensors"

- Confidence scores — Per-token and per-word confidence from token log-probs
- Phrase boosting — Token-level trie context biasing during decode
- Beam search — CTC prefix beam search and TDT time-synchronous beam search
- N-gram LM fusion — ARPA language models scored at word boundaries
- Multi-format audio — WAV, FLAC, MP3, OGG via dr_libs + stb_vorbis
- Automatic resampling — Windowed sinc interpolation (Kaiser, 16-tap)
- Load from memory — `read_audio(bytes, len)`, float/int16 buffers
- Audio duration query — Header-only duration without full decode
- Progress callbacks — Stage reporting for long files
- Streaming from raw PCM — Direct microphone buffer feeding
- Diarized transcription — ASR + Sortformer → speaker-attributed words
- VAD — Silero VAD v5, standalone + ASR preprocessing
- Batch inference — Padded multi-file encoder forward pass
- Long-form chunking — Overlapping windows for audio >30s
- Neural LM rescoring — N-best reranking with Transformer LM
- C API — Flat C interface for FFI from any language
- FP16 inference — Half-precision weights and compute
- Model quantization — INT8/INT4 for mobile deployment
- Hotword detection — Trigger phrase detection
- Speaker embeddings — Speaker verification from Sortformer/TitaNet
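
For the long-form chunking item, one way overlapping windows could be laid out is sketched here. The `Chunk` struct, the `make_chunks` function, and the window and overlap values used with it are illustrative assumptions, not the library's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Half-open sample range [begin, end) of one chunk.
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Split `total` samples into windows of `window` samples where consecutive
// windows share `overlap` samples. Returns no chunks unless overlap < window.
std::vector<Chunk> make_chunks(std::size_t total, std::size_t window,
                               std::size_t overlap) {
    std::vector<Chunk> chunks;
    if (total == 0 || window == 0 || overlap >= window) return chunks;
    const std::size_t step = window - overlap;
    for (std::size_t begin = 0; begin < total; begin += step) {
        const std::size_t end = std::min(begin + window, total);
        chunks.push_back({begin, end});
        if (end == total) break;  // last window reached the end of the audio
    }
    return chunks;
}
```

With 70 units of audio, a window of 30, and an overlap of 5, this yields three chunks: `[0, 30)`, `[25, 55)`, and `[50, 70)`. The overlapping regions give a decoder context on both sides of each boundary when the per-chunk transcripts are merged.
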
- Audio: 16kHz mono (WAV, FLAC, MP3, OGG — auto-detected and resampled)
- Offline models have ~4-5 minute audio length limits; use streaming models for longer audio
- GPU acceleration requires Apple Silicon with Metal support

License: MIT