voxtral_tts: enable CUDA backend with 4w quantization (Ampere + Blackwell pre-exported artifacts) #19093
Open
seyeong-han wants to merge 10 commits into pytorch:main from
Conversation
Add the in-progress Voxtral TTS export, runner, parity, and acceptance tooling so the work can be resumed on another machine without losing the current investigation state. Made-with: Cursor
…NNPACK Three bugs fixed: codec reshape order (P*T to T*P), flow-matching RNG (mt19937 to xorshift64+BoxMuller matching C ref), ALiBi slopes off-by-one. Adds --speaker for live PCM output, parakeet STT gate, quantization docs and benchmarks. Authored with Claude.
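For reference, a generic sketch of the xorshift64 + Box-Muller combination this commit names. The shift constants, seeding, and uniform mapping below are illustrative only and are not taken from the C reference the commit matches:

```python
import math

def xorshift64(state: int) -> int:
    # One step of the classic xorshift64 generator (13/7/17 variant), with 64-bit wraparound.
    state ^= (state << 13) & 0xFFFFFFFFFFFFFFFF
    state ^= state >> 7
    state ^= (state << 17) & 0xFFFFFFFFFFFFFFFF
    return state

def box_muller(u1: float, u2: float) -> float:
    # Map two uniforms in (0, 1) to one standard-normal sample.
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

state, samples = 42, []
for _ in range(4):
    state = xorshift64(state)
    u1 = ((state >> 11) + 1) / float((1 << 53) + 1)   # uniform in (0, 1), avoids log(0)
    state = xorshift64(state)
    u2 = (state >> 11) / float(1 << 53)
    samples.append(box_muller(u1, u2))
print(samples)
```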
…eal-time on A100)
Adds full CUDA AOTI support to voxtral_tts. Headlines on A100 80GB for
"Hello, how are you today?" with seed=42:
XNNPACK fp32 baseline: 15.3s wall clock, RTF 4.8x
CUDA fp32 + portable codec: 178s, RTF 51x (codec dominated on CPU)
CUDA 4w + CUDA codec: 3.7s, RTF 0.88x (sub-real-time)
The 4w-quant + full-CUDA pipeline matches the XNNPACK baseline on prefill
hidden state (cosine 0.999994), first-frame semantic argmax, and top-5 logits.
Suggested review order:
1. README.md, BENCHMARK.md, PROGRESS.md -- user-visible surface
2. model.py -- StaticKVCache + StandardSDPA + causal mask + conv-as-matmul codec
3. export_voxtral_tts.py -- --backend cuda + --qlinear 4w plumbing
4. voxtral_tts_runner.{h,cpp}, main.cpp -- bf16 staging via lm_input_is_bf16 metadata
5. CMakePresets.json -- voxtral-tts-cuda preset
6. test_cuda_parity.py -- 11 eager-parity gates (CUDA-required, skip otherwise)
7. run_cuda_e2e.sh -- one-shot pipeline script
Authored with Claude (Anthropic) assistance.
…3_5_moe layout)
Internal docs, parity tooling, and developer-only test scripts move to the voxtral-tts-dev branch. The PR now ships the same kind of files qwen3_5_moe exposes publicly: model.py, export script, runner, CMake, README.
Removed (kept on voxtral-tts-dev):
- BENCHMARK.md, PROGRESS.md
- voxtral_tts_vs_voxtral_realtime_manager_note.md
- mermaid_architecture_voxtral_tts_parity_gap.md
- parity.py, compare_parity_traces.py
- test_cuda_parity.py, test_eager_e2e.py, test_export_cli.py, test_parity.py, test_validation_contract.py, test_verify_codec_export.py, test_verify_export_parity.py
- transcribe_apple_speech.swift, transcribe_parakeet.py
- verify_codec_export.py, verify_export_parity.py, verify_xnnpack_transcript.py
Updated README and run_cuda_e2e.sh to drop links to the moved files. Authored with Claude (Anthropic) assistance.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19093
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
❌ 1 Cancelled Job, 3 Unrelated Failures as of commit a46f783 with merge base 2d53535:
- CANCELLED JOB -- the following job was cancelled; please retry.
- FLAKY -- the following job failed but was likely due to flakiness present on trunk.
- BROKEN TRUNK -- the following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
… layout
- Revert CLAUDE.md edit that slipped into the prior commit (out of scope).
- Add `voxtral_tts-cpu` and `voxtral_tts-cuda` Makefile targets following the same pattern voxtral_realtime / qwen3_5_moe use, including .PHONY + help-text entries. `make voxtral_tts-cuda` now builds parent ExecuTorch with CUDA + the runner in one step.
- Rewrite README.md to mirror qwen3_5_moe's layout: Overview, Prerequisites, Export (with options table), Build (one-line `make` command), Run (with options table), Troubleshooting. Drops the previous mixed Architecture/Quick-Start/Build/Run shape.
Authored with Claude (Anthropic) assistance.
ufmt + clang-format whitespace and import-ordering only. No semantic changes. Authored with Claude (Anthropic) assistance.
mergennachin approved these changes on Apr 24, 2026
Author
@pytorchbot label "release notes: examples"
…ntro and new Streaming section
Description
This PR brings the ExecuTorch CUDA AOTI backend to `examples/models/voxtral_tts`. The full pipeline (LM + flow head + codec) now runs on GPU. With `--qlinear 4w` and `--streaming`, end-to-end synthesis for a 24-token prompt on an RTX 5080 runs at RTF 0.31x -- over 3× real-time with 2.6 s time-to-first-audio.

Headlines
Streaming (`--streaming --speaker`) decouples generation latency from audio length: the first chunk arrives in ~2.6 s, then 2 s chunks emit continuously as the model decodes. On RTX 5080 (sm_120), a 24-token prompt synthesizes 10.3 s of audio while only blocking the speaker for 3.85 s of wall time -- 3.2× faster than real-time.

Numerical parity vs XNNPACK FP32 on seed=42:

What changed and why
The CUDA enablement broke into five challenges:

1. Causal mask was missing on CUDA. `MistralDecoder.forward` didn't build any attention mask, so CUDA's `triton.sdpa` attended over the full zero-initialized `[1, H_kv, max_seq_len, D]` cache. Fix: port `_build_causal_mask_bool` from `voxtral_realtime/model.py` and thread it through every layer (CUDA only -- the XNNPACK custom_sdpa infers the prefix length from `start_pos` internally). A generic mask sketch follows this list.
2. AOTI Triton SDPA only accepts bf16. The initial fix promoted the entire model to bf16, which degraded `semantic_head` and `predict_velocity` precision. Fix: isolate bf16 to `StaticKVCache` (BHSD bf16 buffers) + `StandardSDPA` (cast Q to bf16 before `triton.sdpa`, cast the result back to the input dtype). Model weights stay fp32.
3. Triton conv autotune has no choices for the codec's ConvTranspose1d shapes, and AOTI's CUDA runtime ships no `aoti_torch_cuda_convolution` shim. Fix: rewrite Conv1d / ConvTranspose1d as `unfold + matmul` / `matmul + Fold` (`_conv1d_as_matmul`, `_conv_transpose1d_as_matmul` in `model.py`); a minimal unfold + matmul sketch also follows this list. The math is bit-exact (eager parity max abs diff 5.5e-10 in fp32), and Triton's batched-matmul autotune found 20 valid kernel choices for the new path where the conv form had zero.
4. 4w quantization plumbing. `--qlinear 4w` on CUDA auto-promotes `--dtype` to bf16 and auto-sets `--qlinear-packing-format=tile_packed_to_4d` (required by `_weight_int4pack_mm`). `flow_head.input_projection` (3072×36) is auto-skipped because K=36 isn't divisible by group_size=32 -- caught via `skip_incompatible_shapes=True`.
5. Runner bf16 staging. model.pte exports an `lm_input_is_bf16` metadata int. The runner reads it at load time and switches its `from_blob(...)` calls to bf16 staging buffers when the model is bf16 (quantized exports). The default fp32 path stays untouched.

An adversarial review pass caught issues (1) and (2) before this PR went out.
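For item (1), here is a minimal, generic sketch of a boolean causal mask over a static KV cache. It illustrates the idea only; it is not the PR's `_build_causal_mask_bool`, whose exact signature isn't shown here:

```python
import torch

def causal_mask_bool(start_pos: int, seq_len: int, max_seq_len: int) -> torch.Tensor:
    # True = may attend, False = masked out.
    # Query i sits at absolute position start_pos + i and may only see cache
    # slots 0 .. start_pos + i; the untouched tail of the static cache stays masked.
    q_pos = start_pos + torch.arange(seq_len).unsqueeze(1)   # [seq_len, 1]
    k_pos = torch.arange(max_seq_len).unsqueeze(0)           # [1, max_seq_len]
    return k_pos <= q_pos                                     # [seq_len, max_seq_len]
```

And for item (3), a self-contained sketch of the standard im2col (unfold + matmul) equivalence for Conv1d -- again an independent illustration, not the PR's `_conv1d_as_matmul`:

```python
import torch
import torch.nn.functional as F

def conv1d_as_matmul(x, weight, bias=None, stride=1, padding=0):
    # x: [B, C_in, T], weight: [C_out, C_in, K] -- same layout F.conv1d expects.
    C_out, C_in, K = weight.shape
    # im2col: unfold wants a 4D input, so treat the sequence as a Tx1 "image".
    cols = F.unfold(x.unsqueeze(-1), kernel_size=(K, 1),
                    stride=(stride, 1), padding=(padding, 0))   # [B, C_in*K, L]
    out = weight.reshape(C_out, C_in * K) @ cols                # [B, C_out, L]
    if bias is not None:
        out = out + bias.view(1, C_out, 1)
    return out

x, w, b = torch.randn(1, 8, 64), torch.randn(16, 8, 5), torch.randn(16)
ref = F.conv1d(x, w, b, stride=2, padding=2)
alt = conv1d_as_matmul(x, w, b, stride=2, padding=2)
print((ref - alt).abs().max())   # only fp32 rounding noise expected
```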
How to validate
Pre-exported artifacts on HF Hub
Sub-real-time CUDA artifacts are distributed at `younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA` in per-arch subfolders (`sm80/`, `sm120/`) so users can skip the export step.
Streaming numbers: 24 text tokens → 10.3 s audio, measured on a warm Triton autotune cache.
Offline numbers: 7 text tokens ("Hello, how are you today?"), same warm-cache condition.
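A minimal sketch of pulling just one arch's subfolder from the repo above, assuming the `sm80/` / `sm120/` layout described and the standard `huggingface_hub` download API:

```python
from huggingface_hub import snapshot_download

# Grab only the Blackwell (sm_120) artifacts; swap the pattern to "sm80/*" on Ampere.
local_dir = snapshot_download(
    repo_id="younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA",
    allow_patterns=["sm120/*"],
)
print(local_dir)
```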
AOTI bakes pre-compiled cubins for the export-time arch into the `*.ptd`, so cubins aren't compatible across architectures -- running an `sm_80` blob on a Blackwell card fails with `CUDA driver error: invalid argument` on the first kernel launch. The README's "Pre-exported artifacts" section documents the per-arch download pattern and the WSL2 `LIBRARY_PATH` linker gotcha hit during the Blackwell re-export.
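Before running a downloaded blob, one quick way to check which subfolder matches the local GPU, assuming the `sm80`/`sm120` naming above:

```python
import torch

# (8, 0) on an Ampere A100 -> "sm80"; (12, 0) on a Blackwell RTX 5080 -> "sm120".
major, minor = torch.cuda.get_device_capability()
print(f"use the sm{major}{minor}/ artifacts for this GPU")
```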
Suggested review order

1. `README.md` -- user-visible surface (CUDA section + streaming perf + gotchas table + multi-arch HF artifact section)
2. `model.py` -- `StaticKVCache` (BHSD bf16) + `StandardSDPA` (bf16 cast in/out) + `_build_causal_mask_bool` + `_conv1d_as_matmul` / `_conv_transpose1d_as_matmul`
3. `export_voxtral_tts.py` -- `--backend cuda` + `--qlinear 4w` plumbing, helper extraction (`_apply_cuda_arg_defaults`, `_export_lm_pte`, `_export_codec_pte`)
4. `voxtral_tts_runner.{h,cpp}`, `main.cpp` -- `--data_path` / `--codec_data_path` flags, bf16 staging path gated on `lm_input_is_bf16` metadata
5. `CMakePresets.json` -- `voxtral-tts-cuda` preset
6. `run_cuda_e2e.sh` -- one-shot pipeline script