The unified, extensible audio & voice suite for ComfyUI. TTS · voice cloning · speech‑to‑text — many state‑of‑the‑art open models, one clean node graph.
ComfyUI has world‑class image and video tooling — but audio has been a patchwork of one‑off nodes, each dragging in its own conflicting dependencies and often breaking your install. ComfyUI-Voice fills that gap: a single, coherent suite where every speech model is a drop‑in adapter behind one set of nodes, and everything runs natively on the PyTorch you already have — no second environment, no dependency hell.
text ─▶ [Voice TTS] ─▶ AUDIO ─▶ [Save Audio] # synthesize
audio ─▶ [Voice ASR] ─▶ TRANSCRIPT + text # transcribe
ref ──▶ [Voice TTS] ─▶ AUDIO # zero‑shot voice clone
- 🧩 One node per task, many engines. Pick an engine from a dropdown — the node’s inputs adapt to what that engine declares it can do.
- 🛡️ Runs on your torch. Engines load natively on ComfyUI’s existing PyTorch/Transformers. No version pins fighting your install, no separate venvs for the supported set. (Conflicting models can opt into isolation — but none of the verified models need it.)
- 🌍 Multilingual & SOTA. Wraps leading open models for speech synthesis, zero‑shot voice cloning, and transcription across dozens of languages — plus text‑to‑music and text‑to‑SFX generation.
- 🔌 Add a model in one file. A new engine is a single self‑registering adapter — no central dispatch, no UI plumbing to touch.
- 🧵 Composable. Everything speaks ComfyUI’s native AUDIO type, so it chains with the built‑in Load/Save/Preview Audio nodes for free.
- ✅ Honest status. Models are marked verified only after a real round‑trip test on this codebase.
Verified (✅) = real inference tested end‑to‑end (synthesize → transcribe round‑trip) on the host stack. Experimental (🧪) = adapter shipped, deps not yet installed/validated — enable via the Voice Engine Info node’s install hint.
| Engine (id) | Capability | Languages | License | Status |
|---|---|---|---|---|
| MeloTTS (melotts_korean) | Fast preset TTS | Korean¹ | Apache‑2.0 / MIT | ✅ |
| CosyVoice 3.0 (cosyvoice3) | Zero‑shot voice clone + VC | zh · en · ja · ko · de · es · fr · it · ru | Apache‑2.0 | ✅ |
| Supertonic 3 (supertonic) | On‑device preset TTS (ONNX) | 30+ incl. ko | code MIT / model OpenRAIL‑M | ✅ |
| Higgs Audio v3 (4B) (higgs_audio_v3) | Expressive TTS + zero‑shot clone | 100+ incl. ko·zh·ja | Research/Non‑Commercial | ✅ (eval) |
| Chatterbox (chatterbox) | Clone + emotion control | 23 langs | MIT | 🧪 |
| Qwen3‑TTS (qwen3_tts) | Clone + voice design | 10 langs | Apache‑2.0 | 🧪 |
| OuteTTS 1.0 (oute_tts) | Compact LLM‑TTS | 14 langs | Apache‑2.0 | 🧪 |
| Engine (id) | Capability | Languages | License | Status |
|---|---|---|---|---|
| faster‑whisper (faster_whisper) | Fast ASR + word timestamps | 99 langs | MIT | ✅ |
| Korean Whisper (korean_whisper) | ASR (ko fine‑tune) | Korean | Apache‑2.0 | 🧪 |
| Qwen3‑ASR (qwen3_asr) | ASR + forced‑aligner timestamps | 30+ langs | Apache‑2.0 | 🧪 |
| SenseVoice (sensevoice) | Very fast ASR + emotion/event tags | 5+ langs | custom | 🧪 |
| WhisperX (whisperx) | ASR + word alignment + diarization | whisper langs | BSD‑2 (+pyannote gated) | 🧪 |
¹ The shipped checkpoint is Korean; the MeloTTS model family is multilingual, so additional language adapters are easy drop‑ins.
| Engine (id) | Capability | Languages | License | Status |
|---|---|---|---|---|
| ACE‑Step 1.5 (ace_step) | Text‑to‑music (instrumental / song + lyrics) | 50+ incl. ko·zh·ja·en | Apache‑2.0 / MIT | ✅ |
| MOSS‑SoundEffect v2.0 (moss_soundeffect) | Text‑to‑sound‑effect / Foley | en prompts | Apache‑2.0 | ✅ |
² Both run on the host torch and output 48 kHz AUDIO. ACE‑Step reuses ComfyUI core's native support; MOSS‑SoundEffect's inference code is vendored (no descript‑audiotools, no extra torch).
Plus four dependency‑free reference engines (reference_tone, reference_asr,
reference_music, reference_sfx) that let you smoke‑test the whole pipeline on a
clean install and serve as the adapter template.
# 1) Clone into your ComfyUI custom_nodes
cd ComfyUI/custom_nodes
git clone https://github.com/Streamize-llc/ComfyUI-Voice
# 2) Install an engine's deps (example: the workhorses)
pip install faster-whisper # ASR
pip install librosa g2pkk jamo python-mecab-ko python-mecab-ko-dic num2words anyascii # MeloTTS frontend
Restart ComfyUI, then in the graph:
- Add Voice TTS 🎙️ → choose an engine → type text.
- Wire its AUDIO output into the core Save Audio (or Preview Audio).
- For STT, add Voice ASR (STT) 🎙️, feed it any AUDIO, read the transcript.
Not sure what’s installed? Drop a Voice Engine Info 🎙️ node — it lists every
engine, its capabilities, and a pip install … hint for anything not yet enabled.
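As a rough illustration of how such a report could be assembled, the sketch below checks whether an engine's probe modules are importable and, if not, builds a pip hint. The argument names probe_import and pip_install mirror the capability fields used in the adapter example in this README; the install_hint function and the sample values are hypothetical and are not the suite's actual API.

```python
from importlib.util import find_spec


def install_hint(probe_import, pip_install):
    """Return None if every probe module is importable, else a pip command.

    Hypothetical helper: it only looks up top-level module names and never
    imports the (potentially heavy) packages themselves.
    """
    missing = [name for name in probe_import if find_spec(name) is None]
    if not missing:
        return None
    return "pip install " + " ".join(pip_install)


# "json" ships with Python, so no hint is needed.
print(install_hint(("json",), ("json",)))
# A module that is presumably absent yields an install command.
print(install_hint(("no_such_voice_engine_pkg",), ("no-such-voice-engine-pkg",)))
```

Probing with find_spec rather than a real import keeps the check cheap, which matters when the report covers many engines at once.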
Some heavyweight engines (e.g. CosyVoice 3.0) need a one‑time model/code setup. Each adapter’s docstring in comfyui_voice/engines/ documents its exact steps.
| Node | Category | In → Out |
|---|---|---|
| Voice TTS | audio/voice/tts | text (+ optional VOICE_REF) → AUDIO |
| Voice ASR (STT) | audio/asr | AUDIO → VOICE_TRANSCRIPT + text |
| Voice Music Gen | audio/generate/music | text (+ duration/seed) → AUDIO |
| Voice SFX Gen | audio/generate/sfx | text (+ duration/seed) → AUDIO |
| Voice Engine Info | audio/voice/util | (none) → engine/capability report |
Voice cloning is just wiring a reference clip (via core Load Audio) into the
TTS node’s voice_ref input — no special upload step.
The design borrows ComfyUI core’s own philosophy — reimplement/port model code to run on one torch rather than pip‑installing a stack per model.
- Composable typed sockets. Core AUDIO ({"waveform": [B,C,T], "sample_rate"}) plus a small set of namespaced VOICE_* types (VOICE_REF, VOICE_TRANSCRIPT, VOICE_STEMS, …) make nodes interoperable and future‑proof.
- Capability‑driven. Each engine declares one EngineCapabilities dataclass (languages, cloning, sample rate, isolation, license, param schema). The form, validation, and tool surface are generated from it, with no if engine == … branching anywhere.
- Two adapter tiers.
  - Native (inproc, preferred): runs on the host torch, either via Transformers (the architecture lib already present) or by vendoring the upstream inference code with small pure‑python deps + load‑time shims. All verified models use this.
  - Isolated (subprocess): for engines whose pinned deps genuinely conflict. They run in a per‑engine venv behind a uniform worker protocol. Opt‑in via a single isolation="subprocess" field.
- Suite‑owned model manager tracks raw model VRAM (which ComfyUI can’t see) and evicts across nodes; a sample‑rate guard prevents silent corruption when chaining.
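To make the sample‑rate guard concrete, here is a minimal sketch that operates on the AUDIO dict layout described above. The function name ensure_sample_rate and the error wording are assumptions for illustration only; the suite's real guard may well resample instead of raising.

```python
def ensure_sample_rate(audio, expected_sr):
    """Raise if an AUDIO dict's sample rate differs from what a consumer expects.

    `audio` follows the {"waveform": ..., "sample_rate": int} layout; the
    waveform itself is not inspected here.
    """
    actual = audio["sample_rate"]
    if actual != expected_sr:
        raise ValueError(
            f"sample rate mismatch: got {actual} Hz, expected {expected_sr} Hz; "
            "resample before chaining"
        )
    return audio


# Stand-in waveform (nested lists shaped [B, C, T]) so the example needs no torch.
clip = {"waveform": [[[0.0, 0.1, -0.1]]], "sample_rate": 24000}
ensure_sample_rate(clip, 24000)      # matching rates pass silently
try:
    ensure_sample_rate(clip, 48000)  # a 48 kHz consumer would misplay 24 kHz audio
except ValueError as err:
    print(err)
```

Failing loudly at the socket boundary is one possible design; it trades convenience for the guarantee that audio is never played back at the wrong speed or pitch.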
Copy comfyui_voice/engines/tts/_reference_tone.py and fill it in:
# Import register_engine, BaseEngine, and EngineCapabilities the same way
# _reference_tone.py does.
@register_engine("my_tts")
class MyTTS(BaseEngine):
    CAPS = EngineCapabilities(
        id="my_tts", display_name="My TTS",
        tasks=("tts",), license="Apache-2.0", commercial_safe=True,
        supports_cloning=True, languages=("en", "ko"),
        sample_rate=24000, isolation="inproc",  # or "subprocess" if pins conflict
        pip_install=("my-model",), probe_import=("my_model",),
        param_schema={"params": {...}},
    )

    def load(self):
        import my_model  # lazy-import heavy deps HERE
        self.m = my_model.load(...)

    def generate(self, task, req):
        wav, sr = self.m.tts(req["text"], ...)
        return {"waveform": wav, "sample_rate": sr}

    def unload(self):
        del self.m

That’s the whole contract: it appears in the dropdown automatically, the form
adapts to its param_schema, and validation honors its capabilities. No core
files to edit.
Before adding deps to the main env, run pip install --dry-run and confirm it doesn’t move ComfyUI’s torch/transformers/numpy. If it would, declare isolation="subprocess".
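This check can be partly automated. The sketch below assumes a pip recent enough to support --dry-run together with --report (roughly pip 22.2 or later) and inspects the JSON installation report for packages the host depends on. The helper names and the PROTECTED set are illustrative assumptions; adjust them to your install.

```python
import json
import subprocess
import sys

PROTECTED = {"torch", "transformers", "numpy"}  # packages ComfyUI relies on


def touched_protected(report, protected=PROTECTED):
    """Return the protected packages that pip's report says it would install."""
    names = {item["metadata"]["name"].lower() for item in report.get("install", [])}
    return sorted(names & protected)


def dry_run_report(requirement):
    """Run `pip install --dry-run` and return the parsed JSON report."""
    cmd = [sys.executable, "-m", "pip", "install", "--dry-run",
           "--quiet", "--report", "-", requirement]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


# Offline example with a hand-written report fragment (not real pip output).
sample = {"install": [{"metadata": {"name": "numpy", "version": "0.0.0"}},
                      {"metadata": {"name": "my-model", "version": "0.0.0"}}]}
print(touched_protected(sample))  # ['numpy']
```

In practice you would call touched_protected(dry_run_report("my-model")) inside ComfyUI's Python environment. A non‑empty result suggests declaring isolation="subprocess" for that engine.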
- Core framework, typed sockets, capability registry, isolation runtime
- TTS + ASR nodes; verified native TTS (preset & zero‑shot clone) and ASR
- More verified engines (Qwen3‑TTS, Chatterbox, …)
- VOICE_REF cloning UX + voice library
- Voice conversion (RVC / Seed‑VC)
- Music / SFX generation (ACE‑Step 1.5 · MOSS‑SoundEffect v2.0)
- Source separation, denoise/enhance, forced alignment & subtitles
- Audio editing / inpainting
PRs and new engine adapters are very welcome. A good contribution:
- Is a single adapter file under comfyui_voice/engines/<task>/.
- Passes the pure‑core tests: python tests/test_core.py (no models needed).
- Keeps the host stack intact (pip install --dry-run clean, or subprocess).
- Declares an accurate license + commercial_safe flag.
ComfyUI-Voice is released under the Apache‑2.0 license. Each wrapped model keeps its own license; see the engine tables above and the per‑engine adapter. Some are non‑commercial or use‑restricted. You are responsible for complying with the license of any model you enable.
Built on the shoulders of the open‑source speech & audio community — MeloTTS, CosyVoice, Supertonic, MMS, OpenAI Whisper / faster‑whisper, Kokoro, Chatterbox, Qwen, OuteTTS, SenseVoice, WhisperX, ACE‑Step, MOSS‑SoundEffect (OpenMOSS), and ComfyUI itself. Thank you. 🙏