ComfyUI-Voice 🎙️

The unified, extensible audio & voice suite for ComfyUI. TTS · voice cloning · speech‑to‑text — many state‑of‑the‑art open models, one clean node graph.

English · 한국어 · 中文 · 日本語

ComfyUI has world‑class image and video tooling — but audio has been a patchwork of one‑off nodes, each dragging in its own conflicting dependencies and often breaking your install. ComfyUI-Voice fills that gap: a single, coherent suite where every speech model is a drop‑in adapter behind one set of nodes, and everything runs natively on the PyTorch you already have — no second environment, no dependency hell.

text ─▶ [Voice TTS] ─▶ AUDIO ─▶ [Save Audio]            # synthesize
audio ─▶ [Voice ASR] ─▶ TRANSCRIPT + text               # transcribe
ref ──▶ [Voice TTS] ─▶ AUDIO                             # zero‑shot voice clone

✨ Highlights

🧩 One node per task, many engines. Pick an engine from a dropdown — the node’s inputs adapt to what that engine declares it can do.
🛡️ Runs on your torch. Engines load natively on ComfyUI’s existing PyTorch/Transformers. No version pins fighting your install, no separate venvs for the supported set. (Conflicting models can opt into isolation — but none of the verified models need it.)
🌍 Multilingual & SOTA. Wraps leading open models for speech synthesis, zero‑shot voice cloning, and transcription across dozens of languages — plus text‑to‑music and text‑to‑SFX generation.
🔌 Add a model in one file. A new engine is a single self‑registering adapter — no central dispatch, no UI plumbing to touch.
🧵 Composable. Everything speaks ComfyUI’s native AUDIO type, so it chains with the built‑in Load/Save/Preview Audio nodes for free.
✅ Honest status. Models are marked verified only after a real round‑trip test on this codebase.

🎧 Supported engines

Verified (✅) = real inference tested end‑to‑end (synthesize → transcribe round‑trip) on the host stack. Experimental (🧪) = adapter shipped, deps not yet installed/validated — enable via the Voice Engine Info node’s install hint.

Text‑to‑Speech

Engine (`id`)	Capability	Languages	License	Status
MeloTTS (`melotts_korean`)	Fast preset TTS	Korean¹	Apache‑2.0 / MIT	✅
CosyVoice 3.0 (`cosyvoice3`)	Zero‑shot voice clone + VC	zh · en · ja · ko · de · es · fr · it · ru	Apache‑2.0	✅
Supertonic 3 (`supertonic`)	On‑device preset TTS (ONNX)	30+ incl. ko	code MIT / model OpenRAIL‑M	✅
Higgs Audio v3 (4B) (`higgs_audio_v3`)	Expressive TTS + zero‑shot clone	100+ incl. ko·zh·ja	Research/Non‑Commercial ⚠️	✅ (eval)
Chatterbox (`chatterbox`)	Clone + emotion control	23 langs	MIT	🧪
Qwen3‑TTS (`qwen3_tts`)	Clone + voice design	10 langs	Apache‑2.0	🧪
OuteTTS 1.0 (`oute_tts`)	Compact LLM‑TTS	14 langs	Apache‑2.0	🧪

Speech‑to‑Text

Engine (`id`)	Capability	Languages	License	Status
faster‑whisper (`faster_whisper`)	Fast ASR + word timestamps	99 langs	MIT	✅
Korean Whisper (`korean_whisper`)	ASR (ko fine‑tune)	Korean	Apache‑2.0	🧪
Qwen3‑ASR (`qwen3_asr`)	ASR + forced‑aligner timestamps	30+ langs	Apache‑2.0	🧪
SenseVoice (`sensevoice`)	Very fast ASR + emotion/event tags	5+ langs	custom ⚠️	🧪
WhisperX (`whisperx`)	ASR + word alignment + diarization	whisper langs	BSD‑2 (+pyannote gated)	🧪

¹ The shipped checkpoint is Korean. The MeloTTS family is multilingual, and additional language adapters are easy drop‑ins.

Generative audio (music · SFX)

Engine (`id`)	Capability	Languages	License	Status
ACE‑Step 1.5 (`ace_step`)	Text‑to‑music (instrumental / song + lyrics)	50+ incl. ko·zh·ja·en	Apache‑2.0 / MIT	✅
MOSS‑SoundEffect v2.0 (`moss_soundeffect`)	Text‑to‑sound‑effect / Foley	en prompts	Apache‑2.0	✅

² Both engines run on the host torch and output 48 kHz AUDIO. ACE‑Step reuses ComfyUI core's native support; MOSS‑SoundEffect's inference code is vendored (no descript‑audiotools, no extra torch).

Plus four dependency‑free reference engines (reference_tone, reference_asr, reference_music, reference_sfx) that let you smoke‑test the whole pipeline on a clean install and serve as the adapter template.

🚀 Quick start

# 1) Clone into your ComfyUI custom_nodes
cd ComfyUI/custom_nodes
git clone https://github.com/Streamize-llc/ComfyUI-Voice

# 2) Install an engine's deps (example: the workhorses)
pip install faster-whisper          # ASR
pip install librosa g2pkk jamo python-mecab-ko python-mecab-ko-dic num2words anyascii  # MeloTTS frontend

Restart ComfyUI, then in the graph:

Add Voice TTS 🎙️ → choose an engine → type text.
Wire its AUDIO output into the core Save Audio (or Preview Audio).
For STT, add Voice ASR (STT) 🎙️, feed it any AUDIO, read the transcript.

Not sure what’s installed? Drop a Voice Engine Info 🎙️ node — it lists every engine, its capabilities, and a pip install … hint for anything not yet enabled.

Some heavyweight engines (e.g. CosyVoice 3.0) need a one‑time model/code setup. Each adapter’s docstring in comfyui_voice/engines/ documents its exact steps.

🧩 Nodes

Node	Category	In → Out
Voice TTS	`audio/voice/tts`	text (+ optional `VOICE_REF`) → `AUDIO`
Voice ASR (STT)	`audio/asr`	`AUDIO` → `VOICE_TRANSCRIPT` + text
Voice Music Gen	`audio/generate/music`	text (+ duration/seed) → `AUDIO`
Voice SFX Gen	`audio/generate/sfx`	text (+ duration/seed) → `AUDIO`
Voice Engine Info	`audio/voice/util`	— → engine/capability report

Voice cloning is just wiring a reference clip (via core Load Audio) into the TTS node’s voice_ref input — no special upload step.

🏗️ Architecture

The design borrows ComfyUI core’s own philosophy — reimplement/port model code to run on one torch rather than pip‑installing a stack per model.

Composable typed sockets. Core AUDIO ({"waveform": [B,C,T], "sample_rate"}) plus a small set of namespaced VOICE_* types (VOICE_REF, VOICE_TRANSCRIPT, VOICE_STEMS, …) make nodes interoperable and future‑proof.
Capability‑driven. Each engine declares one EngineCapabilities dataclass (languages, cloning, sample rate, isolation, license, param schema). The form, validation, and tool surface are generated from it — no if engine == … branching anywhere.
Two adapter tiers.
1. Native (inproc, preferred): runs on the host torch — via Transformers (the architecture lib already present) or by vendoring the upstream inference code with small pure‑python deps + load‑time shims. All verified models use this.
2. Isolated (subprocess): for engines whose pinned deps genuinely conflict — they run in a per‑engine venv behind a uniform worker protocol. Opt‑in via a single isolation="subprocess" field.
Suite‑owned model manager tracks raw model VRAM (which ComfyUI can’t see) and evicts across nodes; a sample‑rate guard prevents silent corruption when chaining.
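
As a rough illustration of the sample‑rate guard mentioned above, the sketch below refuses to hand an AUDIO‑style dict to a consumer that expects a different rate unless resampling is explicitly allowed. The names `check_sample_rate`, `SampleRateMismatch`, and `resample_linear` are assumptions made for this example, not the suite's actual API. The waveform is a flat list of floats rather than a `[B,C,T]` tensor so the example runs without torch.

```python
class SampleRateMismatch(ValueError):
    """Raised when audio arrives at a rate the consumer did not declare."""


def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (illustrative; no anti-alias filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate            # fractional index into the source
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def check_sample_rate(audio, expected_rate, allow_resample=False):
    """Return audio at expected_rate, or raise instead of silently mis-playing it."""
    actual = audio["sample_rate"]
    if actual == expected_rate:
        return audio
    if not allow_resample:
        raise SampleRateMismatch(
            f"got {actual} Hz, expected {expected_rate} Hz"
        )
    return {
        "waveform": resample_linear(audio["waveform"], actual, expected_rate),
        "sample_rate": expected_rate,
    }
```

Failing loudly by default is the point of such a guard: a 24 kHz clip interpreted as 48 kHz still plays, just at double speed and pitch, which is easy to miss in a long graph.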

➕ Add an engine in one file

Copy comfyui_voice/engines/tts/_reference_tone.py and fill it in:

@register_engine("my_tts")
class MyTTS(BaseEngine):
    CAPS = EngineCapabilities(
        id="my_tts", display_name="My TTS",
        tasks=("tts",), license="Apache-2.0", commercial_safe=True,
        supports_cloning=True, languages=("en", "ko"),
        sample_rate=24000, isolation="inproc",   # or "subprocess" if pins conflict
        pip_install=("my-model",), probe_import=("my_model",),
        param_schema={"params": {...}},
    )
    def load(self):
        import my_model                     # lazy-import heavy deps HERE
        self.m = my_model.load(...)
    def generate(self, task, req):
        wav, sr = self.m.tts(req["text"], ...)
        return {"waveform": wav, "sample_rate": sr}
    def unload(self):
        del self.m

That’s the whole contract — it appears in the dropdown automatically, the form adapts to its param_schema, and validation honors its capabilities. No core files to edit.
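
To make the "self‑registering" and "capability‑driven" ideas concrete, here is a minimal, self‑contained sketch of how a decorator‑based registry of this kind could work. `register_engine` and `EngineCapabilities` reuse the names from the snippet above, but their internals here are guesses. The reduced field set, `_REGISTRY`, `engine_choices`, `validate_request`, and `DemoTTS` are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

_REGISTRY = {}  # engine id -> engine class (hypothetical module-level registry)


@dataclass
class EngineCapabilities:
    # Reduced, assumed subset of the fields shown in the adapter example.
    id: str
    display_name: str
    tasks: tuple = ("tts",)
    supports_cloning: bool = False
    languages: tuple = ()
    param_schema: dict = field(default_factory=dict)


def register_engine(engine_id):
    """Class decorator: importing the adapter module is enough to register it."""
    def decorator(cls):
        if engine_id in _REGISTRY:
            raise ValueError(f"duplicate engine id: {engine_id}")
        _REGISTRY[engine_id] = cls
        return cls
    return decorator


def engine_choices(task):
    """Dropdown entries for a task, derived only from declared capabilities."""
    return sorted(eid for eid, cls in _REGISTRY.items() if task in cls.CAPS.tasks)


def validate_request(engine_id, req):
    """Check a request against the engine's declared capabilities.

    Returns a list of error strings; an empty list means the request is valid.
    Note there is no per-engine branching, only capability lookups.
    """
    caps = _REGISTRY[engine_id].CAPS
    errors = []
    if req.get("voice_ref") is not None and not caps.supports_cloning:
        errors.append(f"{caps.display_name} does not support voice cloning")
    lang = req.get("language")
    if lang is not None and caps.languages and lang not in caps.languages:
        errors.append(f"{caps.display_name} does not support language '{lang}'")
    return errors


@register_engine("demo_tts")
class DemoTTS:
    CAPS = EngineCapabilities(
        id="demo_tts", display_name="Demo TTS", languages=("en", "ko")
    )
```

With this shape, adding an engine never touches the node code: the node asks `engine_choices("tts")` for its dropdown and runs `validate_request` before calling the engine.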

Before adding deps to the main env, run pip install --dry-run and confirm it doesn’t move ComfyUI’s torch / transformers / numpy. If it would, declare isolation="subprocess".
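
The dry‑run check can be scripted. The sketch below parses the `Would install name-version ...` line that recent pip versions print for `pip install --dry-run` and reports any protected package that would be installed or replaced. The helper names and the `PROTECTED` set are assumptions for illustration; the exact output format may differ between pip versions, so treat this as a convenience check rather than a guarantee.

```python
import re
import subprocess
import sys

# Packages the host ComfyUI install should keep as-is (assumed set).
PROTECTED = {"torch", "transformers", "numpy"}


def normalize(name):
    """PEP 503 style normalization: lowercase, runs of -_. become one '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def packages_that_would_install(dry_run_output):
    """Map normalized package name -> version from pip's 'Would install' line."""
    found = {}
    prefix = "Would install "
    for line in dry_run_output.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        for token in line[len(prefix):].split():
            # Tokens look like 'name-version'; split on the last hyphen.
            name, _, version = token.rpartition("-")
            if name:
                found[normalize(name)] = version
    return found


def host_stack_conflicts(dry_run_output, protected=PROTECTED):
    """Return the protected packages that the install would touch."""
    installs = packages_that_would_install(dry_run_output)
    return {name: ver for name, ver in installs.items() if name in protected}


def pip_dry_run(requirement):
    """Run 'pip install --dry-run <requirement>' in the current interpreter's env."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "--dry-run", requirement],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

A non‑empty result from `host_stack_conflicts(pip_dry_run("my-model"))` would suggest declaring `isolation="subprocess"` for that engine instead of installing into the main environment.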

🗺️ Roadmap

Core framework, typed sockets, capability registry, isolation runtime
TTS + ASR nodes; verified native TTS (preset & zero‑shot clone) and ASR
More verified engines (Qwen3‑TTS, Chatterbox, …)
VOICE_REF cloning UX + voice library
Voice conversion (RVC / Seed‑VC)
Music / SFX generation (ACE‑Step 1.5 · MOSS‑SoundEffect v2.0)
Source separation, denoise/enhance, forced alignment & subtitles
Audio editing / inpainting

🤝 Contributing

PRs and new engine adapters are very welcome. A good contribution:

Is a single adapter file under comfyui_voice/engines/<task>/.
Passes the pure‑core tests: python tests/test_core.py (no models needed).
Keeps the host stack intact (pip install --dry-run clean, or subprocess).
Declares an accurate license + commercial_safe flag.

📜 License & credits

ComfyUI-Voice is released under the Apache‑2.0 license. Each wrapped model keeps its own license — see the table above and the per‑engine adapter; some are non‑commercial or use‑restricted. You are responsible for complying with the license of any model you enable.

Built on the shoulders of the open‑source speech & audio community — MeloTTS, CosyVoice, Supertonic, MMS, OpenAI Whisper / faster‑whisper, Kokoro, Chatterbox, Qwen, OuteTTS, SenseVoice, WhisperX, ACE‑Step, MOSS‑SoundEffect (OpenMOSS), and ComfyUI itself. Thank you. 🙏
