ComfyUI-VibeVoice

A custom node for ComfyUI that integrates Microsoft's VibeVoice, a frontier model for generating expressive, long-form, multi-speaker conversational audio.

About The Project

This project brings VibeVoice into ComfyUI's modular workflow. VibeVoice is a Microsoft framework for generating expressive, long-form, multi-speaker conversational audio. It produces natural-sounding dialogue, podcasts, and similar content, keeping voices consistent for up to 4 speakers.

The custom node handles everything from model downloading and memory management to audio processing, allowing you to generate high-quality speech directly from a text script and reference audio files.

✨ Key Features:

Multi-Speaker TTS: Generate conversations with up to 4 distinct voices in a single audio output.
Zero-Shot Voice Cloning: Use any audio file (.wav, .mp3) as a reference for a speaker's voice.
Advanced Attention Mechanisms: Choose between eager, sdpa, flash_attention_2, and the new high-performance sage attention for fine-grained control over speed, memory, and compatibility.
Robust 4-Bit Quantization: Run the large language model component in 4-bit mode to significantly reduce VRAM usage, with smart, stable configurations for all attention modes.
Automatic Model Management: Models are downloaded automatically and managed efficiently by ComfyUI to save VRAM.
Fine-Grained Control: Adjust parameters like CFG scale, temperature, and sampling methods to tune the performance and style of the generated speech.

🚀 Getting Started

The node can be installed via ComfyUI Manager: Find ComfyUI-VibeVoice and click "Install".

Alternatively, install it manually:

Clone the Repository: Navigate to your ComfyUI/custom_nodes/ directory and clone this repository:
```
git clone https://github.com/wildminder/ComfyUI-VibeVoice.git
```
Install Dependencies: Open a terminal or command prompt, navigate into the cloned directory, and install the required Python packages. For quantization support, ensure you install bitsandbytes.
```
cd ComfyUI-VibeVoice
pip install -r requirements.txt
```
Optional: Install SageAttention: To enable the sage attention mode, install the sageattention library in your ComfyUI Python environment. Windows users can refer to AI-windows-whl for the required package.

Note: This is only required if you intend to use the sage attention mode.
Start/Restart ComfyUI: Launch ComfyUI. The "VibeVoice TTS" node will appear under the audio/tts category. The first time you use the node, it will automatically download the selected model to your ComfyUI/models/tts/VibeVoice/ folder.
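
Before restarting ComfyUI, it can help to confirm that the optional packages from the steps above are importable in the same Python environment. The following is a minimal sketch using only the standard library. It assumes the import names match the package names `bitsandbytes` and `sageattention`; the helper name `check_optional_packages` is hypothetical and not part of this node.

```python
import importlib.util

# Assumption: the import names are the same as the package names mentioned above.
OPTIONAL_PACKAGES = {
    "bitsandbytes": "4-bit quantization",
    "sageattention": "sage attention mode",
}


def check_optional_packages(packages=OPTIONAL_PACKAGES):
    """Return {name: bool} telling whether each top-level package can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, available in check_optional_packages().items():
        status = "found" if available else "missing"
        print(f"{name}: {status} (needed for {OPTIONAL_PACKAGES[name]})")
```

Run it with the Python interpreter that ComfyUI uses; a package reported as missing only matters if you plan to use the corresponding feature.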

Models

| Model | Context Length | Generation Length | Weights |
| --- | --- | --- | --- |
| VibeVoice-1.5B | 64K | ~90 min | HF link |
| VibeVoice-Large | 32K | ~45 min | HF link |

🛠️ Usage

The node is designed to be intuitive within the ComfyUI workflow.

Add Nodes: Add the VibeVoice TTS node to your graph. Use ComfyUI's built-in Load Audio node to load your reference voice files.
Connect Voices: Connect the AUDIO output from each Load Audio node to the corresponding speaker_*_voice input on the VibeVoice TTS node.
Write Script: In the text input, write your dialogue. Assign lines to speakers using the format Speaker 1: ..., Speaker 2: ..., etc., on separate lines.
Generate: Queue the prompt. The node will process the script and generate a single audio file containing the full conversation.
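
The script format from the steps above can be checked before queuing a prompt. The following is a rough sketch of a validator for the `Speaker <number>: ...` convention; it is not the node's own parser, which may be more lenient. The function name `parse_script` and the strict one-turn-per-line rule are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Matches lines such as "Speaker 1: Hello there."
SPEAKER_LINE = re.compile(r"^\s*Speaker\s+(\d+)\s*:\s*(.*)$")


def parse_script(text: str, max_speakers: int = 4) -> List[Tuple[int, str]]:
    """Split a script into (speaker_number, line_text) turns.

    Blank lines are skipped. Any other line that does not start with
    "Speaker <number>:" raises ValueError (a strictness assumption).
    """
    turns = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        match = SPEAKER_LINE.match(line)
        if match is None:
            raise ValueError(f"line {line_no}: expected 'Speaker <number>: ...'")
        speaker = int(match.group(1))
        if not 1 <= speaker <= max_speakers:
            raise ValueError(f"line {line_no}: speaker must be 1..{max_speakers}")
        turns.append((speaker, match.group(2).strip()))
    return turns


if __name__ == "__main__":
    script = "Speaker 1: Welcome back.\nSpeaker 2: Glad to be here."
    for speaker, line in parse_script(script):
        print(speaker, line)
```

The default limit of 4 speakers mirrors the limit stated in the feature list.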

Tip: For a complete workflow, you can drag the example image from the example_workflows folder onto your ComfyUI canvas.

Node Inputs

model_name: Select the VibeVoice model to use (1.5B or Large).
quantize_llm_4bit: (Overhauled!) Enable to run the LLM component in 4-bit (NF4) mode. This dramatically reduces VRAM usage.
attention_mode: (New!) Select the attention implementation: eager (safest), sdpa (balanced), flash_attention_2 (fastest), or sage (quantized high-performance).
text: The conversational script. Lines must be prefixed with Speaker <number>: (e.g., Speaker 1:).
cfg_scale: Controls how strongly the model adheres to the reference voice's timbre.
inference_steps: Number of diffusion steps for the audio decoder.
seed: A seed for reproducibility.
do_sample, temperature, top_p, top_k: Standard sampling parameters for controlling the creativity and determinism of the speech generation.
force_offload: Forces the model to be completely offloaded from VRAM after generation.

⚙️ Performance & Advanced Features

This update introduces a sophisticated system for managing performance, memory, and stability. The node will automatically select the best configuration based on your choices.

Feature Compatibility & VRAM Matrix

| Quantize LLM | Attention Mode | Behavior / Notes | Relative VRAM |
| --- | --- | --- | --- |
| OFF | `eager` | Full Precision. Most compatible baseline. | High |
| OFF | `sdpa` | Full Precision. Recommended for balanced performance. | High |
| OFF | `flash_attention_2` | Full Precision. High performance on compatible GPUs. | High |
| OFF | `sage` | Full Precision. Uses high-performance mixed-precision kernels. | High |
| ON | `eager` | Falls back to `sdpa` with `bfloat16` compute. Warns user. | Low |
| ON | `sdpa` | Recommended for memory savings. Uses `bfloat16` compute. | Low |
| ON | `flash_attention_2` | Falls back to `sdpa` with `bfloat16` compute. Warns user. | Low |
| ON | `sage` | Recommended for stability. Uses `fp32` compute to ensure numerical stability with quantization, resulting in slightly higher VRAM usage. | Medium |
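
The matrix above can be read as a small decision function. The sketch below encodes only what the table states; it is not the node's actual implementation, and the names `ResolvedConfig` and `resolve_config` are hypothetical.

```python
from typing import NamedTuple, Optional

VALID_MODES = ("eager", "sdpa", "flash_attention_2", "sage")


class ResolvedConfig(NamedTuple):
    attention_mode: str           # mode that is effectively used
    compute_dtype: Optional[str]  # quantized compute dtype, None at full precision
    relative_vram: str            # "High", "Medium", or "Low", as in the table
    fell_back: bool               # True when the requested mode was replaced


def resolve_config(quantize_llm_4bit: bool, attention_mode: str) -> ResolvedConfig:
    """Map the two user choices to the behavior listed in the matrix."""
    if attention_mode not in VALID_MODES:
        raise ValueError(f"unknown attention mode: {attention_mode!r}")
    if not quantize_llm_4bit:
        # Quantization OFF: every mode runs at full precision.
        return ResolvedConfig(attention_mode, None, "High", False)
    if attention_mode == "sage":
        # Quantization ON with sage: fp32 compute for numerical stability.
        return ResolvedConfig("sage", "fp32", "Medium", False)
    # Quantization ON with eager, sdpa, or flash_attention_2: sdpa with bfloat16.
    return ResolvedConfig("sdpa", "bfloat16", "Low", attention_mode != "sdpa")


if __name__ == "__main__":
    for mode in VALID_MODES:
        print(mode, resolve_config(True, mode))
```

The `fell_back` flag corresponds to the rows where the table says the user is warned.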

A key feature of this node is optional 4-bit quantization of the language model component. It is recommended for users with memory-constrained GPUs (e.g., 16 GB of VRAM or less) who want to run the larger VibeVoice-Large-pt model.

Benefits of enabling quantize_llm_4bit:

| Model | Performance Impact | VRAM Savings |
| --- | --- | --- |
| VibeVoice-Large (7B) | ~8.5x faster inference | Saves >4.4 GB (over 36%) |
| VibeVoice-1.5B | ~1.5x slower inference | Saves ~5.5 GB (over 63%) |

As shown, quantization makes the 7B model roughly 8.5x faster and cuts its VRAM usage by more than a third, making it accessible on a wider range of hardware. It slows the 1.5B model by about 1.5x, but the VRAM savings may still be worthwhile in complex workflows.

Note: With 4-bit quantization enabled, eager and flash_attention_2 automatically fall back to sdpa.
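
For readers curious how the 4-bit (NF4) settings in this section might be expressed in code, here is a minimal sketch that builds keyword arguments named after the parameters of `BitsAndBytesConfig` in the Transformers library. Whether this node constructs its configuration this way is an assumption, and the helper name `nf4_quantization_kwargs` is hypothetical. Dtypes are kept as strings so the snippet runs without PyTorch installed.

```python
def nf4_quantization_kwargs(attention_mode: str) -> dict:
    """Build 4-bit NF4 settings as plain keyword arguments.

    Per the matrix in this section, sage uses fp32 compute and the
    other modes use bfloat16 compute when quantization is enabled.
    """
    compute_dtype = "float32" if attention_mode == "sage" else "bfloat16"
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": compute_dtype,  # dtype name, not a torch.dtype
    }


if __name__ == "__main__":
    print(nf4_quantization_kwargs("sdpa"))
    print(nf4_quantization_kwargs("sage"))
```

With PyTorch and Transformers installed, the dtype name could be converted with `getattr(torch, name)` before passing the dictionary to `BitsAndBytesConfig(**kwargs)`.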

Changelog

v1.3.0 - SageAttention & Quantization Overhaul

✨ New Features

SageAttention Support: Full integration with the sageattention library for a high-performance, mixed-precision attention option.
Robust 4-Bit LLM Quantization: The "Quantize LLM (4-bit)" option is now highly stable and delivers significant VRAM savings.
Smart Configuration & Fallbacks: The node now automatically handles incompatible settings (e.g., 4-bit with flash_attention_2) by gracefully falling back to a stable alternative (sdpa) and notifying the user.

🐛 Bug Fixes & Stability Improvements

Fixed SageAttention Crashes
Fixed Numerical Instability (NaN/Inf Errors)
Resolved All dtype Mismatches
Corrected SageAttention Kernel Assertions
Addressed Deprecation Warning

v1.2.0 - Compatibility Update

✅ Compatibility

Transformers Library: Includes automatic detection and compatibility for both older and newer versions of the Transformers library (pre- and post-4.56).

🐛 Bug Fixes

Force Offload: Resolved an AttributeError to ensure the force offload option works correctly with all versions of ComfyUI.
Multi-Speaker DynamicCache: Fixed a 'DynamicCache' object has no attribute 'key_cache' error when using multiple speakers with newer versions of the Transformers library.

Tips from the Original Authors

Punctuation: For Chinese text, using English punctuation (commas and periods) can improve stability.
Model Choice: The 7B model variant (VibeVoice-Large) is generally more stable.
Spontaneous Sounds/Music: The model may spontaneously generate background music, especially if the reference audio contains it or if the text includes introductory phrases like "Welcome to...". This is an emergent capability and cannot be directly controlled.
Singing: The model was not trained on singing data, but it may attempt to sing as an emergent behavior. Results may vary.

License

This project is distributed under the MIT License. See the LICENSE file for more information. The VibeVoice model and its components are subject to the licenses provided by Microsoft. Please use responsibly.
