ComfyUI VoxCPMTTS Node

A ComfyUI custom node for text-to-speech (TTS) with the VoxCPM model. It generates speech from text and can clone a voice from a reference audio sample.

Features

🎯 High-Quality TTS: Generate natural-sounding speech from text
🎭 Voice Cloning: Clone any voice using a reference audio sample
🔄 Auto-Transcription: Automatic speech recognition for reference audio
⚡ Multi-Device Support: CUDA, MPS, and CPU compatibility
🎛️ Fine-Tuned Control: Adjustable guidance scale, inference steps, and more
🔊 Audio Post-Processing: Built-in fade-in to reduce artifacts

Installation

Method 1: ComfyUI Manager (Recommended)

Open ComfyUI Manager
Search for "VoxCPMTTS"
Install the node

Method 2: Manual Installation

Navigate to your ComfyUI custom nodes directory:

cd ComfyUI/custom_nodes/

Clone this repository:

git clone https://github.com/1038lab/ComfyUI-VoxCPMTTS.git

Install dependencies:

cd ComfyUI-VoxCPMTTS
pip install -r requirements.txt

Restart ComfyUI

Dependencies

The node will automatically install required dependencies on first use:

huggingface_hub>=0.20.0
einops>=0.6.0
pydantic>=2.0.0
wetext>=0.1.0
openai-whisper>=20231117

Optional dependencies for enhanced functionality:

faster-whisper (preferred for ASR)
openai-whisper (fallback for ASR)

Model Download

The VoxCPM-0.5B model (~1.2GB) is downloaded automatically to ComfyUI/models/TTS/VoxCPM-0.5B/ on first use.

Model page: https://huggingface.co/openbmb/VoxCPM-0.5B
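For machines that need the model in place before the first run, the download can be done ahead of time. This is a minimal sketch, assuming `huggingface_hub` is installed and that the node reads the files from the folder named above. The helper names `model_dir` and `prefetch` are hypothetical and are not part of this node.

```python
from pathlib import Path

REPO_ID = "openbmb/VoxCPM-0.5B"


def model_dir(comfyui_root: str) -> Path:
    """Return the folder this README names as the model location."""
    return Path(comfyui_root) / "models" / "TTS" / "VoxCPM-0.5B"


def prefetch(comfyui_root: str) -> Path:
    """Download the model snapshot into that folder (requires network access)."""
    # Imported lazily so model_dir() works even without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = model_dir(comfyui_root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target


# Example (run from the directory that contains ComfyUI/):
# prefetch("ComfyUI")
```

Whether the node accepts a pre-populated folder without re-downloading is an assumption; check the node's log output on first use.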

Usage

Text-to-Speech

1. Add the VoxCPMTTS node to your workflow
2. Input your text in the text field
3. Adjust parameters as needed:
   - cfg_value: Controls adherence to prompt (1.0-10.0, default: 2.0)
   - inference_steps: Quality vs speed tradeoff (1-100, default: 10)
   - max_length: Maximum token length (256-8192, default: 4096)
4. Connect the audio output to your desired destination

Voice Cloning

1. Connect a reference audio to the reference_audio input
2. Optionally provide reference_text (transcript of the reference audio)
   - If left empty, the node will automatically transcribe the audio
3. The generated speech will mimic the reference voice characteristics

Parameters

Parameter	Type	Default	Description
`text`	STRING	"Hello, this is VoxCPMTTS."	Text to synthesize
`cfg_value`	FLOAT	2.0	Guidance scale (higher = more prompt adherence)
`inference_steps`	INT	10	Diffusion steps (higher = better quality)
`max_length`	INT	4096	Maximum token length
`normalize`	BOOLEAN	True	Enable text normalization
`seed`	INT	-1	Random seed (-1 for random)
`device`	COMBO	auto	Device selection (auto/cuda/mps/cpu)
`reference_audio`	AUDIO	-	Reference audio for voice cloning
`reference_text`	STRING	""	Reference audio transcript
`fade_in_ms`	INT	20	Fade-in duration (0-1000ms)
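The defaults and ranges documented above can be collected in one place and checked before a workflow is queued. The sketch below is illustrative only: `TTSSettings` is a hypothetical helper, not a class exposed by this node, and the seed range used for `-1` is an assumption.

```python
import random
from dataclasses import dataclass

DEVICES = ("auto", "cuda", "mps", "cpu")


@dataclass
class TTSSettings:
    """Mirror of the documented node inputs (reference_audio omitted)."""
    text: str = "Hello, this is VoxCPMTTS."
    cfg_value: float = 2.0
    inference_steps: int = 10
    max_length: int = 4096
    normalize: bool = True
    seed: int = -1
    device: str = "auto"
    reference_text: str = ""
    fade_in_ms: int = 20

    def validate(self) -> None:
        """Raise ValueError if a value is outside the range given in this README."""
        checks = [
            ("cfg_value", self.cfg_value, 1.0, 10.0),
            ("inference_steps", self.inference_steps, 1, 100),
            ("max_length", self.max_length, 256, 8192),
            ("fade_in_ms", self.fade_in_ms, 0, 1000),
        ]
        for name, value, low, high in checks:
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        if self.device not in DEVICES:
            raise ValueError(f"device must be one of {DEVICES}")

    def resolved_seed(self) -> int:
        """Return the seed to use; -1 draws a fresh one (range is an assumption)."""
        if self.seed == -1:
            return random.randrange(2**32)
        return self.seed
```

Calling `TTSSettings(cfg_value=3.0, inference_steps=20).validate()` passes silently, while an out-of-range value raises `ValueError` with the offending name.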

Outputs

REFERENCE_TEXT: Transcribed or provided reference text
AUDIO: Generated speech audio at a 16 kHz sample rate
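To make the relationship between `fade_in_ms` and the 16 kHz output concrete, here is a small sketch of a fade-in on raw samples. It assumes a linear ramp; the curve the node actually applies is not documented here, and both function names are hypothetical.

```python
SAMPLE_RATE = 16_000  # output sample rate stated in this README


def fade_in_samples(fade_in_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples covered by a fade of fade_in_ms milliseconds."""
    return sample_rate * fade_in_ms // 1000


def apply_fade_in(samples, fade_in_ms: int, sample_rate: int = SAMPLE_RATE):
    """Return a copy of samples with a linear 0-to-1 ramp over the fade window."""
    n = min(fade_in_samples(fade_in_ms, sample_rate), len(samples))
    out = list(samples)
    for i in range(n):
        out[i] *= i / n  # gain rises linearly; sample n and later are untouched
    return out
```

With the default of 20 ms, the ramp spans 320 samples, so only the first fiftieth of a second of audio is attenuated.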

Environment Variables

Set these environment variables to customize behavior:

# ASR model size (tiny, small, medium, large)
export VOXCPM_ASR_MODEL=small

# Maximum retry attempts for bad cases
export VOXCPM_RETRY_MAX=2
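The same two variables can be read and sanity-checked from Python, for example in a launcher script. This is a sketch under assumptions: the fallback values `small` and `2` are taken from the example above and are not confirmed defaults of the node, and `read_settings` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

ASR_SIZES = ("tiny", "small", "medium", "large")


def read_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read VOXCPM_* variables from env (defaults to os.environ) and validate them."""
    env = os.environ if env is None else env
    # Fallback values below are assumptions copied from the README example.
    asr_model = env.get("VOXCPM_ASR_MODEL", "small")
    if asr_model not in ASR_SIZES:
        raise ValueError(f"VOXCPM_ASR_MODEL must be one of {ASR_SIZES}")
    retry_max = int(env.get("VOXCPM_RETRY_MAX", "2"))
    if retry_max < 0:
        raise ValueError("VOXCPM_RETRY_MAX must be zero or positive")
    return {"asr_model": asr_model, "retry_max": retry_max}
```

Passing a plain dict instead of `os.environ` makes the helper easy to test without touching the real environment.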

Device Selection

auto: Automatically selects the best available device
cuda: Force CUDA if available
mps: Force MPS (Apple Silicon) if available
cpu: Force CPU processing
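The selection rules above can be sketched as a small pure function. Two details are assumptions, since the README does not state them: `auto` is taken to prefer CUDA, then MPS, then CPU, and a forced device that is unavailable is taken to fall back to CPU. `resolve_device` is a hypothetical name.

```python
def resolve_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Map the device option to a concrete device name."""
    available = {"cuda": cuda_available, "mps": mps_available, "cpu": True}
    if requested == "auto":
        # Assumed preference order: CUDA, then MPS, then CPU.
        for name in ("cuda", "mps", "cpu"):
            if available[name]:
                return name
    if requested not in available:
        raise ValueError(f"unknown device option: {requested!r}")
    # Assumption: a forced device that is missing falls back to CPU.
    return requested if available[requested] else "cpu"
```

With PyTorch installed, the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.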

Example Workflows

Basic TTS Workflow

[Text Input] → [VoxCPMTTS] → [Audio Output]

Voice Cloning Workflow

[Reference Audio] → [VoxCPMTTS] ← [Target Text]
                         ↓
                   [Cloned Audio]

Batch Processing Workflow

[Text List] → [VoxCPMTTS] → [Audio Batch] → [Save Audio]

Performance Tips

Memory Optimization

Use lower inference_steps for faster generation
Choose appropriate max_length for your text
Use CPU device if GPU memory is limited

Quality Settings

Fast: cfg_value=1.5, inference_steps=5
Balanced: cfg_value=2.0, inference_steps=10 (default)
High Quality: cfg_value=3.0, inference_steps=20
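If these presets are applied from a script, they can be kept as a lookup table. The values are the ones listed above; the key names and the `preset` helper are hypothetical.

```python
QUALITY_PRESETS = {
    "fast": {"cfg_value": 1.5, "inference_steps": 5},
    "balanced": {"cfg_value": 2.0, "inference_steps": 10},  # node defaults
    "high_quality": {"cfg_value": 3.0, "inference_steps": 20},
}


def preset(name: str) -> dict:
    """Return a copy of the named preset so callers can modify it safely."""
    try:
        return dict(QUALITY_PRESETS[name])
    except KeyError:
        raise ValueError(f"unknown preset: {name!r}") from None
```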

Voice Cloning Tips

Use high-quality reference audio (16kHz+)
Reference audio should be 3-30 seconds long
Clear speech with minimal background noise works best
Provide accurate reference text when possible
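The sample-rate and duration tips can be checked before a file is loaded into a workflow. The sketch below uses only Python's standard `wave` module, so it assumes an uncompressed PCM WAV file; `check_reference_wav` is a hypothetical helper and not part of the node.

```python
import wave


def check_reference_wav(path: str) -> list:
    """Return a list of warnings; an empty list means the file matches the tips."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    warnings = []
    if rate < 16_000:
        warnings.append(f"sample rate {rate} Hz is below 16 kHz")
    if not 3.0 <= duration <= 30.0:
        warnings.append(f"duration {duration:.1f} s is outside 3-30 s")
    return warnings
```

Background noise and transcript accuracy cannot be checked this way and still need a listen.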

Troubleshooting

Common Issues

Model download fails

Check internet connection
Ensure sufficient disk space (~1.2GB)
Try clearing the download cache

Out of memory errors

Reduce max_length
Lower inference_steps
Switch to CPU device
Close other GPU-intensive applications

Poor voice cloning quality

Ensure reference audio is clear and high-quality
Verify reference text accuracy
Try different cfg_value settings
Use reference audio from the same speaker

ASR transcription errors

Install faster-whisper for better performance
Provide manual reference_text instead
Use clearer reference audio

Debug Mode

Enable verbose logging by setting:

export COMFYUI_LOG_LEVEL=DEBUG

Model Information

This node uses the VoxCPM-0.5B model developed by OpenBMB:

Model Size: ~500M parameters
Audio Quality: 16kHz sample rate
Languages: Primarily optimized for English and Chinese
License: Apache 2.0
Paper: VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

OpenBMB for the VoxCPM model
ComfyUI community
All contributors and users

Support

If you encounter any issues or have questions:

Open an issue on GitHub
Check the troubleshooting section above
Join the ComfyUI community discussions

Star this repository if you find it useful! ⭐
