Skip to content

A clean, efficient ComfyUI custom node for VoxCPM TTS (Text-to-Speech) functionality. This implementation provides high-quality speech generation and voice cloning capabilities using the VoxCPM model.

License

Notifications You must be signed in to change notification settings

ComfyNodePRs/PR-ComfyUI-VoxCPMTTS-b0727f6e

Β 
Β 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

8 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

ComfyUI VoxCPMTTS Node

A clean, efficient ComfyUI custom node for VoxCPMTTS (Text-to-Speech) functionality. This implementation provides high-quality speech generation and voice cloning capabilities using the VoxCPM model.

VoxCPMTTX_v1.0.0

Features

  • 🎯 High-Quality TTS: Generate natural-sounding speech from text
  • 🎭 Voice Cloning: Clone any voice using a reference audio sample
  • πŸ”„ Auto-Transcription: Automatic speech recognition for reference audio
  • ⚑ Multi-Device Support: CUDA, MPS, and CPU compatibility
  • πŸŽ›οΈ Fine-Tuned Control: Adjustable guidance scale, inference steps, and more
  • πŸ”Š Audio Post-Processing: Built-in fade-in to reduce artifacts

Installation

Method 1: ComfyUI Manager (Recommended)

  1. Open ComfyUI Manager
  2. Search for "VoxCPMTTS"
  3. Install the node

Method 2: Manual Installation

  1. Navigate to your ComfyUI custom nodes directory:
cd ComfyUI/custom_nodes/
  1. Clone this repository:
git clone https://github.com/1038lab/ComfyUI-VoxCPMTTS.git
  1. Install dependencies:
cd ComfyUI-VoxCPMTTS
pip install -r requirements.txt
  1. Restart ComfyUI

Dependencies

The node will automatically install required dependencies on first use:

  • huggingface_hub>=0.20.0
  • einops>=0.6.0
  • pydantic>=2.0.0
  • wetext>=0.1.0
  • openai-whisper>=20231117

Optional dependencies for enhanced functionality:

  • faster-whisper (preferred for ASR)
  • openai-whisper (fallback for ASR)

Model Download

The VoxCPM-0.5B model (~1.2GB) will be automatically downloaded to ComfyUI/models/TTS/VoxCPM-0.5B/ on first use. https://huggingface.co/openbmb/VoxCPM-0.5B

Usage

Text-to-Speech

  1. Add the VoxCPMTTS node to your workflow
  2. Input your text in the text field
  3. Adjust parameters as needed:
    • cfg_value: Controls adherence to prompt (1.0-10.0, default: 2.0)
    • inference_steps: Quality vs speed tradeoff (1-100, default: 10)
    • max_length: Maximum token length (256-8192, default: 4096)
  4. Connect the audio output to your desired destination

Voice Cloning

  1. Connect a reference audio to the reference_audio input
  2. Optionally provide reference_text (transcript of the reference audio)
    • If left empty, the node will automatically transcribe the audio
  3. The generated speech will mimic the reference voice characteristics

Parameters

Parameter Type Default Description
text STRING "Hello, this is VoxCPMTTS." Text to synthesize
cfg_value FLOAT 2.0 Guidance scale (higher = more prompt adherence)
inference_steps INT 10 Diffusion steps (higher = better quality)
max_length INT 4096 Maximum token length
normalize BOOLEAN True Enable text normalization
seed INT -1 Random seed (-1 for random)
device COMBO auto Device selection (auto/cuda/mps/cpu)
reference_audio AUDIO - Reference audio for voice cloning
reference_text STRING "" Reference audio transcript
fade_in_ms INT 20 Fade-in duration (0-1000ms)

Outputs

  • REFERENCE_TEXT: Transcribed or provided reference text
  • AUDIO: Generated speech audio with 16kHz sample rate

Environment Variables

Set these environment variables to customize behavior:

# ASR model size (tiny, small, medium, large)
export VOXCPM_ASR_MODEL=small

# Maximum retry attempts for bad cases
export VOXCPM_RETRY_MAX=2

Device Selection

  • auto: Automatically selects the best available device
  • cuda: Force CUDA if available
  • mps: Force MPS (Apple Silicon) if available
  • cpu: Force CPU processing

Example Workflows

Basic TTS Workflow

[Text Input] β†’ [VoxCPMTTS] β†’ [Audio Output]

Voice Cloning Workflow

[Reference Audio] β†’ [VoxCPMTTS] ← [Target Text]
                         ↓
                   [Cloned Audio]

Batch Processing Workflow

[Text List] β†’ [VoxCPMTTS] β†’ [Audio Batch] β†’ [Save Audio]

Performance Tips

Memory Optimization

  • Use lower inference_steps for faster generation
  • Choose appropriate max_length for your text
  • Use CPU device if GPU memory is limited

Quality Settings

  • Fast: cfg_value=1.5, inference_steps=5
  • Balanced: cfg_value=2.0, inference_steps=10 (default)
  • High Quality: cfg_value=3.0, inference_steps=20

Voice Cloning Tips

  • Use high-quality reference audio (16kHz+)
  • Reference audio should be 3-30 seconds long
  • Clear speech with minimal background noise works best
  • Provide accurate reference text when possible

Troubleshooting

Common Issues

Model download fails

  • Check internet connection
  • Ensure sufficient disk space (~1.2GB)
  • Try clearing the download cache

Out of memory errors

  • Reduce max_length
  • Lower inference_steps
  • Switch to CPU device
  • Close other GPU-intensive applications

Poor voice cloning quality

  • Ensure reference audio is clear and high-quality
  • Verify reference text accuracy
  • Try different cfg_value settings
  • Use reference audio from the same speaker

ASR transcription errors

  • Install faster-whisper for better performance
  • Provide manual reference_text instead
  • Use clearer reference audio

Debug Mode

Enable verbose logging by setting:

export COMFYUI_LOG_LEVEL=DEBUG

Model Information

This node uses the VoxCPM-0.5B model developed by OpenBMB:

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

  • OpenBMB for the VoxCPM model
  • ComfyUI community
  • All contributors and users

Support

If you encounter any issues or have questions:

  • Open an issue on GitHub
  • Check the troubleshooting section above
  • Join the ComfyUI community discussions

Star this repository if you find it useful! ⭐

About

A clean, efficient ComfyUI custom node for VoxCPM TTS (Text-to-Speech) functionality. This implementation provides high-quality speech generation and voice cloning capabilities using the VoxCPM model.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%