T5Voice


T5Voice is a lightweight PyTorch implementation of T5-based text-to-speech synthesis (also known as T5-TTS), supporting both streaming and non-streaming speech synthesis with zero-shot capabilities.

Recent studies [1][2] have shown that T5-based text-to-speech models can generate highly natural and intelligible speech by learning monotonic alignment over discrete audio codecs. We introduce T5Voice so that researchers and developers can easily reproduce the training and inference pipelines of T5-TTS and extend them for further experiments and applications.

Features

  • High-Quality Speech: Generates natural-sounding speech with a T5-based architecture
  • Zero-Shot Synthesis: Synthesizes speech in a new voice from only a few seconds of reference audio
  • Streaming Generation: Real-time, low-latency streaming generation with a configurable chunk size
  • Multi-GPU Support: Multi-GPU training with Accelerate for faster model training
  • Easy Deployment: One-click deployment on NVIDIA GPUs based on the Triton Inference Server


Architecture

T5Voice adapts the T5 architecture for text-to-speech synthesis. The model treats speech generation as a sequence-to-sequence task: input text is mapped to discrete audio codec tokens, which are then converted back to waveforms by an audio codec model. T5Voice also uses the attention prior and alignment CTC loss proposed in [1] to accelerate the learning of monotonic alignment. The end-to-end flow is sketched below the component list.

Key Components

  • T5 Encoder: Encodes the input text (reference audio transcript concatenated with text to synthesize) into encoded features
  • T5 Decoder: Autoregressively generates codec tokens conditioned on encoded features and codec tokens of the reference audio
  • Codec Encoder: Converts waveforms at 22kHz into codec tokens. Used to extract codec tokens from the reference audio.
  • Codec Decoder: Converts codec tokens back to waveforms at 22kHz. Used to convert synthesized codec tokens to waveforms.

[Architecture diagram]
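As a rough illustration of how these components fit together, here is a minimal Python sketch of the zero-shot synthesis flow. All names below are illustrative placeholders rather than the actual T5Voice API; the components are passed in as callables so the sketch stays self-contained.

# Illustrative sketch of the zero-shot synthesis flow (placeholder names,
# not the actual T5Voice API).
def synthesize(tokenizer, codec_encoder, t5_encoder, t5_decoder, codec_decoder,
               reference_wav, reference_text, target_text):
    # 1. Encode the text: reference transcript concatenated with the text to synthesize.
    text_tokens = tokenizer(reference_text + " " + target_text)
    encoded_text = t5_encoder(text_tokens)

    # 2. Extract codec tokens from the 22 kHz reference waveform.
    reference_codes = codec_encoder(reference_wav)

    # 3. Autoregressively generate codec tokens, conditioned on the encoded text
    #    and primed with the reference codec tokens.
    generated_codes = t5_decoder(encoded_text, prefix=reference_codes)

    # 4. Convert the generated codec tokens back to a 22 kHz waveform.
    return codec_decoder(generated_codes)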

Environment Setup

# Create and activate conda environment
conda create --name t5voice python==3.10.12
conda activate t5voice

# Install system dependencies
apt install libsndfile1-dev espeak-ng -y

# Install Python dependencies
pip install -r requirements.txt

Dataset Preparation

Download the LibriTTS and Hi-Fi TTS datasets:

# LibriTTS
bash t5voice/preprocess/download_libritts.sh ./datasets/libritts

# Hi-Fi TTS
bash t5voice/preprocess/download_hifitts.sh ./datasets/hifitts

Data Preprocessing

Preprocess the datasets to extract mel codec features:

# Hi-Fi TTS
python -m t5voice.preprocess.preprocess_hifitts \
  --dataset-dir=./datasets/hifitts/hi_fi_tts_v0 \
  --output-dir=./datasets/hifitts/hi_fi_tts_v0/codec \
  --codec-model=mel_codec_22khz_medium \
  --batch-size=32

# LibriTTS
python -m t5voice.preprocess.preprocess_libritts \
  --dataset-dir=./datasets/libritts/LibriTTS/ \
  --output-dir=./datasets/libritts/LibriTTS/codec \
  --codec-model=mel_codec_22khz_medium \
  --batch-size=32

Filelist Generation

Generate train/dev/test splits for each dataset:

# Hi-Fi TTS
mkdir -p ./t5voice/filelists/hifitts/
python -m t5voice.generate_filelists \
  --dataset_name=hifitts \
  --dataset_dir=./datasets/hifitts/hi_fi_tts_v0/ \
  --codec_dir=codec \
  --train_filelist=./t5voice/filelists/hifitts/train_filelist.txt \
  --test_filelist=./t5voice/filelists/hifitts/test_filelist.txt \
  --dev_filelist=./t5voice/filelists/hifitts/dev_filelist.txt

# LibriTTS
mkdir -p ./t5voice/filelists/libritts/
python -m t5voice.generate_filelists \
  --dataset_name=libritts \
  --dataset_dir=./datasets/libritts/LibriTTS/ \
  --codec_dir=codec \
  --train_filelist=./t5voice/filelists/libritts/train_filelist.txt \
  --test_filelist=./t5voice/filelists/libritts/test_filelist.txt \
  --dev_filelist=./t5voice/filelists/libritts/dev_filelist.txt

Training

Single-GPU Training

python -m t5voice.main --config-name=t5voice_base_libritts_hifitts
# or
bash train.sh

Multi-GPU Training

accelerate launch -m t5voice.main --config-name=t5voice_base_libritts_hifitts
# or
bash train_multi_gpu.sh

Resume Training from Checkpoint

python -m t5voice.main \
  --config-name=t5voice_base_libritts_hifitts \
  model.restore_from=./logs/[DATE_DIR]/[TIME_DIR]/checkpoint-pt-10000/

Inference

Non-Streaming Synthesis

python -m t5voice.inference \
    --config-name=t5voice_base_libritts_hifitts \
    hydra.run.dir=. \
    hydra.output_subdir=null \
    hydra/job_logging=disabled \
    hydra/hydra_logging=disabled \
    model.checkpoint_path="checkpoints/t5voice_base_libritts_hifitts/checkpoint-pt-250000/model.safetensors" \
    infer.use_logits_processors=true \
    infer.top_k=80 \
    infer.top_p=1.0 \
    infer.temperature=0.85 \
    +reference_audio_path="reference.wav" \
    +reference_audio_text_path="reference.txt" \
    +text_path="text.txt" \
    +output_audio_path="output_t5voice.wav" \
    +max_generation_steps=3000 \
    +use_cache=true \
    +streaming=false
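The infer.top_k, infer.top_p, and infer.temperature options correspond to standard logits processing for autoregressive sampling. As a rough illustration only (not the exact T5Voice implementation), this is how temperature scaling, top-k, and top-p (nucleus) filtering are typically applied to the decoder logits before sampling the next codec token:

import torch

def sample_next_token(logits, top_k=80, top_p=1.0, temperature=0.85):
    # logits: 1-D tensor of next-step scores over the codec vocabulary.
    # Temperature scaling: values < 1 sharpen the distribution, > 1 flatten it.
    logits = logits / temperature

    # Top-k filtering: mask everything outside the k most likely tokens.
    if top_k > 0:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))

    # Top-p (nucleus) filtering: mask the tail of the sorted distribution
    # once the cumulative probability exceeds top_p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        to_remove = cum_probs > top_p
        to_remove[1:] = to_remove[:-1].clone()  # always keep the most likely token
        to_remove[0] = False
        logits[sorted_idx[to_remove]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()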

Streaming Synthesis

For streaming generation, set streaming=true and specify chunk_size and overlap_size:

python -m t5voice.inference \
    --config-name=t5voice_base_libritts_hifitts \
    hydra.run.dir=. \
    hydra.output_subdir=null \
    hydra/job_logging=disabled \
    hydra/hydra_logging=disabled \
    model.checkpoint_path="checkpoints/t5voice_base_libritts_hifitts/checkpoint-pt-250000/model.safetensors" \
    infer.use_logits_processors=true \
    infer.top_k=80 \
    infer.top_p=1.0 \
    infer.temperature=0.85 \
    +reference_audio_path="reference.wav" \
    +reference_audio_text_path="reference.txt" \
    +text_path="text.txt" \
    +output_audio_path="output_t5voice.wav" \
    +max_generation_steps=3000 \
    +use_cache=true \
    +streaming=true \
    +chunk_size=50 \
    +overlap_size=2

# or
bash inference.sh
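The chunk_size and overlap_size options control how many codec frames are generated per streaming step and how many frames consecutive chunks share. One common way to stitch overlapping audio chunks into a continuous waveform is to cross-fade the shared region; the sketch below is purely illustrative of that idea (it operates on raw samples and is not the actual T5Voice stitching code; the CLI values above are in codec frames):

import numpy as np

def stitch_chunks(chunks, overlap_samples):
    # chunks: list of 1-D float arrays received from the streaming decoder.
    # overlap_samples: number of samples shared by consecutive chunks.
    fade_in = np.linspace(0.0, 1.0, overlap_samples)
    fade_out = 1.0 - fade_in
    output = chunks[0]
    for chunk in chunks[1:]:
        # Blend the tail of the accumulated output with the head of the new chunk.
        blended = output[-overlap_samples:] * fade_out + chunk[:overlap_samples] * fade_in
        output = np.concatenate([output[:-overlap_samples], blended, chunk[overlap_samples:]])
    return output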

Monitoring

Monitor training progress using TensorBoard:

tensorboard --logdir=./logs/[DATE_DIR]/[TIME_DIR]/tensorboard/ --port=6007

Then open your browser and navigate to http://localhost:6007 to view training metrics, losses, and sample outputs.

On the SCALARS tab, you can view all the training and evaluation loss curves.

[TensorBoard SCALARS tab screenshot]

On the IMAGES tab, you can view the alignment visualizations (decoder cross-attention weights).

[TensorBoard IMAGES tab screenshot: alignment visualization]

A well-trained model should exhibit a clear monotonic alignment.

On the AUDIO tab, you can listen to the ground-truth audio samples and their corresponding generated (predicted) audio.

[TensorBoard AUDIO tab screenshot]

Note that the beginning of the predicted audio comes from the ground-truth audio and serves as the reference for the decoder. Please focus on the latter part of the audio to determine whether the model is able to generate intelligible speech.

Deployment

T5Voice supports one-click accelerated streaming deployment on GPUs using NVIDIA TensorRT and the Triton Inference Server.

First, export your trained model to ONNX format:

bash export.sh

After export, you will obtain the following ONNX models: t5_encoder.onnx, t5_decoder.onnx, codec_encode.onnx, and codec_decode.onnx.

You can run inference directly with the ONNX models to verify their correctness:

bash inference_onnx.sh
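You can also quickly inspect the exported models with onnxruntime to confirm that they load and to see their input/output signatures (a simple sanity check, independent of inference_onnx.sh):

import onnxruntime as ort

for path in ["t5_encoder.onnx", "t5_decoder.onnx", "codec_encode.onnx", "codec_decode.onnx"]:
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    print(f"== {path}")
    for tensor in session.get_inputs():
        print(f"  input:  {tensor.name} {tensor.shape} {tensor.type}")
    for tensor in session.get_outputs():
        print(f"  output: {tensor.name} {tensor.shape} {tensor.type}")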

Copy all ONNX models into the deployment directory and navigate to it:

cp *.onnx deployment && cd deployment

Build the server image:

bash build_server.sh

This process is based on the official NVIDIA Triton Inference Server image. It installs all dependencies, builds TensorRT engines, and saves the result as a new Docker image named t5voice_tritonserver:latest.

Start the T5Voice server:

bash run_server.sh

Then, run the T5Voice client to send a TTS request to the server and receive streaming responses:

pip install -r requirements_client.txt
bash run_client.sh

If everything works correctly, you will see output similar to the following:

[INFO] Running Triton client...

Server URL:     0.0.0.0:8001
Reference audio: reference.wav
Reference text:  I’m an assistant here to help with questions, provide information, and support you in various tasks,
Input text:      I can also offer suggestions, clarify complex topics, and make problem solving easier and more efficient.
Output file:     output.wav

Connected to Triton server at 0.0.0.0:8001
Model 't5voice' is ready
Loading reference audio from: reference.wav
Reference audio shape: (140238,)
Reference text: I’m an assistant here to help with questions, provide information, and support you in various tasks,
Target text: I can also offer suggestions, clarify complex topics, and make problem solving easier and more efficient.

Starting streaming inference...
============================================================
Chunks received: 11 | Time: 3.01s
============================================================
Streaming inference completed!

Saving audio to: output.wav

Statistics:
  Total chunks: 11
  Total samples: 138496
  Audio duration: 6.28 seconds
  Total time: 3.03 seconds
  First chunk latency: 296.65 ms
  Generation time: 2.70 seconds
  Real-time factor: 0.48

Done!

[INFO] Client request completed!
Generated audio saved to: output.wav

The first-chunk latency and real-time factor may vary depending on your GPU and CPU performance.
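run_client.sh wraps a Python Triton client. If you prefer to call the server from your own code, the sketch below shows the general pattern for streaming (decoupled) gRPC inference with tritonclient. The tensor names, shapes, and output handling here are placeholders, so consult the client script and the model configuration under deployment/ for the actual values.

import numpy as np
import soundfile as sf
import tritonclient.grpc as grpcclient

chunks = []

def on_response(result, error):
    # Called once per streamed audio chunk.
    if error is not None:
        print(error)
        return
    chunks.append(result.as_numpy("OUTPUT_AUDIO"))  # placeholder output name

client = grpcclient.InferenceServerClient(url="0.0.0.0:8001")
assert client.is_model_ready("t5voice")

audio, _ = sf.read("reference.wav", dtype="float32")

# Placeholder input names and shapes; the real ones are defined by the deployed model.
inputs = [
    grpcclient.InferInput("REFERENCE_AUDIO", [1, len(audio)], "FP32"),
    grpcclient.InferInput("REFERENCE_TEXT", [1, 1], "BYTES"),
    grpcclient.InferInput("TARGET_TEXT", [1, 1], "BYTES"),
]
inputs[0].set_data_from_numpy(audio.reshape(1, -1))
inputs[1].set_data_from_numpy(np.array([["reference transcript"]], dtype=object))
inputs[2].set_data_from_numpy(np.array([["text to synthesize"]], dtype=object))

client.start_stream(callback=on_response)
client.async_stream_infer(model_name="t5voice", inputs=inputs)
client.stop_stream()  # a production client should wait for the server's final-response flag first

sf.write("output.wav", np.concatenate(chunks, axis=-1).squeeze(), 22050)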

Checkpoints

If you would like to access pretrained checkpoints or pre-exported models for testing T5Voice, please fill out the form to request access.

Audio Samples

The following samples use different voices reading the sentence
“I’m an assistant here to help with questions, provide information, and support you in various tasks.”
as the reference audio, and synthesize the sentence
“I can also offer suggestions, clarify complex topics, and make problem solving easier and more efficient.”

🎧 Female 1 👩
  Reference: reference_female_1.mp4
  Synthesized: t5voice_output_female_1.mp4

🎧 Female 2 👩
  Reference: reference_female_2.mp4
  Synthesized: t5voice_output_female_2.mp4

🎧 Female 3 👩
  Reference: reference_female_3.mp4
  Synthesized: t5voice_output_female_3.mp4

🎧 Male 1 👨
  Reference: reference_male_1.mp4
  Synthesized: t5voice_output_male_1.mp4

🎧 Male 2 👨
  Reference: reference_male_2.mp4
  Synthesized: t5voice_output_male_2.mp4

🎧 Male 3 👨
  Reference: reference_male_3.mp4
  Synthesized: t5voice_output_male_3.mp4

Citation

If you use T5Voice in your research, please cite:

@misc{t5voice,
  title={T5Voice: A Lightweight PyTorch Implementation of T5-based Text-to-Speech Synthesis},
  author={Muyang Du},
  year={2025},
  howpublished={\url{https://github.com/MuyangDu/T5Voice}}
}

Please also consider citing the T5-TTS papers listed in the References section.

Acknowledgements

Contact

If you have any questions, issues, or need technical support, please feel free to contact us by filling out this form.

References

To learn more about T5-based text-to-speech synthesis, please refer to the following papers:

[1] Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, and Boris Ginsburg. Improving Robustness of LLM-Based Speech Synthesis by Learning Monotonic Alignment. arXiv preprint arXiv:2406.17957, 2024.

[2] Eric Battenberg, R.J. Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, and David Kao. Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech. arXiv preprint arXiv:2410.22179, 2024.

Please note that T5Voice is not a strict reproduction of the above papers, and some implementation details may differ.
