T5Voice is a lightweight PyTorch implementation of T5-based text-to-speech synthesis (also known as T5-TTS), supporting both streaming and non-streaming speech synthesis with zero-shot capabilities.
Recent studies [1][2] have shown that T5-based text-to-speech models can generate highly natural and intelligible speech by learning monotonic alignment over discrete audio codecs. We introduce T5Voice for researchers and developers to easily reproduce the training and inference processes of T5-TTS, and to extend it for further experiments and applications.
- High-Quality Speech: Generate natural-sounding speech using T5-based architecture
- Zero-Shot Synthesis: Zero-shot speech synthesis using only a few seconds of reference audio
- Streaming Generation: Real-time low-latency streaming generation with configurable chunk size
- Multi-GPU Support: Multi-GPU training support with Accelerate for faster model training
- Easy Deployment: One-click deployment on NVIDIA GPUs with NVIDIA TensorRT and Triton Inference Server
- Architecture
- Environment Setup
- Dataset Preparation
- Data Preprocessing
- Filelist Generation
- Training
- Inference
- Monitoring
- Deployment
- Checkpoints
- Audio Samples
- Citation
- Acknowledgements
- Contact
- References
T5Voice adapts the T5 architecture for text-to-speech synthesis. The model treats speech generation as a sequence-to-sequence task: the input text is mapped to discrete audio codec tokens, which are then converted back to waveforms by an audio codec model. T5Voice also uses the attention prior and alignment CTC loss proposed in [1] to accelerate the learning of monotonic alignment (a sketch of the attention prior follows the component list below).
- T5 Encoder: Encodes the input text (the reference audio transcript concatenated with the text to synthesize) into a sequence of hidden features
- T5 Decoder: Autoregressively generates codec tokens conditioned on encoded features and codec tokens of the reference audio
- Codec Encoder: Converts waveforms at 22kHz into codec tokens. Used to extract codec tokens from the reference audio.
- Codec Decoder: Converts codec tokens back to waveforms at 22kHz. Used to convert synthesized codec tokens to waveforms.
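
The attention prior mentioned above can be pictured as a static 2D matrix that, early in training, biases each decoder frame's cross-attention toward a narrow, roughly diagonal band of text positions. The sketch below constructs such a beta-binomial prior, the form used in related alignment-learning work; it is illustrative only, is not taken from the T5Voice code, and the scaling factor is a placeholder hyperparameter.

```python
# Illustrative 2D beta-binomial attention prior, in the spirit of the
# alignment-learning approach of [1]. Entry [t, s] is the prior probability
# that decoder (codec) frame t attends to text position s. NOT taken from
# the T5Voice code; the scaling factor is a placeholder hyperparameter.
import numpy as np
from scipy.stats import betabinom


def beta_binomial_attention_prior(num_frames: int, num_tokens: int, scaling: float = 1.0) -> np.ndarray:
    prior = np.zeros((num_frames, num_tokens))
    for t in range(1, num_frames + 1):
        a, b = scaling * t, scaling * (num_frames - t + 1)
        prior[t - 1] = betabinom(num_tokens - 1, a, b).pmf(np.arange(num_tokens))
    return prior  # typically added (in log space) to cross-attention scores early in training


prior = beta_binomial_attention_prior(num_frames=200, num_tokens=50)
print(prior.shape, prior.argmax(axis=1)[:5])  # the argmax drifts monotonically across text positions
```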
# Create and activate conda environment
conda create --name t5voice python==3.10.12
conda activate t5voice
# Install system dependencies
apt install libsndfile1-dev espeak-ng -y
# Install Python dependencies
pip install -r requirements.txt
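
As an optional sanity check (not part of the repo), the snippet below only assumes that PyTorch is installed by requirements.txt and that espeak-ng is available on the PATH:

```python
# Quick environment sanity check (not part of the T5Voice repo).
# Assumes PyTorch is installed by requirements.txt and espeak-ng is on the PATH.
import subprocess

import torch

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
subprocess.run(["espeak-ng", "--version"], check=True)  # prints the installed espeak-ng version
```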
Download the LibriTTS and Hi-Fi TTS datasets:

# LibriTTS
bash t5voice/preprocess/download_libritts.sh ./datasets/libritts
# Hi-Fi TTS
bash t5voice/preprocess/download_hifitts.sh ./datasets/hifitts

Preprocess the datasets to extract mel codec features:
# Hi-Fi TTS
python -m t5voice.preprocess.preprocess_hifitts \
--dataset-dir=./datasets/hifitts/hi_fi_tts_v0 \
--output-dir=./datasets/hifitts/hi_fi_tts_v0/codec \
--codec-model=mel_codec_22khz_medium \
--batch-size=32
# LibriTTS
python -m t5voice.preprocess.preprocess_libritts \
--dataset-dir=./datasets/libritts/LibriTTS/ \
--output-dir=./datasets/libritts/LibriTTS/codec \
--codec-model=mel_codec_22khz_medium \
--batch-size=32
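
Under the hood, these scripts load the pretrained NeMo audio codec named by --codec-model and turn each 22 kHz waveform into discrete codec tokens. The sketch below shows roughly what that looks like, assuming NeMo's AudioCodecModel interface (method names and signatures may differ across NeMo versions); it is not copied from the preprocessing code.

```python
# Rough sketch of codec-token extraction with a pretrained NeMo audio codec.
# Assumes NeMo's AudioCodecModel API (from_pretrained / encode / decode);
# illustrative only, not copied from the T5Voice preprocessing scripts.
import torch
import torchaudio
from nemo.collections.tts.models import AudioCodecModel

codec = AudioCodecModel.from_pretrained("mel_codec_22khz_medium").eval()

wav, sr = torchaudio.load("reference.wav")             # (channels, samples)
wav = torchaudio.functional.resample(wav, sr, 22050)   # codec operates on 22.05 kHz audio
wav = wav.mean(dim=0, keepdim=True)                    # mono, shape (1, samples)
wav_len = torch.tensor([wav.shape[-1]])

with torch.no_grad():
    tokens, tokens_len = codec.encode(audio=wav, audio_len=wav_len)        # discrete codec tokens
    recon, recon_len = codec.decode(tokens=tokens, tokens_len=tokens_len)  # back to a waveform

print(tokens.shape)  # e.g. (batch, num_codebooks, num_frames)
```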
Generate train/dev/test splits for each dataset:

# Hi-Fi TTS
mkdir -p ./t5voice/filelists/hifitts/
python -m t5voice.generate_filelists \
--dataset_name=hifitts \
--dataset_dir=./datasets/hifitts/hi_fi_tts_v0/ \
--codec_dir=codec \
--train_filelist=./t5voice/filelists/hifitts/train_filelist.txt \
--test_filelist=./t5voice/filelists/hifitts/test_filelist.txt \
--dev_filelist=./t5voice/filelists/hifitts/dev_filelist.txt
# LibriTTS
mkdir -p ./t5voice/filelists/libritts/
python -m t5voice.generate_filelists \
--dataset_name=libritts \
--dataset_dir=./datasets/libritts/LibriTTS/ \
--codec_dir=codec \
--train_filelist=./t5voice/filelists/libritts/train_filelist.txt \
--test_filelist=./t5voice/filelists/libritts/test_filelist.txt \
--dev_filelist=./t5voice/filelists/libritts/dev_filelist.txt

To train the model on a single GPU:

python -m t5voice.main --config-name=t5voice_base_libritts_hifitts
# or
bash train.sh

For multi-GPU training with Accelerate:

accelerate launch -m t5voice.main --config-name=t5voice_base_libritts_hifitts
# or
bash train_multi_gpu.sh

To resume training from a checkpoint, pass model.restore_from:

python -m t5voice.main \
--config-name=t5voice_base_libritts_hifitts \
model.restore_from=./logs/[DATE_DIR]/[TIME_DIR]/checkpoint-pt-10000/

To run (non-streaming) inference with a reference audio clip, its transcript, and the text to synthesize:

python -m t5voice.inference \
--config-name=t5voice_base_libritts_hifitts \
hydra.run.dir=. \
hydra.output_subdir=null \
hydra/job_logging=disabled \
hydra/hydra_logging=disabled \
model.checkpoint_path="checkpoints/t5voice_base_libritts_hifitts/checkpoint-pt-250000/model.safetensors" \
infer.use_logits_processors=true \
infer.top_k=80 \
infer.top_p=1.0 \
infer.temperature=0.85 \
+reference_audio_path="reference.wav" \
+reference_audio_text_path="reference.txt" \
+text_path="text.txt" \
+output_audio_path="output_t5voice.wav" \
+max_generation_steps=3000 \
+use_cache=true \
+streaming=false
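
When infer.use_logits_processors=true, the infer.temperature, infer.top_k, and infer.top_p options control how the next codec token is drawn from the decoder's output distribution at each step. The snippet below is a generic illustration of these three operations, not the exact logits processors used in T5Voice:

```python
# Generic temperature / top-k / top-p (nucleus) sampling over one step's logits.
# Shown only to illustrate what these options mean; NOT the exact
# logits-processor code used by T5Voice.
import torch


def sample_next_token(logits: torch.Tensor, temperature: float = 0.85,
                      top_k: int = 80, top_p: float = 1.0) -> int:
    logits = logits / temperature                      # <1 sharpens, >1 flattens the distribution
    if top_k > 0:                                      # keep only the top_k most likely tokens
        kth_best = torch.topk(logits, min(top_k, logits.numel())).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    if top_p < 1.0:                                    # nucleus: smallest set with cumulative mass >= top_p
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        sorted_probs[torch.cumsum(sorted_probs, dim=-1) - sorted_probs > top_p] = 0.0
        probs = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
        probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()
```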
For streaming generation, set streaming=true and specify chunk_size and overlap_size:

python -m t5voice.inference \
--config-name=t5voice_base_libritts_hifitts \
hydra.run.dir=. \
hydra.output_subdir=null \
hydra/job_logging=disabled \
hydra/hydra_logging=disabled \
model.checkpoint_path="checkpoints/t5voice_base_libritts_hifitts/checkpoint-pt-250000/model.safetensors" \
infer.use_logits_processors=true \
infer.top_k=80 \
infer.top_p=1.0 \
infer.temperature=0.85 \
+reference_audio_path="reference.wav" \
+reference_audio_text_path="reference.txt" \
+text_path="text.txt" \
+output_audio_path="output_t5voice.wav" \
+max_generation_steps=3000 \
+use_cache=true \
+streaming=true \
+chunk_size=50 \
+overlap_size=2
# or
bash inference.sh
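
In streaming mode, the decoder emits codec tokens in chunks of chunk_size frames (with overlap_size frames of overlap), and each chunk is decoded to audio as soon as it is available. A common way to join chunked audio that shares a small overlap is to cross-fade the overlapping samples; the sketch below illustrates that general technique and is not necessarily T5Voice's exact scheme.

```python
# Generic cross-fade stitching of streamed audio chunks that share a small
# overlap (in samples, after codec decoding). Illustrative only; not
# necessarily the exact scheme used by T5Voice's streaming pipeline.
import numpy as np


def stitch_chunks(chunks: list[np.ndarray], overlap: int) -> np.ndarray:
    assert overlap > 0
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    audio = chunks[0]
    for chunk in chunks[1:]:
        # Blend the tail of the audio so far with the head of the incoming chunk.
        blended = audio[-overlap:] * fade_out + chunk[:overlap] * fade_in
        audio = np.concatenate([audio[:-overlap], blended, chunk[overlap:]])
    return audio
```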
Monitor training progress using TensorBoard:

tensorboard --logdir=./logs/[DATE_DIR]/[TIME_DIR]/tensorboard/ --port=6007

Then open your browser and navigate to http://localhost:6007 to view training metrics, losses, and sample outputs.
On the SCALARS tab, you can view all the training and evaluation loss curves.
On the IMAGES tab, you can view the alignment visualizations (decoder cross-attention weights).
A well-trained model should exhibit a clear monotonic alignment.
On the AUDIO tab, you can listen to the ground-truth audio samples and their corresponding generated (predicted) audio.
Note that the beginning of the predicted audio comes from the ground-truth audio and serves as the reference for the decoder. Please focus on the latter part of the audio to determine whether the model is able to generate intelligible speech.
T5Voice supports one-click accelerated streaming deployment on GPUs using NVIDIA TensorRT and the Triton Inference Server.
First, export your trained model to ONNX format:
bash export.sh

After export, you will obtain the following ONNX models: t5_encoder.onnx, t5_decoder.onnx, codec_encode.onnx, and codec_decode.onnx.
You can run inference directly with the ONNX models to verify their correctness:
bash inference_onnx.sh
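
Independently of inference_onnx.sh, you can also inspect the exported graphs with ONNX Runtime to confirm that they load and to see their input/output signatures. This small helper is not part of the repo:

```python
# Load each exported ONNX model with ONNX Runtime and print its I/O signature.
# Standalone inspection helper; not part of the T5Voice repo.
import onnxruntime as ort

for name in ["t5_encoder.onnx", "t5_decoder.onnx", "codec_encode.onnx", "codec_decode.onnx"]:
    sess = ort.InferenceSession(name, providers=["CPUExecutionProvider"])
    print(f"== {name}")
    for i in sess.get_inputs():
        print("  input :", i.name, i.shape, i.type)
    for o in sess.get_outputs():
        print("  output:", o.name, o.shape, o.type)
```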
Copy all ONNX models into the deployment directory and navigate to it:
cp *.onnx deployment && cd deployment

Next, build the server image. This step is based on the official NVIDIA Triton Inference Server image: it installs all dependencies, builds the TensorRT engines, and saves the result as a new Docker image named t5voice_tritonserver:latest.
bash build_server.sh

Start the T5Voice server:
bash run_server.sh

Then, run the T5Voice client to send a TTS request to the server and receive streaming responses:
pip install -r requirements_client.txt
bash run_client.sh

If everything works correctly, you will see output similar to the following:
[INFO] Running Triton client...
Server URL: 0.0.0.0:8001
Reference audio: reference.wav
Reference text: I’m an assistant here to help with questions, provide information, and support you in various tasks,
Input text: I can also offer suggestions, clarify complex topics, and make problem solving easier and more efficient.
Output file: output.wav
Connected to Triton server at 0.0.0.0:8001
Model 't5voice' is ready
Loading reference audio from: reference.wav
Reference audio shape: (140238,)
Reference text: I’m an assistant here to help with questions, provide information, and support you in various tasks,
Target text: I can also offer suggestions, clarify complex topics, and make problem solving easier and more efficient.
Starting streaming inference...
============================================================
Chunks received: 11 | Time: 3.01s
============================================================
Streaming inference completed!
Saving audio to: output.wav
Statistics:
Total chunks: 11
Total samples: 138496
Audio duration: 6.28 seconds
Total time: 3.03 seconds
First chunk latency: 296.65 ms
Generation time: 2.70 seconds
Real-time factor: 0.48
Done!
[INFO] Client request completed!
Generated audio saved to: output.wav

The first-chunk latency and real-time factor may vary depending on your GPU and CPU performance.
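
If the client cannot connect or returns errors, you can sanity-check the deployed server directly from Python using Triton's gRPC client (a standalone helper, not part of the repo):

```python
# Sanity-check the deployed Triton server from Python (not part of the repo).
# Prints liveness, model readiness, and the model's input/output metadata.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="0.0.0.0:8001")
print("server live :", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready :", client.is_model_ready("t5voice"))

metadata = client.get_model_metadata("t5voice")
print(metadata)  # lists the model's input and output tensor names and datatypes
```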
If you would like to access pretrained checkpoints or pre-exported models for testing T5Voice, please fill out the form to request access.
The following samples use different voices reading the sentence
“I’m an assistant here to help with questions, provide information, and support you in various tasks.”
as the reference audio, and synthesize the sentence
“I can also offer suggestions, clarify complex topics, and make problem solving easier and more efficient.”
🎧 Female 1 👩 Click to listen
Reference: reference_female_1.mp4
Synthesized: t5voice_output_female_1.mp4
🎧 Female 2 👩 Click to listen
Reference: reference_female_2.mp4
Synthesized: t5voice_output_female_2.mp4
🎧 Female 3 👩 Click to listen
Reference: reference_female_3.mp4
Synthesized: t5voice_output_female_3.mp4
🎧 Male 1 👨 Click to listen
Reference: reference_male_1.mp4
Synthesized: t5voice_output_male_1.mp4
🎧 Male 2 👨 Click to listen
Reference: reference_male_2.mp4
Synthesized: t5voice_output_male_2.mp4
🎧 Male 3 👨 Click to listen
Reference: reference_male_3.mp4
Synthesized: t5voice_output_male_3.mp4
If you use T5Voice in your research, please cite:
@misc{t5voice,
  title={T5Voice: A Lightweight PyTorch Implementation of T5-based Text-to-Speech Synthesis},
  author={Muyang Du},
  year={2025},
  howpublished={\url{https://github.com/MuyangDu/T5Voice}}
}

Please also consider citing the T5-TTS papers listed in the References section.
- T5Voice is implemented based on nanoT5 (Encoder-Decoder / Pre-training + Fine-Tuning).
- T5Voice uses pretrained audio codec models from NVIDIA NeMo Framework.
- For enterprise-level speech synthesis, we recommend using NVIDIA Riva Magpie-TTS.
If you have any questions, issues, or need technical support, please feel free to contact us by filling out this form.
To learn more about T5-based text-to-speech synthesis, please refer to the following papers:
[1] Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, and Boris Ginsburg. Improving Robustness of LLM-Based Speech Synthesis by Learning Monotonic Alignment. arXiv preprint arXiv:2406.17957, 2024.
[2] Eric Battenberg, R.J. Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, and David Kao. Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech. arXiv preprint arXiv:2410.22179, 2024.
Please note that T5Voice is not a strict reproduction of the above papers, and some implementation details may differ.



