A FastAPI application that provides audio processing capabilities for converting audio files to text (speech-to-text) and text to audio files (text-to-speech).
- 🎵 Audio to Text: Convert audio files to text using OpenAI Whisper
- 🔊 Text to Audio: Convert text to audio files using VoxCPM-0.5B model
- 📊 Health Monitoring: Health check endpoint to monitor service status
- 🚀 Fast API: Built with FastAPI for high performance and automatic API documentation
- 🧪 Comprehensive Testing: Full test suite with pytest
- WAV
- MP3
- M4A
- OGG
- FLAC
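A client can pre-check a filename against these formats before uploading. A minimal sketch (the set and helper name are illustrative, not part of the API):

```python
from pathlib import Path

# Extensions accepted by the audio-to-text endpoint (per the list above).
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac"}

def is_supported(filename: str) -> bool:
    # Case-insensitive check of the file extension.
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```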
- Python 3.12+
- uv package manager
- Clone the repository:
```bash
git clone <repository-url>
cd speakify
```

- Install dependencies:

```bash
uv sync
```

- Activate the virtual environment:

```bash
source .venv/bin/activate
```

Run the development server with auto-reload:

```bash
uvicorn main:app --reload
```

Or specify host and port explicitly:

```bash
uvicorn main:app --host 0.0.0.0 --port 9876
```

Or run directly with Python:

```bash
python main.py
```

The application will be available at:
- API: http://localhost:9876
- Interactive API docs: http://localhost:9876/docs
- Alternative docs: http://localhost:9876/redoc
Returns basic API information.
Response:
```json
{
  "message": "Speakify API - Audio to Text and Text to Audio"
}
```

Health check endpoint that returns service and model status.
Response:
```json
{
  "status": "healthy",
  "whisper_loaded": true,
  "tts_loaded": true
}
```

Convert an audio file to text using OpenAI Whisper.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body: Upload an audio file via form data
- Supported formats: WAV, MP3, M4A, OGG, FLAC
Response:
```json
{
  "text": "Transcribed text from audio"
}
```

Error Responses:

- `400`: Unsupported file type
- `500`: Whisper model not loaded or processing error
Convert text to an audio file using VoxCPM-0.5B.
Request:
```json
{
  "text": "Text to convert to speech"
}
```

Response:
- Content-Type: audio/wav
- Body: Audio file (WAV format, 16kHz sample rate)
Error Responses:
- `400`: Empty text
- `500`: VoxCPM model not loaded or generation error
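Since the endpoint returns 16 kHz WAV, a client can sanity-check the returned bytes with the stdlib `wave` module. A sketch (the helper name is illustrative):

```python
import io
import wave

def wav_sample_rate(wav_bytes: bytes) -> int:
    # Parse the WAV header from in-memory bytes and return the sample rate in Hz.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getframerate()
```

Applied to the bytes of a generated file, this is expected to report 16000 for output from this endpoint.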
Health Check:
```bash
curl http://localhost:9876/health
```

Audio to Text:

```bash
curl -X POST "http://localhost:9876/audio-to-text" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio_file.wav"
```

Text to Audio:
```bash
curl -X POST "http://localhost:9876/text-to-audio" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a test message"}' \
  --output generated_audio.wav
```

Python:

```python
import requests

# Health check
response = requests.get('http://localhost:9876/health')
print(response.json())

# Audio to text
with open('audio_file.wav', 'rb') as f:
    response = requests.post(
        'http://localhost:9876/audio-to-text',
        files={'file': f}
    )
result = response.json()
print(f"Transcribed text: {result['text']}")

# Text to audio
response = requests.post(
    'http://localhost:9876/text-to-audio',
    json={'text': 'Hello, world! This is a test of the text-to-speech system.'}
)
if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio file saved as output.wav")
```

JavaScript:

```javascript
// Health check
fetch('http://localhost:9876/health')
  .then(response => response.json())
  .then(data => console.log(data));

// Audio to text
const formData = new FormData();
formData.append('file', audioFile);
fetch('http://localhost:9876/audio-to-text', {
  method: 'POST',
  body: formData
})
  .then(response => response.json())
  .then(data => console.log('Transcribed text:', data.text));

// Text to audio
fetch('http://localhost:9876/text-to-audio', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: 'Hello, world!' })
})
  .then(response => response.blob())
  .then(blob => {
    const url = URL.createObjectURL(blob);
    const a = document.createElement('a');
    a.href = url;
    a.download = 'generated_audio.wav';
    a.click();
  });
```

Run all tests:
```bash
pytest
```

Run tests with coverage:

```bash
pytest --cov=.
```

Run specific test file:

```bash
pytest tests/test_api.py -v
```

Run tests in parallel:

```bash
pytest -n auto
```

Test structure:

- `tests/test_api.py` - API endpoint integration tests
- `tests/test_audio_to_text.py` - Audio-to-text functionality tests
- `tests/test_text_to_audio.py` - Text-to-audio functionality tests
- Start the application:
```bash
python main.py
```

- Test health endpoint:

```bash
curl http://localhost:9876/health
```

- Test text-to-speech:

```bash
curl -X POST http://localhost:9876/text-to-audio \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a test of the text-to-audio endpoint"}' \
  -o test_output.wav
```

- Test with sample audio file:

```bash
# Test transcription
curl -X 'POST' http://localhost:9876/audio-to-text \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@test_output.wav;type=audio/wav'
```

Format code:
```bash
ruff format .
```

Lint code:

```bash
ruff check .
```

Fix linting issues automatically:

```bash
ruff check . --fix
```

Type checking:

```bash
mypy .
```

Development workflow:

- Make changes to the code
- Run tests: `pytest`
- Check code quality: `ruff check .`
- Format code: `ruff format .`
- Test manually by running the server
- Commit changes
- Write tests first (TDD approach)
- Implement the feature
- Update documentation
- Ensure all tests pass
- Update CLAUDE.md if needed
- Model: OpenAI Whisper (base model)
- Capabilities: Supports multiple languages and audio formats
- Performance: The base model balances transcription accuracy and speed
- Loading time: ~2-3 seconds on first startup
- Model: VoxCPM-0.5B
- Source: openbmb/VoxCPM-0.5B
- Capabilities: High-quality speech synthesis for English and Chinese
- Parameters:
  - `cfg_value=2.0` for language model guidance
  - `inference_timesteps=10` for quality/speed balance
- Output: 16kHz WAV files
- Loading time: ~10-15 seconds on first startup (downloads model if not cached)
You can configure the application using environment variables:
- `WHISPER_MODEL`: Whisper model size (default: `"base"`, options: `"tiny"`, `"small"`, `"medium"`, `"large"`)
- `TTS_CFG_VALUE`: VoxCPM guidance value (default: `2.0`)
- `TTS_INFERENCE_STEPS`: VoxCPM inference timesteps (default: `10`)
Example:
```bash
export WHISPER_MODEL=small
export TTS_CFG_VALUE=1.5
python main.py
```

Models are automatically downloaded and cached on first use:
- Whisper models: `~/.cache/whisper/`
- VoxCPM models: `~/.cache/modelscope/` or `~/.cache/huggingface/`
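The configuration variables described above can be read with stdlib fallbacks to the documented defaults. A minimal sketch (the function name and return shape are illustrative, not the app's actual code):

```python
import os

def load_config() -> dict:
    # Fall back to the documented defaults when a variable is unset.
    return {
        "whisper_model": os.environ.get("WHISPER_MODEL", "base"),
        "tts_cfg_value": float(os.environ.get("TTS_CFG_VALUE", "2.0")),
        "tts_inference_steps": int(os.environ.get("TTS_INFERENCE_STEPS", "10")),
    }
```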
- First startup: Takes 10-15 seconds to download and load models
- Subsequent startups: Takes 2-3 seconds to load cached models
- Memory usage: ~2-4GB RAM depending on model sizes
- GPU acceleration: Automatically uses GPU if available (CUDA/MPS)
- Concurrent requests: FastAPI handles concurrent requests efficiently
- File cleanup: Temporary files are automatically cleaned up
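The temporary-file cleanup mentioned above typically follows a try/finally pattern. A sketch of what such handling can look like (illustrative, not the app's actual code):

```python
import os
import tempfile

def process_upload(data: bytes) -> int:
    # Write the upload to a named temp file, process it, and always clean up,
    # even if processing raises.
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    try:
        tmp.write(data)
        tmp.close()
        # ... hand tmp.name to the transcription model here ...
        return os.path.getsize(tmp.name)
    finally:
        tmp.close()
        os.unlink(tmp.name)
```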
- Models not loading:
  - Check internet connection for initial model download
  - Ensure sufficient disk space (~2GB for models)
  - Check the logs for specific error messages
- Out of memory errors:
  - Reduce model sizes or use CPU-only mode
  - Close other memory-intensive applications
- Slow performance:
  - Ensure GPU acceleration is working
  - Consider using smaller models for faster inference
- Audio format issues:
  - Ensure audio files are in supported formats
  - Check file corruption if uploads fail
Enable detailed logging:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

Check model status:

```bash
curl http://localhost:9876/health
```

The API returns appropriate HTTP status codes:
- `200` - Success
- `400` - Bad request (invalid file format, empty text, etc.)
- `422` - Validation error (missing required fields)
- `500` - Internal server error (model not loaded, processing errors)
All error responses include detailed error messages in the `detail` field.
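Client code can map these status codes plus the `detail` field to readable messages. A small illustrative helper (names are assumptions, not part of the API; note that `422` responses from FastAPI carry a list of validation errors in `detail`):

```python
STATUS_MESSAGES = {
    200: "Success",
    400: "Bad request",
    422: "Validation error",
    500: "Internal server error",
}

def explain_error(status_code: int, payload: dict) -> str:
    # FastAPI error bodies carry the message in the "detail" field.
    label = STATUS_MESSAGES.get(status_code, "Unexpected status")
    detail = payload.get("detail", "no detail provided")
    return f"{status_code} {label}: {detail}"
```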
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make your changes
- Add tests for new functionality
- Ensure all tests pass: `pytest`
- Ensure code quality: `ruff check .`
- Commit your changes: `git commit -am 'Add feature'`
- Push to the branch: `git push origin feature-name`
- Submit a pull request
Apache 2.0
- OpenAI Whisper for speech-to-text
- VoxCPM for text-to-speech
- FastAPI for the web framework