williamcaban/speakify

Speakify - FastAPI Audio Processing Service

A FastAPI application that provides audio processing capabilities for converting audio files to text (speech-to-text) and text to audio files (text-to-speech).

Features

  • 🎵 Audio to Text: Convert audio files to text using OpenAI Whisper
  • 🔊 Text to Audio: Convert text to audio files using VoxCPM-0.5B model
  • 📊 Health Monitoring: Health check endpoint to monitor service status
  • 🚀 Fast API: Built with FastAPI for high performance and automatic API documentation
  • 🧪 Comprehensive Testing: Full test suite with pytest

Supported Audio Formats

  • WAV
  • MP3
  • M4A
  • OGG
  • FLAC
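
Clients can pre-check a file's extension against this list before uploading. A minimal sketch (the is_supported helper is hypothetical, not part of the Speakify codebase):

```python
from pathlib import Path

# Extensions accepted by the /audio-to-text endpoint, per the list above.
SUPPORTED_FORMATS = {".wav", ".mp3", ".m4a", ".ogg", ".flac"}

def is_supported(filename: str) -> bool:
    """Return True if the file extension is one the API accepts."""
    return Path(filename).suffix.lower() in SUPPORTED_FORMATS

print(is_supported("speech.WAV"))  # True (match is case-insensitive)
print(is_supported("notes.txt"))   # False (the API would answer 400)
```

Checking up front avoids a round trip that would end in a 400 Unsupported file type response.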

Requirements

  • Python 3.12+
  • uv package manager

Installation

  1. Clone the repository:
git clone <repository-url>
cd speakify
  2. Install dependencies:
uv sync
  3. Activate the virtual environment:
source .venv/bin/activate

Running the Application

Development Server

uvicorn main:app --reload

Production Server

uvicorn main:app --host 0.0.0.0 --port 9876

Using the provided script

python main.py

The application will be available at http://localhost:9876 (note that the development server command above defaults to port 8000 unless you pass --port 9876). FastAPI also serves interactive API documentation at /docs.

API Endpoints

GET /

Returns basic API information.

Response:

{
  "message": "Speakify API - Audio to Text and Text to Audio"
}

GET /health

Health check endpoint that returns service and model status.

Response:

{
  "status": "healthy",
  "whisper_loaded": true,
  "tts_loaded": true
}

POST /audio-to-text

Convert an audio file to text using OpenAI Whisper.

Request:

  • Method: POST
  • Content-Type: multipart/form-data
  • Body: Upload an audio file via form data
  • Supported formats: WAV, MP3, M4A, OGG, FLAC

Response:

{
  "text": "Transcribed text from audio"
}

Error Responses:

  • 400: Unsupported file type
  • 500: Whisper model not loaded or processing error

POST /text-to-audio

Convert text to an audio file using VoxCPM-0.5B.

Request:

{
  "text": "Text to convert to speech"
}

Response:

  • Content-Type: audio/wav
  • Body: Audio file (WAV format, 16kHz sample rate)

Error Responses:

  • 400: Empty text
  • 500: VoxCPM model not loaded or generation error
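
The empty-text check can be mirrored client-side before sending the request. A sketch using only the standard library (build_tts_request is a hypothetical helper, not part of Speakify):

```python
import json

def build_tts_request(text: str) -> bytes:
    """Serialize a /text-to-audio request body, rejecting empty or
    whitespace-only text up front (the server would answer 400 for it)."""
    if not text.strip():
        raise ValueError("text must be non-empty")
    return json.dumps({"text": text}).encode("utf-8")

body = build_tts_request("Hello, world!")
print(body)  # b'{"text": "Hello, world!"}'
```

The resulting bytes can be posted with Content-Type: application/json, as shown in the curl examples below.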

Example Usage

Using curl

Health Check:

curl http://localhost:9876/health

Audio to Text:

curl -X POST "http://localhost:9876/audio-to-text" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio_file.wav"

Text to Audio:

curl -X POST "http://localhost:9876/text-to-audio" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a test message"}' \
  --output generated_audio.wav

Using Python requests

import requests

# Health check
response = requests.get('http://localhost:9876/health')
print(response.json())

# Audio to text
with open('audio_file.wav', 'rb') as f:
    response = requests.post(
        'http://localhost:9876/audio-to-text',
        files={'file': f}
    )
    result = response.json()
    print(f"Transcribed text: {result['text']}")

# Text to audio
response = requests.post(
    'http://localhost:9876/text-to-audio',
    json={'text': 'Hello, world! This is a test of the text-to-speech system.'}
)
if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio file saved as output.wav")

Using JavaScript/Fetch

// Health check
fetch('http://localhost:9876/health')
  .then(response => response.json())
  .then(data => console.log(data));

// Audio to text
const formData = new FormData();
formData.append('file', audioFile);
fetch('http://localhost:9876/audio-to-text', {
  method: 'POST',
  body: formData
})
.then(response => response.json())
.then(data => console.log('Transcribed text:', data.text));

// Text to audio
fetch('http://localhost:9876/text-to-audio', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: 'Hello, world!' })
})
.then(response => response.blob())
.then(blob => {
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = 'generated_audio.wav';
  a.click();
});

Testing

Running Tests

Run all tests:

pytest

Run tests with coverage:

pytest --cov=.

Run specific test file:

pytest tests/test_api.py -v

Run tests in parallel:

pytest -n auto

Test Structure

  • tests/test_api.py - API endpoint integration tests
  • tests/test_audio_to_text.py - Audio-to-text functionality tests
  • tests/test_text_to_audio.py - Text-to-audio functionality tests

Manual Testing

  1. Start the application:
python main.py
  2. Test health endpoint:
curl http://localhost:9876/health
  3. Test text-to-speech:
curl -X POST http://localhost:9876/text-to-audio \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a test of the text-to-audio endpoint"}' \
  -o test_output.wav
  4. Test with sample audio file:
# Test transcription
curl -X 'POST' http://localhost:9876/audio-to-text \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@test_output.wav;type=audio/wav'

Development

Code Quality

Format code:

ruff format .

Lint code:

ruff check .

Fix linting issues automatically:

ruff check . --fix

Type checking:

mypy .

Development Workflow

  1. Make changes to the code
  2. Run tests: pytest
  3. Check code quality: ruff check .
  4. Format code: ruff format .
  5. Test manually by running the server
  6. Commit changes

Adding New Features

  1. Write tests first (TDD approach)
  2. Implement the feature
  3. Update documentation
  4. Ensure all tests pass
  5. Update CLAUDE.md if needed

Models

Speech-to-Text (STT)

  • Model: OpenAI Whisper (base model)
  • Capabilities: Supports multiple languages and audio formats
  • Performance: The base model offers a practical balance of transcription accuracy and inference speed
  • Loading time: ~2-3 seconds on first startup

Text-to-Speech (TTS)

  • Model: VoxCPM-0.5B
  • Source: openbmb/VoxCPM-0.5B
  • Capabilities: High-quality speech synthesis for English and Chinese
  • Parameters:
    • cfg_value=2.0 for language model guidance
    • inference_timesteps=10 for quality/speed balance
  • Output: 16kHz WAV files
  • Loading time: ~10-15 seconds on first startup (downloads model if not cached)

Configuration

Environment Variables

You can configure the application using environment variables:

  • WHISPER_MODEL: Whisper model size (default: "base"; options: "tiny", "base", "small", "medium", "large")
  • TTS_CFG_VALUE: VoxCPM guidance value (default: 2.0)
  • TTS_INFERENCE_STEPS: VoxCPM inference timesteps (default: 10)

Example:

export WHISPER_MODEL=small
export TTS_CFG_VALUE=1.5
python main.py
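
Inside the application, these variables are typically read with sensible fallbacks; a sketch of what that might look like (the actual handling in main.py may differ):

```python
import os

# Read the documented environment variables, falling back to the
# defaults listed above when a variable is unset.
WHISPER_MODEL = os.environ.get("WHISPER_MODEL", "base")
TTS_CFG_VALUE = float(os.environ.get("TTS_CFG_VALUE", "2.0"))
TTS_INFERENCE_STEPS = int(os.environ.get("TTS_INFERENCE_STEPS", "10"))

print(f"model={WHISPER_MODEL} cfg={TTS_CFG_VALUE} steps={TTS_INFERENCE_STEPS}")
```

Parsing to float/int at startup surfaces a malformed value immediately (a ValueError) rather than mid-request.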

Model Caching

Models are automatically downloaded and cached on first use:

  • Whisper models: ~/.cache/whisper/
  • VoxCPM models: ~/.cache/modelscope/ or ~/.cache/huggingface/
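
To predict whether the next startup will need a download, you can check whether those cache directories already exist. A hypothetical helper (not part of the Speakify codebase):

```python
from pathlib import Path

# Cache locations listed above; presence suggests the model is already
# downloaded and the next startup will skip the download step.
CACHE_DIRS = {
    "whisper": Path.home() / ".cache" / "whisper",
    "voxcpm (modelscope)": Path.home() / ".cache" / "modelscope",
    "voxcpm (huggingface)": Path.home() / ".cache" / "huggingface",
}

def cache_status() -> dict[str, bool]:
    """Map each cache location name to whether it exists on disk."""
    return {name: path.is_dir() for name, path in CACHE_DIRS.items()}

for name, present in cache_status().items():
    print(f"{name}: {'cached' if present else 'will download on first use'}")
```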

Performance Considerations

  • First startup: Takes 10-15 seconds to download and load models
  • Subsequent startups: Takes 2-3 seconds to load cached models
  • Memory usage: ~2-4GB RAM depending on model sizes
  • GPU acceleration: Automatically uses GPU if available (CUDA/MPS)
  • Concurrent requests: FastAPI handles concurrent requests efficiently
  • File cleanup: Temporary files are automatically cleaned up
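
The temporary-file cleanup mentioned above usually follows a try/finally pattern so the file is removed even when processing fails. A standard-library sketch (not the actual Speakify code; the model call is elided):

```python
import os
import tempfile

def process_upload(data: bytes) -> int:
    """Write an upload to a temp file, process it, and always clean up."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # ... hand `path` to the transcription model here ...
        return os.path.getsize(path)
    finally:
        os.remove(path)  # runs even if processing raises

print(process_upload(b"RIFF....WAVE"))  # 12
```

The finally block guarantees no stray .wav files accumulate under the system temp directory across failed requests.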

Troubleshooting

Common Issues

  1. Models not loading:

    • Check internet connection for initial model download
    • Ensure sufficient disk space (~2GB for models)
    • Check the logs for specific error messages
  2. Out of memory errors:

    • Reduce model sizes or use CPU-only mode
    • Close other memory-intensive applications
  3. Slow performance:

    • Ensure GPU acceleration is working
    • Consider using smaller models for faster inference
  4. Audio format issues:

    • Ensure audio files are in supported formats
    • Check file corruption if uploads fail

Debugging

Enable detailed logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Check model status:

curl http://localhost:9876/health

Error Handling

The API returns appropriate HTTP status codes:

  • 200 - Success
  • 400 - Bad request (invalid file format, empty text, etc.)
  • 422 - Validation error (missing required fields)
  • 500 - Internal server error (model not loaded, processing errors)

All error responses include detailed error messages in the detail field.
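
Client code can lean on that detail field when a request fails. A stdlib-only sketch with a simulated response body (describe_error is a hypothetical helper):

```python
import json

def describe_error(status_code: int, body: bytes) -> str:
    """Turn an error response into a readable message using the
    detail field the API includes on failures."""
    try:
        detail = json.loads(body).get("detail", "unknown error")
    except json.JSONDecodeError:
        detail = body.decode(errors="replace")
    return f"HTTP {status_code}: {detail}"

# Simulated 400 response for an unsupported upload:
print(describe_error(400, b'{"detail": "Unsupported file type"}'))
```

Note that for 422 validation errors FastAPI's detail is a list of field errors rather than a string, so production code may want to format that case separately.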

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass: pytest
  6. Ensure code quality: ruff check .
  7. Commit your changes: git commit -am 'Add feature'
  8. Push to the branch: git push origin feature-name
  9. Submit a pull request

License

Apache 2.0
