williamcaban/speakify

Speakify - FastAPI Audio Processing Service

A FastAPI application that provides audio processing capabilities for converting audio files to text (speech-to-text) and text to audio files (text-to-speech).

Features

  • 🎵 Audio to Text: Convert audio files to text using OpenAI Whisper
  • 🔊 Text to Audio: Convert text to audio files using VoxCPM-0.5B model
  • 📊 Health Monitoring: Health check endpoint to monitor service status
  • 🚀 Fast API: Built with FastAPI for high performance and automatic API documentation
  • 🧪 Comprehensive Testing: Full test suite with pytest

Supported Audio Formats

  • WAV
  • MP3
  • M4A
  • OGG
  • FLAC
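
Clients can pre-check a file's extension against this list before uploading. A minimal sketch (the is_supported helper is hypothetical, not part of the Speakify codebase):

```python
from pathlib import Path

# Extensions accepted by the /audio-to-text endpoint, per the list above.
SUPPORTED_FORMATS = {".wav", ".mp3", ".m4a", ".ogg", ".flac"}

def is_supported(filename: str) -> bool:
    """Return True if the file extension is one the API accepts."""
    return Path(filename).suffix.lower() in SUPPORTED_FORMATS

print(is_supported("speech.WAV"))  # True (match is case-insensitive)
print(is_supported("notes.txt"))   # False (the API would answer 400)
```

Checking up front avoids a round trip that would end in a 400 Unsupported file type response.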

Requirements

  • Python 3.12+
  • uv package manager

Installation

  1. Clone the repository:
git clone <repository-url>
cd speakify
  2. Install dependencies:
uv sync
  3. Activate the virtual environment:
source .venv/bin/activate

Running the Application

Development Server

uvicorn main:app --reload

Production Server

uvicorn main:app --host 0.0.0.0 --port 9876

Using the provided script

python main.py

The application will be available at http://localhost:9876 (note that the development server command above defaults to port 8000 unless you pass --port 9876). FastAPI also serves interactive API documentation at /docs.

API Endpoints

GET /

Returns basic API information.

Response:

{
  "message": "Speakify API - Audio to Text and Text to Audio"
}

GET /health

Health check endpoint that returns service and model status.

Response:

{
  "status": "healthy",
  "whisper_loaded": true,
  "tts_loaded": true
}

POST /audio-to-text

Convert an audio file to text using OpenAI Whisper.

Request:

  • Method: POST
  • Content-Type: multipart/form-data
  • Body: Upload an audio file via form data
  • Supported formats: WAV, MP3, M4A, OGG, FLAC

Response:

{
  "text": "Transcribed text from audio"
}

Error Responses:

  • 400: Unsupported file type
  • 500: Whisper model not loaded or processing error

POST /text-to-audio

Convert text to an audio file using VoxCPM-0.5B.

Request:

{
  "text": "Text to convert to speech"
}

Response:

  • Content-Type: audio/wav
  • Body: Audio file (WAV format, 16kHz sample rate)

Error Responses:

  • 400: Empty text
  • 500: VoxCPM model not loaded or generation error
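
The empty-text check can be mirrored client-side before sending the request. A sketch using only the standard library (build_tts_request is a hypothetical helper, not part of Speakify):

```python
import json

def build_tts_request(text: str) -> bytes:
    """Serialize a /text-to-audio request body, rejecting empty or
    whitespace-only text up front (the server would answer 400 for it)."""
    if not text.strip():
        raise ValueError("text must be non-empty")
    return json.dumps({"text": text}).encode("utf-8")

body = build_tts_request("Hello, world!")
print(body)  # b'{"text": "Hello, world!"}'
```

The resulting bytes can be posted with Content-Type: application/json, as shown in the curl examples below.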

Example Usage

Using curl

Health Check:

curl http://localhost:9876/health

Audio to Text:

curl -X POST "http://localhost:9876/audio-to-text" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio_file.wav"

Text to Audio:

curl -X POST "http://localhost:9876/text-to-audio" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a test message"}' \
  --output generated_audio.wav

Using Python requests

import requests

# Health check
response = requests.get('http://localhost:9876/health')
print(response.json())

# Audio to text
with open('audio_file.wav', 'rb') as f:
    response = requests.post(
        'http://localhost:9876/audio-to-text',
        files={'file': f}
    )
    result = response.json()
    print(f"Transcribed text: {result['text']}")

# Text to audio
response = requests.post(
    'http://localhost:9876/text-to-audio',
    json={'text': 'Hello, world! This is a test of the text-to-speech system.'}
)
if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio file saved as output.wav")

Using JavaScript/Fetch

// Health check
fetch('http://localhost:9876/health')
  .then(response => response.json())
  .then(data => console.log(data));

// Audio to text
const formData = new FormData();
formData.append('file', audioFile);
fetch('http://localhost:9876/audio-to-text', {
  method: 'POST',
  body: formData
})
.then(response => response.json())
.then(data => console.log('Transcribed text:', data.text));

// Text to audio
fetch('http://localhost:9876/text-to-audio', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: 'Hello, world!' })
})
.then(response => response.blob())
.then(blob => {
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = 'generated_audio.wav';
  a.click();
});

Testing

Running Tests

Run all tests:

pytest

Run tests with coverage:

pytest --cov=.

Run specific test file:

pytest tests/test_api.py -v

Run tests in parallel:

pytest -n auto

Test Structure

  • tests/test_api.py - API endpoint integration tests
  • tests/test_audio_to_text.py - Audio-to-text functionality tests
  • tests/test_text_to_audio.py - Text-to-audio functionality tests

Manual Testing

  1. Start the application:
python main.py
  2. Test health endpoint:
curl http://localhost:9876/health
  3. Test text-to-speech:
curl -X POST http://localhost:9876/text-to-audio \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a test of the text-to-audio endpoint"}' \
  -o test_output.wav
  4. Test with sample audio file:
# Test transcription
curl -X 'POST' http://localhost:9876/audio-to-text \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@test_output.wav;type=audio/wav'

Development

Code Quality

Format code:

ruff format .

Lint code:

ruff check .

Fix linting issues automatically:

ruff check . --fix

Type checking:

mypy .

Development Workflow

  1. Make changes to the code
  2. Run tests: pytest
  3. Check code quality: ruff check .
  4. Format code: ruff format .
  5. Test manually by running the server
  6. Commit changes

Adding New Features

  1. Write tests first (TDD approach)
  2. Implement the feature
  3. Update documentation
  4. Ensure all tests pass
  5. Update CLAUDE.md if needed

Models

Speech-to-Text (STT)

  • Model: OpenAI Whisper (base model)
  • Capabilities: Supports multiple languages and audio formats
  • Performance: The base model offers a practical balance of transcription accuracy and inference speed
  • Loading time: ~2-3 seconds on first startup

Text-to-Speech (TTS)

  • Model: VoxCPM-0.5B
  • Source: openbmb/VoxCPM-0.5B
  • Capabilities: High-quality speech synthesis for English and Chinese
  • Parameters:
    • cfg_value=2.0 for language model guidance
    • inference_timesteps=10 for quality/speed balance
  • Output: 16kHz WAV files
  • Loading time: ~10-15 seconds on first startup (downloads model if not cached)

Configuration

Environment Variables

You can configure the application using environment variables:

  • WHISPER_MODEL: Whisper model size (default: "base"; options: "tiny", "base", "small", "medium", "large")
  • TTS_CFG_VALUE: VoxCPM guidance value (default: 2.0)
  • TTS_INFERENCE_STEPS: VoxCPM inference timesteps (default: 10)

Example:

export WHISPER_MODEL=small
export TTS_CFG_VALUE=1.5
python main.py
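
Inside the application, these variables are typically read with sensible fallbacks; a sketch of what that might look like (the actual handling in main.py may differ):

```python
import os

# Read the documented environment variables, falling back to the
# defaults listed above when a variable is unset.
WHISPER_MODEL = os.environ.get("WHISPER_MODEL", "base")
TTS_CFG_VALUE = float(os.environ.get("TTS_CFG_VALUE", "2.0"))
TTS_INFERENCE_STEPS = int(os.environ.get("TTS_INFERENCE_STEPS", "10"))

print(f"model={WHISPER_MODEL} cfg={TTS_CFG_VALUE} steps={TTS_INFERENCE_STEPS}")
```

Parsing to float/int at startup surfaces a malformed value immediately (a ValueError) rather than mid-request.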

Model Caching

Models are automatically downloaded and cached on first use:

  • Whisper models: ~/.cache/whisper/
  • VoxCPM models: ~/.cache/modelscope/ or ~/.cache/huggingface/
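
To predict whether the next startup will need a download, you can check whether those cache directories already exist. A hypothetical helper (not part of the Speakify codebase):

```python
from pathlib import Path

# Cache locations listed above; presence suggests the model is already
# downloaded and the next startup will skip the download step.
CACHE_DIRS = {
    "whisper": Path.home() / ".cache" / "whisper",
    "voxcpm (modelscope)": Path.home() / ".cache" / "modelscope",
    "voxcpm (huggingface)": Path.home() / ".cache" / "huggingface",
}

def cache_status() -> dict[str, bool]:
    """Map each cache location name to whether it exists on disk."""
    return {name: path.is_dir() for name, path in CACHE_DIRS.items()}

for name, present in cache_status().items():
    print(f"{name}: {'cached' if present else 'will download on first use'}")
```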

Performance Considerations

  • First startup: Takes 10-15 seconds to download and load models
  • Subsequent startups: Takes 2-3 seconds to load cached models
  • Memory usage: ~2-4GB RAM depending on model sizes
  • GPU acceleration: Automatically uses GPU if available (CUDA/MPS)
  • Concurrent requests: FastAPI handles concurrent requests efficiently
  • File cleanup: Temporary files are automatically cleaned up
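
The temporary-file cleanup mentioned above usually follows a try/finally pattern so the file is removed even when processing fails. A standard-library sketch (not the actual Speakify code; the model call is elided):

```python
import os
import tempfile

def process_upload(data: bytes) -> int:
    """Write an upload to a temp file, process it, and always clean up."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # ... hand `path` to the transcription model here ...
        return os.path.getsize(path)
    finally:
        os.remove(path)  # runs even if processing raises

print(process_upload(b"RIFF....WAVE"))  # 12
```

The finally block guarantees no stray .wav files accumulate under the system temp directory across failed requests.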

Troubleshooting

Common Issues

  1. Models not loading:

    • Check internet connection for initial model download
    • Ensure sufficient disk space (~2GB for models)
    • Check the logs for specific error messages
  2. Out of memory errors:

    • Reduce model sizes or use CPU-only mode
    • Close other memory-intensive applications
  3. Slow performance:

    • Ensure GPU acceleration is working
    • Consider using smaller models for faster inference
  4. Audio format issues:

    • Ensure audio files are in supported formats
    • Check file corruption if uploads fail

Debugging

Enable detailed logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Check model status:

curl http://localhost:9876/health

Error Handling

The API returns appropriate HTTP status codes:

  • 200 - Success
  • 400 - Bad request (invalid file format, empty text, etc.)
  • 422 - Validation error (missing required fields)
  • 500 - Internal server error (model not loaded, processing errors)

All error responses include detailed error messages in the detail field.
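
Client code can lean on that detail field when a request fails. A stdlib-only sketch with a simulated response body (describe_error is a hypothetical helper):

```python
import json

def describe_error(status_code: int, body: bytes) -> str:
    """Turn an error response into a readable message using the
    detail field the API includes on failures."""
    try:
        detail = json.loads(body).get("detail", "unknown error")
    except json.JSONDecodeError:
        detail = body.decode(errors="replace")
    return f"HTTP {status_code}: {detail}"

# Simulated 400 response for an unsupported upload:
print(describe_error(400, b'{"detail": "Unsupported file type"}'))
```

Note that for 422 validation errors FastAPI's detail is a list of field errors rather than a string, so production code may want to format that case separately.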

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass: pytest
  6. Ensure code quality: ruff check .
  7. Commit your changes: git commit -am 'Add feature'
  8. Push to the branch: git push origin feature-name
  9. Submit a pull request

License

Apache 2.0
