Insanely Fast Whisper API - Production Ready


An OpenAI-compatible Whisper API pinned to GPU 2 with a 15% memory limit. Features automatic language detection, Flash Attention 2.0, and seamless Open WebUI integration.

✨ Features

  • 🚀 Whisper Large V3 Turbo - 2x faster than regular v3, same accuracy
  • 🌍 Auto Language Detection - 99+ languages supported
  • ⚡ Flash Attention 2.0 - Optimized GPU inference
  • 🎯 OpenAI Compatible - Drop-in replacement for OpenAI Whisper API
  • 💾 Memory Efficient - Limited to 15% of GPU 2 (3.53 GB)
  • 🔄 Auto-restart - Systemd service with automatic recovery
  • 📊 Production Tested - Verified with Spanish and English audio

🚀 Quick Start

1. Clone and Setup

git clone https://github.com/groxaxo/insanely-fast-whisper-api.git
cd insanely-fast-whisper-api

# Create conda environment
conda create -n whisper-api python=3.10 -y
conda activate whisper-api

# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install Flash Attention
pip install wheel ninja packaging
pip install flash-attn==2.5.6 --no-build-isolation

# Install dependencies
pip install -r requirements.txt

2. Start the API

./start_gpu2_limited.sh

The API will be available at http://localhost:8002
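Once it responds, you can also verify reachability from Python. A minimal sketch using the requests library (any HTTP response, even an error status, shows the server is listening):

# check_api.py - quick reachability check for the local API
import requests

resp = requests.get("http://localhost:8002/")
print(resp.status_code)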

3. Install as System Service (Optional)

For automatic startup on boot:

./install_service.sh

📋 Configuration

GPU Settings

  • Device: GPU 2 (via CUDA_VISIBLE_DEVICES=2)
  • Memory Limit: 15% (3.53 GB on RTX 3090)
  • Memory Management: expandable_segments:True
  • Batch Size: 8 (optimized for memory efficiency)
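For reference, here is a minimal sketch of how limits like these are commonly applied in PyTorch. This is illustrative only; the actual wiring lives in start_gpu2_limited.sh and app/app.py:

# gpu_limits.py - illustrative sketch of the GPU constraints listed above
import os

# Pin the process to GPU 2 and enable expandable segments before torch loads
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# Cap this process at 15% of the visible device's memory
# (15% of an RTX 3090's ~24 GB is roughly 3.5 GB)
torch.cuda.set_per_process_memory_fraction(0.15, device=0)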

Model Settings

  • Model: openai/whisper-large-v3-turbo
  • Size: 1.62 GB
  • Precision: FP16
  • Optimization: Flash Attention 2.0
  • Chunk Length: 30 seconds
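Putting these settings together, a hedged sketch of the equivalent 🤗 Transformers pipeline (standard pipeline API; the project's actual setup lives in app/app.py):

# pipeline_sketch.py - ASR pipeline matching the model settings above
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,  # FP16 precision
    device="cuda:0",            # GPU 2 once CUDA_VISIBLE_DEVICES=2 is set
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

# 30-second chunks, batch size 8, as configured above
result = asr("audio.mp3", chunk_length_s=30, batch_size=8)
print(result["text"])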

🔌 API Usage

OpenAI-Compatible Endpoint

curl -X POST http://localhost:8002/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3-turbo"

With Language Specification

curl -X POST http://localhost:8002/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3-turbo" \
  -F "language=es"

Response Format

{
  "text": "Transcribed text here..."
}
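The same request from Python, as a minimal sketch with the requests library (the file name is illustrative):

# transcribe.py - call the OpenAI-compatible endpoint from Python
import requests

with open("audio.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:8002/audio/transcriptions",
        files={"file": f},
        data={"model": "whisper-large-v3-turbo", "language": "es"},  # language is optional
    )

print(resp.json()["text"])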

🌐 Open WebUI Integration

Configuration

Set these environment variables when starting Open WebUI:

STT_ENGINE=openai
STT_OPENAI_API_BASE_URL=http://localhost:8002
STT_OPENAI_API_KEY=dummy
STT_MODEL=whisper-large-v3-turbo

Start Open WebUI

pkill -f "open-webui"

STT_ENGINE=openai \
STT_OPENAI_API_BASE_URL=http://localhost:8002 \
STT_OPENAI_API_KEY=dummy \
STT_MODEL=whisper-large-v3-turbo \
open-webui serve

Docker Configuration

services:
  open-webui:
    environment:
      - STT_ENGINE=openai
      - STT_OPENAI_API_BASE_URL=http://host.docker.internal:8002
      - STT_OPENAI_API_KEY=dummy
      - STT_MODEL=whisper-large-v3-turbo

🧪 Testing

Run Accuracy Tests

python3 test_accuracy.py

Test Endpoint

./test_openwebui_endpoint.sh

Test Results

  • Success Rate: 80% (4/5 files; the fifth file was corrupted)
  • Avg Processing Time: 0.98 seconds
  • Languages: Spanish, English (auto-detected)
  • Transcription Quality: Excellent

See TEST_RESULTS_SUMMARY.md for detailed results.

🔧 System Service Management

Install Service

./install_service.sh

Service Commands

# Check status
sudo systemctl status whisper-api

# Start/Stop/Restart
sudo systemctl start whisper-api
sudo systemctl stop whisper-api
sudo systemctl restart whisper-api

# View logs
sudo journalctl -u whisper-api -f

# Disable autostart
sudo systemctl disable whisper-api

📊 Performance

Metric                 Value
Model Size             1.62 GB
GPU Memory             3.53 GB (15% of RTX 3090)
Avg Processing Time    ~1 second
Processing Speed       0.57 - 2.95 MB/s
Languages Supported    99+
Batch Size             8

🗂️ Project Structure

insanely-fast-whisper-api/
├── app/
│   ├── app.py                      # Main API (modified with OpenAI endpoint)
│   ├── diarization_pipeline.py
│   └── diarize.py
├── start_gpu2_limited.sh           # Startup script
├── whisper-api.service             # Systemd service file
├── install_service.sh              # Service installation script
├── test_accuracy.py                # Accuracy testing script
├── test_openwebui_endpoint.sh      # Endpoint testing script
├── configure_openwebui.sh          # Open WebUI configuration helper
├── README.md                       # This file
├── SETUP_SUMMARY.md                # Initial setup guide
├── OPEN_WEBUI_INTEGRATION.md       # Integration guide
├── TEST_RESULTS_SUMMARY.md         # Test results
├── README_COMPLETE.md              # Complete documentation
├── requirements.txt
└── pyproject.toml

🛠️ Troubleshooting

API Not Starting

# Check if port 8002 is free
ss -tlnp | grep :8002

# Kill existing process
pkill -f "uvicorn app.app:app"

# Restart
./start_gpu2_limited.sh

Out of Memory Errors

The API is configured for 15% memory usage. If you experience OOM errors:

  1. Check GPU memory: nvidia-smi
  2. Ensure no other processes are using GPU 2
  3. The Turbo model + 15% limit should work fine

Transcription Errors

# Check API logs
sudo journalctl -u whisper-api -f

# Or if running manually, check the terminal output

Open WebUI Connection Issues

  1. Verify the API is running: curl http://localhost:8002/
  2. Check the Open WebUI environment variables
  3. If using Docker, use http://host.docker.internal:8002 (on Linux, also add extra_hosts: "host.docker.internal:host-gateway" to the service)

📝 Documentation

  • SETUP_SUMMARY.md - Initial setup guide
  • OPEN_WEBUI_INTEGRATION.md - Open WebUI integration guide
  • TEST_RESULTS_SUMMARY.md - Detailed test results
  • README_COMPLETE.md - Complete documentation

🤝 Contributing

This is a production-optimized fork with:

  • OpenAI-compatible endpoint for Open WebUI
  • GPU 2 configuration with 15% memory limit
  • Whisper Large V3 Turbo model
  • Systemd service for autostart
  • Comprehensive testing and documentation

📄 License

Same license as the original insanely-fast-whisper-api project.

🙏 Credits

Based on insanely-fast-whisper-api

📞 Support

For issues or questions:

  1. Check the troubleshooting section above
  2. Review the documentation files
  3. Check GPU status: nvidia-smi
  4. View logs: sudo journalctl -u whisper-api -f

Status: ✅ Production Ready
Last Updated: 2025-11-07
Model: Whisper Large V3 Turbo
GPU: NVIDIA GeForce RTX 3090 (GPU 2, 15% memory)

Original Upstream Documentation

An API to transcribe audio with OpenAI's Whisper Large v3! Powered by 🤗 Transformers, Optimum & flash-attn.

Features:

  • 🎤 Transcribe audio to text at blazing fast speeds
  • 📖 Fully open source and deployable on any GPU cloud provider
  • 🗣️ Built-in speaker diarization
  • ⚡ Easy-to-use and fast API layer (FastAPI)
  • 📃 Async background tasks and webhooks
  • 🔥 Optimized for concurrency and parallel processing
  • ✅ Task management, cancel and status endpoints
  • 🔒 Admin authentication for secure API access
  • 🧩 Fully managed API available on JigsawStack

Based on the Insanely Fast Whisper CLI project. Check it out if you'd like to set up this project locally or understand the background of insanely-fast-whisper.

This project focuses on providing a blazing-fast Whisper API, deployable with Docker on GPU cloud infrastructure for scalable production use cases.

With Fly.io's recent GPU service launch, I've set up the fly config file for easy deployment on Fly machines! However, you can deploy this on any other VM environment that supports GPUs and Docker.

Here are some benchmarks we ran on an Nvidia A100 - 80GB with fly.io GPU infra 👇

Optimization type                                                                                        Time to Transcribe (150 mins of Audio)
large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2)                                       1 min 38 sec
large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2 + diarization)                         3 min 16 sec
large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2 + fly machine startup)                 1 min 58 sec
large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2 + diarization + fly machine startup)   3 min 36 sec

The estimated startup time for a Fly machine with a GPU, including loading the model, is around 20 seconds. The rest of the time is spent on the actual computation.

Docker image

yoeven/insanely-fast-whisper-api:latest

Docker hub: yoeven/insanely-fast-whisper-api

Deploying to Fly

  • Make sure you already have access to Fly GPUs.
  • Clone the project locally and open a terminal in the root
  • Rename the app name in the fly.toml if you like
  • Remove image = 'yoeven/insanely-fast-whisper-api:latest' from fly.toml only if you want to rebuild the image from the Dockerfile

Install the fly CLI if you don't already have it.

You only need to run this the first time you launch a new fly app:

fly launch

  • Fly will prompt: Would you like to copy its configuration to the new app? (y/N). Answer Yes (y) to copy the configuration from the repo.

  • Fly will prompt: Do you want to tweak these settings before proceeding? Most of the required settings are already configured in the fly.toml file, so answer No (n) to proceed and deploy.

The first deploy will take some time since the image is huge. Subsequent deploys will be a lot faster.

Run the following if you want to set up speaker diarization or an auth token to secure your API:

fly secrets set ADMIN_KEY=<your_token> HF_TOKEN=<your_hf_key>

Run fly secrets list to check if the secrets exist.

To get the Hugging Face token for speaker diarization, do the following:

  1. Accept pyannote/segmentation-3.0 user conditions
  2. Accept pyannote/speaker-diarization-3.1 user conditions
  3. Create an access token at hf.co/settings/tokens.

Your API should look something like this:

https://insanely-fast-whisper-api.fly.dev

Run fly logs -a insanely-fast-whisper-api to view your fly machine's logs in real time.

Deploying to other cloud providers

Since this is a dockerized app, you can deploy it to any cloud provider that supports docker and GPUs with a few config tweaks.

Fully managed and scalable API

JigsawStack provides a bunch of powerful APIs for various use cases while keeping costs low. This project is available as a fully managed API here with enhanced cloud scalability for cost efficiency and high uptime. Sign up here for free!

API usage

Authentication

If you set up the ADMIN_KEY environment secret, you'll need to pass an x-admin-api-key header with the value of the key you previously set.

Endpoints

Base URL

If deployed on Fly, the base URL should look something like this:

https://{app_name}.fly.dev/{path}

Depending on the cloud provider you deploy to, the base URL will be different.

POST /

Transcribe or translate audio into text

Body params (JSON)
Name              Value                                                                Default
url               (Required) URL of the audio                                         -
task              transcribe, translate                                                transcribe
language          None, en, or another language code; None auto-detects the language   None
batch_size        Number of parallel batches to compute; reduce if you face OOMs      64
timestamp         chunk, word                                                          chunk
diarise_audio     Diarise the audio clips by speaker; requires hf_token to be set     false
webhook           Webhook POST call on completion or error                             None
webhook.url       URL to send the webhook to                                           -
webhook.header    Headers to send with the webhook                                     -
is_async          Run the task in the background and send results to the webhook URL; true, false   false
managed_task_id   Custom task ID used to reference the ongoing task                    a UUID v4 generated per transcription task
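For example, an async request with a webhook might look like this. A sketch only: the URLs, key, and webhook header are placeholders, and the nested webhook object follows the webhook.url / webhook.header params above:

# submit_task.py - sketch of an async POST / request to a deployed instance
import requests

payload = {
    "url": "https://example.com/audio.mp3",    # placeholder audio URL
    "task": "transcribe",
    "batch_size": 32,                          # lower than the default 64 if you face OOMs
    "timestamp": "chunk",
    "is_async": True,
    "webhook": {
        "url": "https://example.com/hook",     # placeholder webhook receiver
        "header": {"x-callback-token": "secret"},
    },
}

resp = requests.post(
    "https://insanely-fast-whisper-api.fly.dev/",
    json=payload,
    headers={"x-admin-api-key": "<your_admin_key>"},  # only if ADMIN_KEY is set
)
print(resp.json())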

GET /tasks

Get all active transcription tasks, both async background tasks and ongoing tasks

GET /status/{task_id}

Get the status of a task. Completed tasks are removed from the list, so querying a finished task may return an error.

DELETE /cancel/{task_id}

Cancel an async background task. Only transcription jobs created with is_async set to true can be cancelled.
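A sketch of the task-management flow against these endpoints (base URL, task ID, and admin key are placeholders):

# manage_tasks.py - list, poll, and cancel tasks via the endpoints above
import requests

BASE = "https://insanely-fast-whisper-api.fly.dev"
HEADERS = {"x-admin-api-key": "<your_admin_key>"}  # only if ADMIN_KEY is set

# All active tasks, both async background and ongoing
print(requests.get(f"{BASE}/tasks", headers=HEADERS).json())

# Status of a single task; completed tasks drop off the list, so this may error
task_id = "<task_id>"
print(requests.get(f"{BASE}/status/{task_id}", headers=HEADERS).json())

# Cancel an async task (only jobs created with is_async set to true)
print(requests.delete(f"{BASE}/cancel/{task_id}", headers=HEADERS).json())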

Running locally

# clone the repo
$ git clone https://github.com/jigsawstack/insanely-fast-whisper-api.git

# change the working directory
$ cd insanely-fast-whisper-api

# install torch
$ pip3 install torch torchvision torchaudio

# upgrade wheel and install required packages for FlashAttention
$ pip3 install -U wheel && pip3 install ninja packaging

# install FlashAttention
$ pip3 install flash-attn --no-build-isolation

# generate updated requirements.txt if you want to use other management tools (Optional)
$ poetry export --output requirements.txt

# get the path of python
$ which python3

# setup virtual environment 
$ poetry env use /full/path/to/python

# install the requirements
$ poetry install

# run the app
$ uvicorn app.app:app --reload

Extra

Shutting down fly machine programmatically

Fly machines are charged by the second and may idle for up to 15 minutes before shutting themselves down. You can shut down the machine when you're done with the API to save costs by sending a POST request to the following endpoint:

https://api.machines.dev/v1/apps/<app_name>/machines/<machine_id>/stop

Authorization header:

Authorization: Bearer <fly_token>
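As a sketch in Python (app name, machine ID, and token are placeholders; fly machines list shows your machine IDs):

# stop_machine.py - stop a Fly machine via the Machines API to save costs
import requests

app_name = "insanely-fast-whisper-api"  # your fly app name
machine_id = "<machine_id>"             # from `fly machines list`
fly_token = "<fly_token>"               # from `fly auth token`

resp = requests.post(
    f"https://api.machines.dev/v1/apps/{app_name}/machines/{machine_id}/stop",
    headers={"Authorization": f"Bearer {fly_token}"},
)
print(resp.status_code)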

Learn more here.

Acknowledgements

  1. Vaibhav Srivastav for writing a huge chunk of the code and the CLI version of this project.
  2. OpenAI Whisper

JigsawStack

This project is part of JigsawStack - a suite of powerful, developer-friendly APIs for various use cases while keeping costs low. Sign up here for free!
