An OpenAI-compatible Whisper API optimized for GPU 2 with a 15% memory limit. Features automatic language detection, Flash Attention 2.0, and seamless Open WebUI integration.
- 🚀 Whisper Large V3 Turbo - 2x faster than regular v3, same accuracy
- 🌍 Auto Language Detection - 99+ languages supported
- ⚡ Flash Attention 2.0 - Optimized GPU inference
- 🎯 OpenAI Compatible - Drop-in replacement for OpenAI Whisper API
- 💾 Memory Efficient - Limited to 15% of GPU 2 (3.53 GB)
- 🔄 Auto-restart - Systemd service with automatic recovery
- 📊 Production Tested - Verified with Spanish and English audio
git clone https://github.com/groxaxo/insanely-fast-whisper-api.git
cd insanely-fast-whisper-api
# Create conda environment
conda create -n whisper-api python=3.10 -y
conda activate whisper-api
# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install Flash Attention
pip install wheel ninja packaging
pip install flash-attn==2.5.6 --no-build-isolation
# Install dependencies
pip install -r requirements.txt

# Start the API
./start_gpu2_limited.sh

The API will be available at http://localhost:8002
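If the server fails to start, a quick standalone check (not part of the repo) can confirm that PyTorch sees a CUDA GPU and that flash-attn imports cleanly:

```python
# Quick environment sanity check before starting the server (illustrative only).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Visible device 0:", torch.cuda.get_device_name(0))

try:
    import flash_attn
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not installed")
```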
For automatic startup on boot:
./install_service.sh

GPU and model configuration (a pipeline sketch follows this list):

- Device: GPU 2 (via CUDA_VISIBLE_DEVICES=2)
- Memory Limit: 15% (3.53 GB on RTX 3090)
- Memory Management: expandable_segments:True
- Batch Size: 8 (optimized for memory efficiency)
- Model: openai/whisper-large-v3-turbo
- Size: 1.62 GB
- Precision: FP16
- Optimization: Flash Attention 2.0
- Chunk Length: 30 seconds
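For illustration, here is a minimal sketch of how these settings could map onto a 🤗 Transformers ASR pipeline. This is an assumption-laden sketch, not the contents of app/app.py; in particular, the per-process memory-fraction call is one possible way to enforce the 15% limit, and the file name is a placeholder:

```python
import os

# These are normally exported by the start script; set here (before importing torch)
# only for illustration.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2")
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch
from transformers import pipeline

# Cap this process at ~15% of the visible card's memory (assumed mechanism for the limit).
# With CUDA_VISIBLE_DEVICES=2, GPU 2 appears as device 0 inside the process.
torch.cuda.set_per_process_memory_fraction(0.15, device=0)

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,                                    # FP16 precision
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},    # Flash Attention 2.0
)

# 30-second chunks, batch size 8, matching the configuration above.
result = asr("audio.mp3", chunk_length_s=30, batch_size=8)
print(result["text"])
```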
curl -X POST http://localhost:8002/audio/transcriptions \
-F "[email protected]" \
-F "model=whisper-large-v3-turbo"curl -X POST http://localhost:8002/audio/transcriptions \
-F "[email protected]" \
-F "model=whisper-large-v3-turbo" \
-F "language=es"{
"text": "Transcribed text here..."
}Set these environment variables when starting Open WebUI:
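The same request from Python, as a minimal sketch using the requests library (the file name is a placeholder, and the language field is optional; the endpoint mirrors the curl call above):

```python
# Illustrative sketch: POST an audio file to the local OpenAI-compatible endpoint.
import requests

with open("audio.mp3", "rb") as f:  # placeholder file name
    resp = requests.post(
        "http://localhost:8002/audio/transcriptions",
        files={"file": f},
        data={"model": "whisper-large-v3-turbo", "language": "es"},  # language is optional
    )

resp.raise_for_status()
print(resp.json()["text"])
```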
Set these environment variables when starting Open WebUI:

STT_ENGINE=openai
STT_OPENAI_API_BASE_URL=http://localhost:8002
STT_OPENAI_API_KEY=dummy
STT_MODEL=whisper-large-v3-turbo

# Restart Open WebUI
pkill -f "open-webui"
STT_ENGINE=openai \
STT_OPENAI_API_BASE_URL=http://localhost:8002 \
STT_OPENAI_API_KEY=dummy \
STT_MODEL=whisper-large-v3-turbo \
open-webui serve

Or with Docker Compose:

services:
open-webui:
environment:
- STT_ENGINE=openai
- STT_OPENAI_API_BASE_URL=http://host.docker.internal:8002
- STT_OPENAI_API_KEY=dummy
- STT_MODEL=whisper-large-v3-turbo

# Run the accuracy test
python3 test_accuracy.py

# Test the Open WebUI endpoint
./test_openwebui_endpoint.sh

Results from testing:

- ✅ Success Rate: 80% (4/5 files, 1 corrupted)
- ✅ Avg Processing Time: 0.98 seconds
- ✅ Languages: Spanish, English (auto-detected)
- ✅ Transcription Quality: Excellent
See TEST_RESULTS_SUMMARY.md for detailed results.
./install_service.sh

# Check status
sudo systemctl status whisper-api
# Start/Stop/Restart
sudo systemctl start whisper-api
sudo systemctl stop whisper-api
sudo systemctl restart whisper-api
# View logs
sudo journalctl -u whisper-api -f
# Disable autostart
sudo systemctl disable whisper-api

| Metric | Value |
|---|---|
| Model Size | 1.62 GB |
| GPU Memory | 3.53 GB (15% of RTX 3090) |
| Avg Processing Time | ~1 second |
| Processing Speed | 0.57 - 2.95 MB/s |
| Languages Supported | 99+ |
| Batch Size | 8 |
insanely-fast-whisper-api/
├── app/
│ ├── app.py # Main API (modified with OpenAI endpoint)
│ ├── diarization_pipeline.py
│ └── diarize.py
├── start_gpu2_limited.sh # Startup script
├── whisper-api.service # Systemd service file
├── install_service.sh # Service installation script
├── test_accuracy.py # Accuracy testing script
├── test_openwebui_endpoint.sh # Endpoint testing script
├── configure_openwebui.sh # Open WebUI configuration helper
├── README.md # This file
├── SETUP_SUMMARY.md # Initial setup guide
├── OPEN_WEBUI_INTEGRATION.md # Integration guide
├── TEST_RESULTS_SUMMARY.md # Test results
├── README_COMPLETE.md # Complete documentation
├── requirements.txt
└── pyproject.toml
# Check if port 8002 is free
ss -tlnp | grep :8002
# Kill existing process
pkill -f "uvicorn app.app:app"
# Restart
./start_gpu2_limited.sh

The API is configured for 15% memory usage. If you experience OOM errors:
- Check GPU memory: nvidia-smi (or the snippet below)
- Ensure no other processes are using GPU 2
- The Turbo model with the 15% limit should work fine
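A quick way to check free memory from Python (a standalone snippet; note that with CUDA_VISIBLE_DEVICES=2 set, GPU 2 appears as device 0 inside the process):

```python
# Standalone check of free/total memory on the visible CUDA device.
import torch

free, total = torch.cuda.mem_get_info(0)  # device 0 = GPU 2 when CUDA_VISIBLE_DEVICES=2
print(f"Free: {free / 1e9:.2f} GB / Total: {total / 1e9:.2f} GB")
```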
# Check API logs
sudo journalctl -u whisper-api -f
# Or if running manually, check the terminal output

If Open WebUI cannot reach the API:

- Verify the API is running: curl http://localhost:8002/
- Check the Open WebUI environment variables
- If using Docker, use http://host.docker.internal:8002
- SETUP_SUMMARY.md - Initial setup guide
- OPEN_WEBUI_INTEGRATION.md - Open WebUI integration
- TEST_RESULTS_SUMMARY.md - Test results and accuracy
- README_COMPLETE.md - Complete documentation
This is a production-optimized fork with:
- OpenAI-compatible endpoint for Open WebUI
- GPU 2 configuration with 15% memory limit
- Whisper Large V3 Turbo model
- Systemd service for autostart
- Comprehensive testing and documentation
Same as the original insanely-fast-whisper-api
Based on insanely-fast-whisper-api
For issues or questions:
- Check the troubleshooting section above
- Review the documentation files
- Check GPU status: nvidia-smi
- View logs: sudo journalctl -u whisper-api -f
Status: ✅ Production Ready
Last Updated: 2025-11-07
Model: Whisper Large V3 Turbo
GPU: NVIDIA GeForce RTX 3090 (GPU 2, 15% memory)
An API to transcribe audio with OpenAI's Whisper Large v3! Powered by 🤗 Transformers, Optimum & flash-attn
Features:
- 🎤 Transcribe audio to text at blazing fast speeds
- 📖 Fully open source and deployable on any GPU cloud provider
- 🗣️ Built-in speaker diarization
- ⚡ Easy-to-use and fast API layer
- 📃 Async background tasks and webhooks
- 🔥 Optimized for concurrency and parallel processing
- ✅ Task management, cancel and status endpoints
- 🔒 Admin authentication for secure API access
- 🧩 Fully managed API available on JigsawStack
Based on the Insanely Fast Whisper CLI project. Check it out if you'd like to set up this project locally or understand the background of insanely-fast-whisper.
This project focuses on providing a deployable, blazing-fast Whisper API with Docker on GPU cloud infrastructure for scalable production use cases.
With Fly.io's recent GPU service launch, I've set up the Fly config file to easily deploy on Fly machines! However, you can deploy this on any other VM environment that supports GPUs and Docker.
Here are some benchmarks we ran on Nvidia A100 - 80GB and fly.io GPU infra👇
| Optimization type | Time to Transcribe (150 mins of Audio) |
|---|---|
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~2 (1 min 38 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2 + diarization) | ~2 (3 min 16 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2 + fly machine startup) | ~2 (1 min 58 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2 + diarization + fly machine startup) | ~2 (3 min 36 sec) |
The estimated startup time for the Fly machine with a GPU, including loading the model, is around 20 seconds. The rest of the time is spent on the actual computation.
yoeven/insanely-fast-whisper-api:latest
Docker hub: yoeven/insanely-fast-whisper-api
- Make sure you already have access to Fly GPUs.
- Clone the project locally and open a terminal in the root
- Rename the app name in the fly.toml if you like
- Remove image = 'yoeven/insanely-fast-whisper-api:latest' in fly.toml only if you want to rebuild the image from the Dockerfile
Install the fly CLI if you don't already have it.
You only need to run this the first time you launch a new fly app:
fly launch

- Fly will prompt: Would you like to copy its configuration to the new app? (y/N). Answer Yes (y) to copy the configuration from the repo.
- Fly will prompt: Do you want to tweak these settings before proceeding. Most of the required settings are already configured in the fly.toml file, so if you have nothing to adjust, answer No (n) to proceed and deploy.
The first time you deploy it will take some time since the image is huge. Subsequent deploys will be a lot faster.
Run the following if you want to set up speaker diarization or an auth token to secure your API:
fly secrets set ADMIN_KEY=<your_token> HF_TOKEN=<your_hf_key>

Run fly secrets list to check that the secrets exist.
To get the Hugging Face token for speaker diarization, you need to do the following:
- Accept the pyannote/segmentation-3.0 user conditions
- Accept the pyannote/speaker-diarization-3.1 user conditions
- Create an access token at hf.co/settings/tokens
Your API should look something like this:
https://insanely-fast-whisper-api.fly.dev
Run fly logs -a insanely-fast-whisper-api to view your fly machine's logs in real time.
Since this is a dockerized app, you can deploy it to any cloud provider that supports docker and GPUs with a few config tweaks.
JigsawStack provides a bunch of powerful APIs for various use cases while keeping costs low. This project is available as a fully managed API here with enhanced cloud scalability for cost efficiency and high uptime. Sign up here for free!
If you set up the ADMIN_KEY environment secret, you'll need to pass x-admin-api-key in the header with the value of the key you previously set (see the example after the parameter table below).
If deployed on Fly, the base URL should look something like this:
https://{app_name}.fly.dev/{path}
Depending on the cloud provider you deploy to, the base URL will be different.
Transcribe or translate audio into text
| Name | Value |
|---|---|
| url (Required) | URL of the audio file |
| task | transcribe, translate. Default: transcribe |
| language | None, en, or another language code. Default: None (auto-detects language) |
| batch_size | Number of parallel batches to compute. Reduce if you face OOMs. Default: 64 |
| timestamp | chunk, word. Default: chunk |
| diarise_audio | Diarise the audio clips by speaker. You will need to set hf_token. Default: false |
| webhook | Webhook POST call on completion or error. Default: None |
| webhook.url | URL to send the webhook to |
| webhook.header | Headers to send with the webhook |
| is_async | Run the task in the background and send results to the webhook URL. true, false. Default: false |
| managed_task_id | Custom task ID used to reference an ongoing task. Default: a uuid() v4 is generated for each transcription task |
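For illustration, here is a hedged sketch of calling the transcription endpoint from Python with a few of these parameters. The exact route path is not shown in this section, so TRANSCRIBE_URL is a placeholder to replace with your own deployment's endpoint, and the audio URL and admin key are likewise placeholders:

```python
# Illustrative sketch only: parameter names follow the table above; the URL,
# audio file, and admin key are placeholders for your own deployment.
import requests

TRANSCRIBE_URL = "https://{app_name}.fly.dev/{path}"  # placeholder, see base URL note above

payload = {
    "url": "https://example.com/audio.mp3",  # placeholder audio URL
    "task": "transcribe",
    "language": None,       # auto-detect
    "batch_size": 64,
    "timestamp": "chunk",
    "is_async": False,
}

# Only needed if the ADMIN_KEY secret was set on the deployment.
headers = {"x-admin-api-key": "<your_token>"}

resp = requests.post(TRANSCRIBE_URL, json=payload, headers=headers)
resp.raise_for_status()
print(resp.json())
```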
Get all active transcription tasks, both async background tasks and ongoing tasks
Get the status of a task. Completed tasks are removed from the list, so querying them may throw an error.
Cancel async background task. Only transcription jobs created with is_async set to true can be cancelled.
# clone the repo
$ git clone https://github.com/jigsawstack/insanely-fast-whisper-api.git
# change the working directory
$ cd insanely-fast-whisper-api
# install torch
$ pip3 install torch torchvision torchaudio
# upgrade wheel and install required packages for FlashAttention
$ pip3 install -U wheel && pip install ninja packaging
# install FlashAttention
$ pip3 install flash-attn --no-build-isolation
# generate updated requirements.txt if you want to use other management tools (Optional)
$ poetry export --output requirements.txt
# get the path of python
$ which python3
# setup virtual environment
$ poetry env use /full/path/to/python
# install the requirements
$ poetry install
# run the app
$ uvicorn app.app:app --reload

Fly machines are charged by the second and may idle for up to 15 minutes before shutting themselves down. You can shut down the machine when you're done with the API to save costs by sending a POST request to the following endpoint:
https://api.machines.dev/v1/apps/<app_name>/machines/<machine_id>/stop
Authorization header:
Authorization: Bearer <fly_token>
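A minimal sketch of that call from Python (the app name, machine ID, and token are placeholders for your own values):

```python
# Stop a Fly machine via the Machines API to save costs.
import requests

app_name = "<app_name>"        # placeholder
machine_id = "<machine_id>"    # placeholder
fly_token = "<fly_token>"      # placeholder

resp = requests.post(
    f"https://api.machines.dev/v1/apps/{app_name}/machines/{machine_id}/stop",
    headers={"Authorization": f"Bearer {fly_token}"},
)
resp.raise_for_status()
print("Machine stop requested:", resp.status_code)
```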
Learn more here
- Vaibhav Srivastav for writing a huge chunk of the code and the CLI version of this project.
- OpenAI Whisper
This project is part of JigsawStack - A suite of powerful and developer friendly APIs for various use cases while keeping costs low. Sign up here for free!