An OpenAI-compatible Whisper API optimized for GPU 2 with a 15% memory limit. Features automatic language detection, Flash Attention 2.0, and seamless Open WebUI integration.
- 🚀 Whisper Large V3 Turbo - 2x faster than regular v3, same accuracy
- 🌍 Auto Language Detection - 99+ languages supported
- ⚡ Flash Attention 2.0 - Optimized GPU inference
- 🎯 OpenAI Compatible - Drop-in replacement for OpenAI Whisper API
- 💾 Memory Efficient - Limited to 15% of GPU 2 (3.53 GB)
- 🔄 Auto-restart - Systemd service with automatic recovery
- 📊 Production Tested - Verified with Spanish and English audio
git clone https://github.com/groxaxo/insanely-fast-whisper-api.git
cd insanely-fast-whisper-api
# Create conda environment
conda create -n whisper-api python=3.10 -y
conda activate whisper-api
# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install Flash Attention
pip install wheel ninja packaging
pip install flash-attn==2.5.6 --no-build-isolation
# Install dependencies
pip install -r requirements.txt

# Start the API
./start_gpu2_limited.sh

The API will be available at http://localhost:8002
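If the server fails to start, a quick standalone check (not part of the repo) can confirm that PyTorch sees a CUDA GPU and that flash-attn imports cleanly:

```python
# Quick environment sanity check before starting the server (illustrative only).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Visible device 0:", torch.cuda.get_device_name(0))

try:
    import flash_attn
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not installed")
```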
For automatic startup on boot:
./install_service.sh

GPU and model configuration (a pipeline sketch follows this list):

- Device: GPU 2 (via CUDA_VISIBLE_DEVICES=2)
- Memory Limit: 15% (3.53 GB on RTX 3090)
- Memory Management: expandable_segments:True
- Batch Size: 8 (optimized for memory efficiency)
- Model: openai/whisper-large-v3-turbo
- Size: 1.62 GB
- Precision: FP16
- Optimization: Flash Attention 2.0
- Chunk Length: 30 seconds
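For illustration, here is a minimal sketch of how these settings could map onto a 🤗 Transformers ASR pipeline. This is an assumption-laden sketch, not the contents of app/app.py; in particular, the per-process memory-fraction call is one possible way to enforce the 15% limit, and the file name is a placeholder:

```python
import os

# These are normally exported by the start script; set here (before importing torch)
# only for illustration.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2")
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch
from transformers import pipeline

# Cap this process at ~15% of the visible card's memory (assumed mechanism for the limit).
# With CUDA_VISIBLE_DEVICES=2, GPU 2 appears as device 0 inside the process.
torch.cuda.set_per_process_memory_fraction(0.15, device=0)

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,                                    # FP16 precision
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},    # Flash Attention 2.0
)

# 30-second chunks, batch size 8, matching the configuration above.
result = asr("audio.mp3", chunk_length_s=30, batch_size=8)
print(result["text"])
```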
curl -X POST http://localhost:8002/audio/transcriptions \
-F "[email protected]" \
-F "model=whisper-large-v3-turbo"curl -X POST http://localhost:8002/audio/transcriptions \
-F "[email protected]" \
-F "model=whisper-large-v3-turbo" \
-F "language=es"{
"text": "Transcribed text here..."
}Set these environment variables when starting Open WebUI:
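The same request from Python, as a minimal sketch using the requests library (the file name is a placeholder, and the language field is optional; the endpoint mirrors the curl call above):

```python
# Illustrative sketch: POST an audio file to the local OpenAI-compatible endpoint.
import requests

with open("audio.mp3", "rb") as f:  # placeholder file name
    resp = requests.post(
        "http://localhost:8002/audio/transcriptions",
        files={"file": f},
        data={"model": "whisper-large-v3-turbo", "language": "es"},  # language is optional
    )

resp.raise_for_status()
print(resp.json()["text"])
```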
Set these environment variables when starting Open WebUI:

STT_ENGINE=openai
STT_OPENAI_API_BASE_URL=http://localhost:8002
STT_OPENAI_API_KEY=dummy
STT_MODEL=whisper-large-v3-turbo

# Restart Open WebUI
pkill -f "open-webui"
STT_ENGINE=openai \
STT_OPENAI_API_BASE_URL=http://localhost:8002 \
STT_OPENAI_API_KEY=dummy \
STT_MODEL=whisper-large-v3-turbo \
open-webui serve

Or with Docker Compose:

services:
open-webui:
environment:
- STT_ENGINE=openai
- STT_OPENAI_API_BASE_URL=http://host.docker.internal:8002
- STT_OPENAI_API_KEY=dummy
- STT_MODEL=whisper-large-v3-turbo

# Run the accuracy test
python3 test_accuracy.py

# Test the Open WebUI endpoint
./test_openwebui_endpoint.sh

Results from testing:

- ✅ Success Rate: 80% (4/5 files, 1 corrupted)
- ✅ Avg Processing Time: 0.98 seconds
- ✅ Languages: Spanish, English (auto-detected)
- ✅ Transcription Quality: Excellent
See TEST_RESULTS_SUMMARY.md for detailed results.
./install_service.sh

# Check status
sudo systemctl status whisper-api
# Start/Stop/Restart
sudo systemctl start whisper-api
sudo systemctl stop whisper-api
sudo systemctl restart whisper-api
# View logs
sudo journalctl -u whisper-api -f
# Disable autostart
sudo systemctl disable whisper-api

| Metric | Value |
|---|---|
| Model Size | 1.62 GB |
| GPU Memory | 3.53 GB (15% of RTX 3090) |
| Avg Processing Time | ~1 second |
| Processing Speed | 0.57 - 2.95 MB/s |
| Languages Supported | 99+ |
| Batch Size | 8 |
insanely-fast-whisper-api/
├── app/
│ ├── app.py # Main API (modified with OpenAI endpoint)
│ ├── diarization_pipeline.py
│ └── diarize.py
├── start_gpu2_limited.sh # Startup script
├── whisper-api.service # Systemd service file
├── install_service.sh # Service installation script
├── test_accuracy.py # Accuracy testing script
├── test_openwebui_endpoint.sh # Endpoint testing script
├── configure_openwebui.sh # Open WebUI configuration helper
├── README.md # This file
├── SETUP_SUMMARY.md # Initial setup guide
├── OPEN_WEBUI_INTEGRATION.md # Integration guide
├── TEST_RESULTS_SUMMARY.md # Test results
├── README_COMPLETE.md # Complete documentation
├── requirements.txt
└── pyproject.toml
# Check if port 8002 is free
ss -tlnp | grep :8002
# Kill existing process
pkill -f "uvicorn app.app:app"
# Restart
./start_gpu2_limited.sh

The API is configured for 15% memory usage. If you experience OOM errors:
- Check GPU memory: nvidia-smi (or the snippet below)
- Ensure no other processes are using GPU 2
- The Turbo model with the 15% limit should work fine
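A quick way to check free memory from Python (a standalone snippet; note that with CUDA_VISIBLE_DEVICES=2 set, GPU 2 appears as device 0 inside the process):

```python
# Standalone check of free/total memory on the visible CUDA device.
import torch

free, total = torch.cuda.mem_get_info(0)  # device 0 = GPU 2 when CUDA_VISIBLE_DEVICES=2
print(f"Free: {free / 1e9:.2f} GB / Total: {total / 1e9:.2f} GB")
```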
# Check API logs
sudo journalctl -u whisper-api -f
# Or if running manually, check the terminal output

If Open WebUI cannot reach the API:

- Verify the API is running: curl http://localhost:8002/
- Check the Open WebUI environment variables
- If using Docker, use http://host.docker.internal:8002
- SETUP_SUMMARY.md - Initial setup guide
- OPEN_WEBUI_INTEGRATION.md - Open WebUI integration
- TEST_RESULTS_SUMMARY.md - Test results and accuracy
- README_COMPLETE.md - Complete documentation
This is a production-optimized fork with:
- OpenAI-compatible endpoint for Open WebUI
- GPU 2 configuration with 15% memory limit
- Whisper Large V3 Turbo model
- Systemd service for autostart
- Comprehensive testing and documentation
Same as the original insanely-fast-whisper-api
Based on insanely-fast-whisper-api
For issues or questions:
- Check the troubleshooting section above
- Review the documentation files
- Check GPU status: nvidia-smi
- View logs: sudo journalctl -u whisper-api -f
Status: ✅ Production Ready
Last Updated: 2025-11-07
Model: Whisper Large V3 Turbo
GPU: NVIDIA GeForce RTX 3090 (GPU 2, 15% memory)
An API to transcribe audio with OpenAI's Whisper Large v3! Powered by 🤗 Transformers, Optimum & flash-attn
Features:
- 🎤 Transcribe audio to text at blazing fast speeds
- 📖 Fully open source and deployable on any GPU cloud provider
- 🗣️ Built-in speaker diarization
- ⚡ Easy-to-use and fast API layer
- 📃 Async background tasks and webhooks
- 🔥 Optimized for concurrency and parallel processing
- ✅ Task management, cancel and status endpoints
- 🔒 Admin authentication for secure API access
- 🧩 Fully managed API available on JigsawStack
Based on the Insanely Fast Whisper CLI project. Check it out if you'd like to set up this project locally or understand the background of insanely-fast-whisper.
This project focuses on providing a deployable, blazing-fast Whisper API with Docker on GPU cloud infrastructure for scalable production use cases.
With Fly.io's recent GPU service launch, I've set up the Fly config file to easily deploy on Fly machines! However, you can deploy this on any other VM environment that supports GPUs and Docker.
Here are some benchmarks we ran on Nvidia A100 - 80GB and fly.io GPU infra👇
| Optimization type | Time to Transcribe (150 mins of Audio) |
|---|---|
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~2 (1 min 38 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2 + diarization) | ~2 (3 min 16 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2 + fly machine startup) | ~2 (1 min 58 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2 + diarization + fly machine startup) | ~2 (3 min 36 sec) |
The estimated startup time for the Fly machine with a GPU, including loading the model, is around 20 seconds. The rest of the time is spent on the actual computation.
yoeven/insanely-fast-whisper-api:latest
Docker hub: yoeven/insanely-fast-whisper-api
- Make sure you already have access to Fly GPUs.
- Clone the project locally and open a terminal in the root
- Rename the app name in the fly.toml if you like
- Remove image = 'yoeven/insanely-fast-whisper-api:latest' in fly.toml only if you want to rebuild the image from the Dockerfile
Install the fly CLI if you don't already have it.
You only need to run this the first time you launch a new fly app:
fly launch

- Fly will prompt: Would you like to copy its configuration to the new app? (y/N). Answer Yes (y) to copy the configuration from the repo.
- Fly will prompt: Do you want to tweak these settings before proceeding. Most of the required settings are already configured in the fly.toml file, so if you have nothing to adjust, answer No (n) to proceed and deploy.
The first time you deploy it will take some time since the image is huge. Subsequent deploys will be a lot faster.
Run the following if you want to set up speaker diarization or an auth token to secure your API:
fly secrets set ADMIN_KEY=<your_token> HF_TOKEN=<your_hf_key>

Run fly secrets list to check that the secrets exist.
To get the Hugging Face token for speaker diarization, you need to do the following:
- Accept the pyannote/segmentation-3.0 user conditions
- Accept the pyannote/speaker-diarization-3.1 user conditions
- Create an access token at hf.co/settings/tokens
Your API should look something like this:
https://insanely-fast-whisper-api.fly.dev
Run fly logs -a insanely-fast-whisper-api to view your fly machine's logs in real time.
Since this is a dockerized app, you can deploy it to any cloud provider that supports docker and GPUs with a few config tweaks.
JigsawStack provides a bunch of powerful APIs for various use cases while keeping costs low. This project is available as a fully managed API here with enhanced cloud scalability for cost efficiency and high uptime. Sign up here for free!
If you set up the ADMIN_KEY environment secret, you'll need to pass x-admin-api-key in the header with the value of the key you previously set (see the example after the parameter table below).
If deployed on Fly, the base URL should look something like this:
https://{app_name}.fly.dev/{path}
Depending on the cloud provider you deploy to, the base URL will be different.
Transcribe or translate audio into text
| Name | Value |
|---|---|
| url (Required) | URL of the audio file |
| task | transcribe, translate. Default: transcribe |
| language | None, en, or another language code. Default: None (auto-detects language) |
| batch_size | Number of parallel batches to compute. Reduce if you face OOMs. Default: 64 |
| timestamp | chunk, word. Default: chunk |
| diarise_audio | Diarise the audio clips by speaker. You will need to set hf_token. Default: false |
| webhook | Webhook POST call on completion or error. Default: None |
| webhook.url | URL to send the webhook to |
| webhook.header | Headers to send with the webhook |
| is_async | Run the task in the background and send results to the webhook URL. true, false. Default: false |
| managed_task_id | Custom task ID used to reference an ongoing task. Default: a uuid() v4 is generated for each transcription task |
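For illustration, here is a hedged sketch of calling the transcription endpoint from Python with a few of these parameters. The exact route path is not shown in this section, so TRANSCRIBE_URL is a placeholder to replace with your own deployment's endpoint, and the audio URL and admin key are likewise placeholders:

```python
# Illustrative sketch only: parameter names follow the table above; the URL,
# audio file, and admin key are placeholders for your own deployment.
import requests

TRANSCRIBE_URL = "https://{app_name}.fly.dev/{path}"  # placeholder, see base URL note above

payload = {
    "url": "https://example.com/audio.mp3",  # placeholder audio URL
    "task": "transcribe",
    "language": None,       # auto-detect
    "batch_size": 64,
    "timestamp": "chunk",
    "is_async": False,
}

# Only needed if the ADMIN_KEY secret was set on the deployment.
headers = {"x-admin-api-key": "<your_token>"}

resp = requests.post(TRANSCRIBE_URL, json=payload, headers=headers)
resp.raise_for_status()
print(resp.json())
```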
Get all active transcription tasks, both async background tasks and ongoing tasks
Get the status of a task. Completed tasks are removed from the list, so querying them may throw an error.
Cancel async background task. Only transcription jobs created with is_async set to true can be cancelled.
# clone the repo
$ git clone https://github.com/jigsawstack/insanely-fast-whisper-api.git
# change the working directory
$ cd insanely-fast-whisper-api
# install torch
$ pip3 install torch torchvision torchaudio
# upgrade wheel and install required packages for FlashAttention
$ pip3 install -U wheel && pip install ninja packaging
# install FlashAttention
$ pip3 install flash-attn --no-build-isolation
# generate updated requirements.txt if you want to use other management tools (Optional)
$ poetry export --output requirements.txt
# get the path of python
$ which python3
# setup virtual environment
$ poetry env use /full/path/to/python
# install the requirements
$ poetry install
# run the app
$ uvicorn app.app:app --reload

Fly machines are charged by the second and may idle for up to 15 minutes before shutting themselves down. You can shut down the machine when you're done with the API to save costs by sending a POST request to the following endpoint:
https://api.machines.dev/v1/apps/<app_name>/machines/<machine_id>/stop
Authorization header:
Authorization: Bearer <fly_token>
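A minimal sketch of that call from Python (the app name, machine ID, and token are placeholders for your own values):

```python
# Stop a Fly machine via the Machines API to save costs.
import requests

app_name = "<app_name>"        # placeholder
machine_id = "<machine_id>"    # placeholder
fly_token = "<fly_token>"      # placeholder

resp = requests.post(
    f"https://api.machines.dev/v1/apps/{app_name}/machines/{machine_id}/stop",
    headers={"Authorization": f"Bearer {fly_token}"},
)
resp.raise_for_status()
print("Machine stop requested:", resp.status_code)
```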
Learn more here
- Vaibhav Srivastav for writing a huge chunk of the code and the CLI version of this project.
- OpenAI Whisper
This project is part of JigsawStack - A suite of powerful and developer friendly APIs for various use cases while keeping costs low. Sign up here for free!