# GPU-Accelerated Local LLMs for Everyone (Vulkan + Ilm = "knowledge")
VulkanIlm is a Python-first wrapper and CLI around llama.cpp's Vulkan backend that brings fast local LLM inference to AMD, Intel, and NVIDIA GPUs, with no CUDA required. Built for developers with legacy or non-NVIDIA hardware.
- What: Python library + CLI to run LLMs locally using Vulkan GPU acceleration.
- Why: Most acceleration tooling targets CUDA/NVIDIA; VulkanIlm opens fast local inference to AMD and Intel users.
- Quick result: small models can run ~30× faster on iGPUs; mid-size models on legacy GPUs see ~4–6× speedups vs CPU.
- 🚀 Significant speedups vs CPU on legacy GPUs and iGPUs
- 🎮 Broad GPU support: AMD, Intel, NVIDIA (via Vulkan)
- 🐍 Python-first API + easy CLI tools
- ⚡ Auto detection + GPU-specific optimizations
- 📦 Auto build/install of the `llama.cpp` Vulkan backend
- 🔄 Real-time streaming token generation
- ✅ Reproducible benchmark scripts in `benchmarks/`
## Benchmarks

Benchmarks were measured with Gemma-3n-E4B-it (6.9B) unless noted. Results depend on model quantization, GPU drivers, OS, and system load.
| Hardware (OS) | Model | CPU time | Vulkan (GPU) time | Speedup |
|---|---|---|---|---|
| Dell E7250 (i7-5600U, integrated GPU), Fedora 42 Workstation | TinyLLaMA-1.1B-Chat (Q4_K_M) | 121 s | 3 s | 33× |
| AMD RX 580 8GB, Ubuntu 22.04.5 LTS (Jammy) | Gemma-3n-E4B-it (6.9B) | 188.47 s | 44.74 s | 4.21× |
| Intel Arc A770 | Gemma-3n-E4B-it (6.9B) | ~120 s | ~25 s | ~4.8× |
| AMD RX 6600 | Gemma-3n-E4B-it (6.9B) | ~90 s | ~18 s | ~5.0× |
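The speedup column is just CPU time divided by Vulkan time; a quick sanity check against the RX 580 and Intel Arc rows above:

```python
# Recompute a speedup figure from the table above: speedup = CPU time / GPU time.
def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """How many times faster the Vulkan run was than the CPU run."""
    return cpu_seconds / gpu_seconds

# AMD RX 580 row: 188.47 s (CPU) vs 44.74 s (Vulkan)
print(f"{speedup(188.47, 44.74):.2f}x")  # prints 4.21x
# Intel Arc A770 row: ~120 s vs ~25 s
print(f"{speedup(120, 25):.1f}x")        # prints 4.8x
```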
### iGPU notes

- The Dell E7250 iGPU result shows that older integrated GPUs can be very effective for smaller LLMs when using Vulkan.
- Smaller models and appropriate quantizations are more iGPU-friendly; driver and version differences significantly affect results.
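A rough way to see why smaller models suit iGPUs: the weight footprint is roughly parameter count × bits-per-weight ÷ 8. A minimal sketch (the ~4.5 bits/weight figure is an illustrative approximation for 4-bit K-quants, not an exact value):

```python
def approx_weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF weight footprint in GB: parameters x bits / 8 (billions cancel against GB)."""
    return params_billion * bits_per_weight / 8

# TinyLLaMA-1.1B at ~4.5 bits/weight fits easily in shared iGPU memory
print(f"{approx_weight_size_gb(1.1, 4.5):.2f} GB")  # ~0.62 GB
# Gemma-3n-E4B-it (6.9B) at the same quantization needs a much larger budget
print(f"{approx_weight_size_gb(6.9, 4.5):.2f} GB")  # ~3.88 GB
```

Actual memory use is higher once the KV cache and activations are included, so treat this only as a lower bound.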
### Other tested (functional) models

- `DeepSeek-R1-Distill-Qwen-1.5B-unsloth-bnb-4bit`: runs (not benchmarked).
- `LLaMA 3.1 8B`: runs (not benchmarked).
### Why Vulkan instead of ROCm on legacy AMD cards

- ROCm is not officially supported for `gfx803` (RX 580).
- Some community members try ROCm 5/6 workarounds on RX 580, but they are unstable and unsupported.
- VulkanIlm offers a Vulkan-based path that avoids ROCm entirely on legacy AMD cards.
## Quick start

```bash
git clone https://github.com/Talnz007/VulkanIlm.git
cd VulkanIlm
pip install -e .
```

### Prerequisites
- Python 3.9+
- Vulkan-capable GPU (AMD RX 400+, Intel Arc/Xe, NVIDIA GTX 900+)
- Vulkan drivers installed and working
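A quick pre-flight check for two of the prerequisites, sketched with only the standard library (it just checks the interpreter version and whether `vulkaninfo` is on PATH; a present binary is a necessary first step, not proof the drivers work):

```python
import shutil
import sys

# Prerequisite: Python 3.9+
python_ok = sys.version_info >= (3, 9)

# Prerequisite: Vulkan tooling -- vulkaninfo ships with vulkan-tools
vulkaninfo_path = shutil.which("vulkaninfo")

print(f"Python >= 3.9: {python_ok}")
print(f"vulkaninfo on PATH: {vulkaninfo_path or 'no (install vulkan-tools)'}")
```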
### Install Vulkan tools (if needed)

Ubuntu / Debian:

```bash
sudo apt update
sudo apt install vulkan-tools libvulkan-dev
```

Fedora / RHEL:

```bash
sudo dnf install vulkan-tools vulkan-devel
```

Verify:

```bash
vulkaninfo
```

## CLI usage

```bash
# Auto-install llama.cpp with Vulkan support
vulkanilm install

# Check your GPU setup
vulkanilm vulkan-info

# Search and download models (if supported)
vulkanilm search "llama"
vulkanilm download microsoft/DialoGPT-medium

# Generate text
vulkanilm ask path/to/model.gguf --prompt "Explain quantum computing"

# Stream tokens in real-time
vulkanilm stream path/to/model.gguf "Tell me a story about AI"

# Run a benchmark
vulkanilm benchmark path/to/model.gguf --prompt "Benchmark prompt" --repeat 3
```

## Python usage

```python
from vulkan_ilm import Llama

# Load model (auto GPU optimization)
llm = Llama("path/to/model.gguf", gpu_layers=16)

# Synchronous generation
response = llm.ask("Explain the term 'ilm' in AI context.")
print(response)

# Streaming generation
for token in llm.stream_ask_real("Tell me about Vulkan API"):
    print(token, end='', flush=True)
```

## Reproducing the benchmarks

- Use the exact model file & quantization referenced in `benchmarks/` (GGUF + quantization).
- Use the benchmark script in `benchmarks/run_benchmark.sh`.
- Record: driver version, OS version, CPU frequency governor, and system load.
- Run benchmarks multiple times (cold and warm cache) and average the results.
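The "multiple runs, cold and warm cache" advice can be sketched as a small timing harness; `run_once` here is a hypothetical stand-in for whatever inference call you are measuring:

```python
import time

def benchmark(run_once, repeats: int = 3):
    """Time one cold run plus `repeats` warm runs; return (cold, warm average)."""
    times = []
    for _ in range(repeats + 1):  # +1 so the cold run doesn't skew the average
        start = time.perf_counter()
        run_once()
        times.append(time.perf_counter() - start)
    cold, warm = times[0], times[1:]
    return cold, sum(warm) / len(warm)

# Trivial stand-in workload, just to show the shape of the output
cold_s, warm_avg_s = benchmark(lambda: sum(range(100_000)))
print(f"cold: {cold_s:.4f} s, warm avg over 3 runs: {warm_avg_s:.4f} s")
```

Reporting the cold run separately from the warm average makes driver/shader-compilation overhead visible instead of hiding it in the mean.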
## Troubleshooting

Activate a venv and reinstall:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

Or run via Poetry:

```bash
poetry run vulkanilm install
```

Install `glslc` (Vulkan SDK / vulkan-tools):

```bash
# Fedora
sudo dnf install glslc
# Ubuntu/Debian
sudo apt install vulkan-tools
```

Verify: `glslc --version`

Install the libcurl dev package:

```bash
# Fedora
sudo dnf install libcurl-devel
# Ubuntu/Debian
sudo apt install libcurl4-openssl-dev
```

## Project structure

```
VulkanIlm/
├── vulkan_ilm/
│   ├── cli.py
│   ├── llama.py
│   ├── vulkan/
│   │   └── detector.py
│   ├── benchmark.py
│   ├── installer.py
│   └── streaming.py
├── benchmarks/       # benchmark scripts & data
├── pyproject.toml
└── README.md
```
## Contributing

We welcome contributions! Useful areas:
- GPU testing across drivers & OSes
- Additional model formats & quant recipes
- Memory & perf optimizations
- Docs, reproducible benchmarks, and examples
See CONTRIBUTING.md for details. Look for good-first-issue tags.
Ilm (علم) = knowledge / wisdom. Combined with Vulkan: "knowledge on fire", making fast local AI accessible to everyone, regardless of GPU brand or budget. 🔥
## License

MIT; see LICENSE for details.
- Repo: https://github.com/Talnz007/VulkanIlm
- Issues: Report bugs or request features on GitHub
- Discussions: Community Q&A
- Full documentation: https://talnz007.github.io/VulkanIlm/#/
Built with passion by @Talnz007, bringing fast, local AI to legacy GPUs everywhere.