A user-friendly GUI (Tkinter) to easily configure and launch the llama.cpp and ik_llama
server, manage model configurations, set environment variables, and generate launch scripts.
This Python script provides a comprehensive graphical interface for the llama.cpp and ik_llama server, simplifying the management of command-line arguments and models.
- Intuitive GUI: Easy-to-use Tkinter interface with tabbed sections for:
  - Main Settings (paths, model selection, basic parameters)
  - Advanced Settings (GPU, memory, cache, performance, generation)
  - Chat Templates (select predefined, use model default, or provide custom)
  - Environment Variables (manage CUDA and custom variables)
  - Configurations (save/load/import/export launch setups)
- Comprehensive Parameter Control: Fine-tune your `llama.cpp` server:
  - Model Management: Scan directories for GGUF models, automatic model analysis (layers, architecture, size) with fallbacks, manual model info entry.
  - Core Parameters: Threads (main & batch), context size, batch sizes (prompt & ubatch), sampling (temperature, min_p, seed).
  - GPU Offloading: GPU layers, tensor split (with VRAM-based recommendations), main GPU selection, Flash Attention toggle.
  - Memory & Cache: KV cache types (K & V), mmap, mlock, no KV offload.
  - Network: Host IP and port configuration.
  - Generation: Ignore EOS, n_predict (max tokens).
  - Custom Arguments: Pass any additional `llama.cpp` server parameters.
- ik_llama Support: Added support for ik_llama with a separate parameters tab (6/15/2025)
- System & GPU Insights:
  - Detects and displays CUDA GPU(s) (via PyTorch), system RAM, and CPU core information.
  - Supports manual GPU configuration if automatic detection is unavailable.
- Chat Template Flexibility:
  - Load predefined chat templates from `chat_templates.json`.
  - Option to let `llama.cpp` decide the template based on model metadata.
  - Provide your own custom Jinja2 template string.
- Environment Variable Management:
- Easily enable/disable common CUDA environment variables (e.g.,
GGML_CUDA_FORCE_MMQ
). - Add and manage custom environment variables to fine tune CUDA performance.
- Easily enable/disable common CUDA environment variables (e.g.,
- Configuration Hub:
  - Save, load, and delete named launch configurations.
  - Import and export configurations to JSON for sharing or backup.
  - Application settings (last used paths, UI preferences) are remembered.
- Script Generation:
  - Generate ready-to-use PowerShell (`.ps1`) and Bash (`.sh`) scripts from your current settings (including environment variables); see the example script after this feature list.
- Cross-Platform Design:
  - Works on Windows (tested), Linux (tested), and macOS (untested).
  - Includes platform-specific considerations for venv activation (for GPU recognition) and terminal launching.
- Dependency Awareness:
  - Checks for optional but recommended dependencies for GPU detection and model information.
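To give a sense of what script generation produces, here is an illustrative sketch of a generated Bash launch script. The exact output depends on your settings; the binary location, model path, and parameter values below are placeholders, not the launcher's literal output.

```bash
#!/usr/bin/env bash
# Illustrative sketch of a generated launch script (placeholder paths and values).

# Environment variables enabled in the Environment Variables tab
export GGML_CUDA_FORCE_MMQ=1

# Launch llama-server with the parameters chosen in the GUI
/path/to/llama.cpp/build/bin/llama-server \
  --model /path/to/models/example-model.Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --threads 8
```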
- Python 3.7+ with tkinter support (typically included with Python)
- llama.cpp or ik_llama built with server support (`llama-server` executable)
- requests - Required for version checking and updates
  - Install with: `pip install requests`
- PyTorch (`torch`) - Required if you want automatic GPU detection and selection
  - Install in your virtual environment: `pip install torch`
  - Without PyTorch, you can still manually configure GPU settings
  - Enables automatic CUDA device detection and system resource information
- llama-cpp-python - Optional fallback for GGUF model analysis
  - Install in your virtual environment: `pip install llama-cpp-python`
  - Provides enhanced model analysis when llama.cpp tools are unavailable
  - The launcher works without it using built-in GGUF parsing and llama.cpp tools
- psutil - Optional for enhanced system information
  - Provides detailed CPU and RAM information across platforms
  - Install with: `pip install psutil`
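If you want to confirm which of the optional packages are already available, a quick check along these lines may help (it assumes the virtual environment you intend to use is active):

```bash
# Report which optional Python dependencies are importable in the current environment
for pkg in requests torch llama_cpp psutil; do
  python -c "import $pkg" 2>/dev/null && echo "$pkg: installed" || echo "$pkg: missing"
done
```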
```bash
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Option 1: Install using requirements.txt (recommended)
pip install -r requirements.txt

# Option 2: Install dependencies individually
pip install requests torch llama-cpp-python psutil
```
```bash
git clone https://github.com/thad0ctor/llama-server-launcher.git
cd llama-server-launcher
```

Install the required Python dependencies using the provided requirements.txt file:

```bash
pip install -r requirements.txt
```

Or follow the Dependencies section above to install dependencies individually.
You'll need to build llama.cpp or ik_llama separately and point the launcher to the build directory. Here's an example build configuration:
⚠️ Example Environment Disclaimer:
The following build example was tested on Ubuntu 24.04 with CUDA 12.9 and GCC 13. Your build flags may need adjustment based on your system configuration, CUDA version, GCC version, and GPU architecture.
```bash
# Navigate to your llama.cpp (or ik_llama) directory
cd /path/to/llama.cpp

# Clean previous builds
rm -rf build CMakeCache.txt CMakeFiles
mkdir build && cd build

# Configure with CUDA support and optimization flags
CC=/usr/bin/gcc-13 CXX=/usr/bin/g++-13 cmake .. \
  -DGGML_CUDA=on \
  -DGGML_CUDA_FORCE_MMQ=on \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DCMAKE_CUDA_FLAGS="--use_fast_math"

# Build with all available cores
make -j$(nproc)
```
📚 Need More Build Help?
For additional building guidance, platform-specific instructions, and troubleshooting, refer to the official llama.cpp documentation.
Key Build Flags Explained:
- `-DGGML_CUDA=on` - Enables CUDA support
- `-DGGML_CUDA_FORCE_MMQ=on` - Forces the quantized matrix-multiplication (MMQ) kernels for better performance
- `-DCMAKE_CUDA_ARCHITECTURES=120` - Targets a specific GPU architecture (adjust for your GPU)
- `-DCMAKE_CUDA_FLAGS="--use_fast_math"` - Enables fast math optimizations
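If you are unsure which value to pass to `-DCMAKE_CUDA_ARCHITECTURES`, reasonably recent NVIDIA drivers can report your GPU's compute capability directly; drop the dot to get the architecture number (e.g., 8.6 becomes 86, 12.0 becomes 120):

```bash
# Print GPU name and compute capability (requires a reasonably recent driver)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```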
- Run the launcher: `python llamacpp-server-launcher.py`
- In the Main tab, set the "llama.cpp Directory" to your llama.cpp build folder
- The launcher will automatically find the `llama-server` executable
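As a quick sanity check before pointing the launcher at your build, you can verify the server binary is where you expect; with a default CMake build the binaries typically land under `build/bin` (the path below is an example, adjust it to your tree):

```bash
# Confirm the server binary exists and runs (example path)
/path/to/llama.cpp/build/bin/llama-server --version
```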
This launcher aims to streamline your llama.cpp server workflow when working with and testing multiple models, making it more accessible and efficient for both new and experienced users.