
llama-swappo

A fork of llama-swap with a minimally implemented ollama-compatible API grafted onto it, so you can use it with clients that only support ollama.

This makes llama-swappo a drop-in replacement for ollama, for enthusiasts who want more control and broader compatibility.

This fork automatically rebases onto the latest llama-swap nightly.

Features

  • ✅ Ollama API supported endpoints (see the curl sketch after this list):
    • HEAD / - for health check
    • api/tags - to list models
    • api/show - for model details
    • api/ps - to show what's running
    • api/generate (untested; the clients I've used so far seem to use the OpenAI-compatible endpoints for actual generation and chat)
    • api/chat (untested)
    • api/embed
    • api/embeddings
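
As a quick sanity check, you can exercise a few of these endpoints with curl. This is only a sketch: it assumes llama-swappo is listening on localhost:8080, and model1 is a placeholder for one of your configured model IDs.

# list available models (what `ollama list` shows)
curl http://localhost:8080/api/tags

# show details for a single model
curl http://localhost:8080/api/show -d '{"model": "model1"}'

# show which models are currently loaded
curl http://localhost:8080/api/ps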

How to install

Use the original Building from source instructions, and overwrite your installed llama-swap executable with the newly built one.
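
Roughly, that looks like the sketch below. The install path is an assumption (adjust /usr/local/bin to wherever your llama-swap binary lives), and the built binary name is taken from the upstream build instructions.

git clone https://github.com/kooshi/llama-swappo.git
cd llama-swappo
make clean all

# overwrite the existing binary with the freshly built one
sudo cp build/llama-swap /usr/local/bin/llama-swap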

Configuration

If you're using llama-server, llama-swappo will try to parse your command arguments for the additional metadata it needs, such as context length. Alternatively, you can define the values in your config, which will override the inferred values.

model1:
  cmd: path/to/cmd --arg1 one
  proxy: "http://localhost:8080"

  # optional: these values override anything inferred from cmd
  metadata:
    architecture: qwen3
    contextLength: 131072
    capabilities:
    - completion # for chat models
    - tools # for tool use (requires --jinja in llama-server, and you must compile with this PR included https://github.com/ggml-org/llama.cpp/pull/12379)
    - insert # for FIM (fill-in-the-middle) coding, untested
    - vision # untested
    - embedding #untested
    family: qwen # probably not needed
    parameterSize: 32B # probably not needed
    quantizationLevel: Q4_K_M # probably not needed

Support

This was a personal tweak so I could play with local models in GitHub Copilot without having to deal with ollama. I offered to merge this into the upstream repo, but the maintainer decided, and I agree, that this change overcomplicates the elegance of llama-swap and would be too much of a burden to maintain forever. My interests have already swung back to other projects, so I don't intend to support this seriously, and I won't be providing Docker images or anything else. I will accept pull requests if you fix something, though.

Original README follows



llama-swap

Run multiple LLM models on your machine and hot-swap between them as needed. llama-swap works with any OpenAI API-compatible server, giving you the flexibility to switch models without restarting your applications.

Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.

Features:

  • ✅ Easy to deploy and configure: one binary, one configuration file, no external dependencies
  • ✅ On-demand model switching
  • ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc)
    • future-proof: upgrade your inference servers at any time.
  • ✅ OpenAI API supported endpoints:
    • v1/completions
    • v1/chat/completions
    • v1/embeddings
    • v1/audio/speech (#36)
    • v1/audio/transcriptions (docs)
  • ✅ llama-server (llama.cpp) supported endpoints
    • v1/rerank, v1/reranking, /rerank
    • /infill - for code infilling
    • /completion - for completion endpoint
  • ✅ llama-swap API (see the curl examples after this list)
    • /ui - web UI
    • /upstream/:model_id - direct access to upstream server (demo)
    • /models/unload - manually unload running models (#58)
    • /running - list currently running models (#61)
    • /log - remote log monitoring
    • /health - just returns "OK"
  • ✅ Customizable
    • Run multiple models at once with Groups (#107)
    • Automatic unloading of models after timeout by setting a ttl
    • Reliable Docker and Podman support using cmd and cmdStop together
    • Preload models on startup with hooks (#235)
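
A couple of the llama-swap API endpoints above can be exercised directly with curl. The listen address below is an assumption; adjust it to match your deployment.

# health check - returns "OK"
curl http://localhost:8080/health

# list currently running models
curl http://localhost:8080/running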

Web UI

llama-swap includes a real-time web interface for monitoring logs and controlling models:

[screenshot: llama-swap web UI]

The Activity Page shows recent requests:

[screenshot: Activity Page]

Installation

llama-swap can be installed in multiple ways:

  1. Docker
  2. Homebrew (OSX and Linux)
  3. WinGet
  4. From release binaries
  5. From source

Docker Install (download images)

Nightly container images with llama-swap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc).

$ docker pull ghcr.io/mostlygeek/llama-swap:cuda

# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
 -v /path/to/models:/models \
 -v /path/to/custom/config.yaml:/app/config.yaml \
 ghcr.io/mostlygeek/llama-swap:cuda

More examples:

# pull latest images per platform
docker pull ghcr.io/mostlygeek/llama-swap:cpu
docker pull ghcr.io/mostlygeek/llama-swap:cuda
docker pull ghcr.io/mostlygeek/llama-swap:vulkan
docker pull ghcr.io/mostlygeek/llama-swap:intel
docker pull ghcr.io/mostlygeek/llama-swap:musa

# tagged llama-swap, platform and llama-server version images
docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795

Homebrew Install (macOS/Linux)

brew tap mostlygeek/llama-swap
brew install llama-swap
llama-swap --config path/to/config.yaml --listen localhost:8080

WinGet Install (Windows)

Note: the WinGet package is maintained by community contributor Dvd-Znf (#327). It is not an official part of llama-swap.

# install
C:\> winget install llama-swap

# upgrade
C:\> winget upgrade llama-swap

Pre-built Binaries

Binaries are available on the release page for Linux, Mac, Windows and FreeBSD.

Building from source

  1. Building requires Go and Node.js (for UI).
  2. git clone https://github.com/mostlygeek/llama-swap.git
  3. make clean all
  4. look in the build/ subdirectory for the llama-swap binary
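
Put together, and assuming Go and Node.js are already on your PATH, the build looks roughly like this (the --config and --listen flags match the Homebrew example above):

git clone https://github.com/mostlygeek/llama-swap.git
cd llama-swap
make clean all

# the binary lands in build/
./build/llama-swap --config path/to/config.yaml --listen localhost:8080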

Configuration

# minimum viable config.yaml

models:
  model1:
    cmd: llama-server --port ${PORT} --model /path/to/model.gguf

That's all you need to get started:

  1. models - holds all model configurations
  2. model1 - the ID used in API calls
  3. cmd - the command to run to start the server.
  4. ${PORT} - an automatically assigned port number

Almost all configuration settings are optional and can be added one step at a time (a short example follows this list):

  • Advanced features
    • groups to run multiple models at once
    • hooks to run things on startup
    • macros for reusable snippets
  • Model customization
    • ttl to automatically unload models
    • aliases to use familiar model names (e.g., "gpt-4o-mini")
    • env to pass custom environment variables to inference servers
    • cmdStop to gracefully stop Docker/Podman containers
    • useModelName to override model names sent to upstream servers
    • ${PORT} automatic port variables for dynamic port assignment
    • filters to rewrite parts of requests before they are sent to the upstream server
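
As a small illustration, here is a sketch of a single model entry using a few of those options. The model ID, command, and values are placeholders, not a recommended configuration:

models:
  qwen-32b:
    cmd: llama-server --port ${PORT} --model /path/to/model.gguf
    ttl: 300                 # unload after 5 minutes of inactivity
    aliases:
      - gpt-4o-mini          # requests for this name route to qwen-32b
    env:
      - CUDA_VISIBLE_DEVICES=0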

See the configuration documentation for all options.

How does llama-swap work?

When a request is made to an OpenAI compatible endpoint, llama-swap will extract the model value and load the appropriate server configuration to serve it. If the wrong upstream server is running, it will be replaced with the correct one. This is where the "swap" part comes in. The upstream server is automatically swapped to handle the request correctly.

In the most basic configuration llama-swap handles one model at a time. For more advanced use cases, the groups feature allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.
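
Concretely, an ordinary OpenAI-style request is all it takes to trigger a swap. In the sketch below, localhost:8080 is an assumed listen address and model1 is a placeholder for an ID from your config:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model1",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'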

Reverse Proxy Configuration (nginx)

If you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses which breaks Server‑Sent Events (SSE) and streaming chat completion. (#236)

Recommended nginx configuration snippets:

# SSE for UI events/logs
location /api/events {
    proxy_pass http://your-llama-swap-backend;
    proxy_buffering off;
    proxy_cache off;
}

# Streaming chat completions (stream=true)
location /v1/chat/completions {
    proxy_pass http://your-llama-swap-backend;
    proxy_buffering off;
    proxy_cache off;
}

As a safeguard, llama-swap also sets X-Accel-Buffering: no on SSE responses. However, explicitly disabling proxy_buffering at your reverse proxy is still recommended for reliable streaming behavior.

Monitoring Logs on the CLI

# sends up to the last 10KB of logs
curl 'http://host/logs'

# streams combined logs
curl -Ns 'http://host/logs/stream'

# just llama-swap's logs
curl -Ns 'http://host/logs/stream/proxy'

# just upstream's logs
curl -Ns 'http://host/logs/stream/upstream'

# stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'

# skips history and just streams new log entries
curl -Ns 'http://host/logs/stream?no-history'

Do I need to use llama.cpp's server (llama-server)?

Any OpenAI compatible server would work. llama-swap was originally designed for llama-server and it is the best supported.

For Python-based inference servers like vllm or tabbyAPI, it is recommended to run them via podman or docker. This provides clean environment isolation and ensures they respond correctly to SIGTERM signals for a proper shutdown.
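
A rough sketch of what that can look like, pairing cmd with cmdStop as mentioned in the features above. The image, flags, mount paths, and container name are illustrative assumptions, not a canonical setup:

models:
  my-vllm-model:
    cmd: |
      docker run --rm --name my-vllm --gpus all
      -p ${PORT}:8000 -v /path/to/models:/models
      vllm/vllm-openai:latest --model /models/my-model
    cmdStop: docker stop my-vllm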

Star History

Note: ⭐️ Star this project to help others discover it!

[Star History chart]
