| Rust Documentation | Python Documentation | Discord | Matrix |
Please submit requests for new models here.
-
Deploy with our easy to use APIs
After following installation instructions
-
🦙📷 Run the Llama 3.2 Vision Model: documentation and guide here
./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama -
🔥🧠 AnyMoE: Build a memory-efficient MoE model from anything, in seconds
./mistralrs-server -i toml -f toml-selectors/anymoe_lora.toml -
φ³ Run the new Phi 3.5/3.1/3 model with 128K context window
./mistralrs-server -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3 -
🌀 Run the Phi 3.5 MoE model with 128K context window: documentation and guide here
./mistralrs-server -i plain -m microsoft/Phi-3.5-MoE-instruct -a phi3.5moe -
φ³ 📷 Run the Phi 3 vision model: documentation and guide here
./mistralrs-server --port 1234 vision-plain -m microsoft/Phi-3.5-vision-instruct -a phi3v -
🌲📷 Run the FLUX.1 diffusion model: documentation and guide here
./mistralrs-server --port 1234 diffusion-plain -m black-forest-labs/FLUX.1-schnell -a flux -
Other models: see a support matrix and how to run them
Mistal.rs supports several model categories:
- Text to Text
- Text+Image to Text: Vision (see the docs)
- Text to Image: Image Generation (see the docs)
Easy:
- Lightweight OpenAI API compatible HTTP server
- Python API
- Grammar support with Regex and Yacc
- ISQ (In situ quantization): run
.safetensorsmodels directly from 🤗 Hugging Face by quantizing in-place
Fast:
- Apple silicon support: ARM NEON, Accelerate, Metal
- Accelerated CPU inference with MKL, AVX support
- CUDA support with flash attention and cuDNN.
- Device mapping: load and run some layers on the device and the rest on the CPU.
Quantization:
- Details
- GGML: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit, with ISQ support.
- GPTQ: 2-bit, 3-bit, 4-bit and 8-bit
- HQQ: 4-bit and 8 bit, with ISQ support
Powerful:
- LoRA support with weight merging
- First X-LoRA inference platform with first class support
- AnyMoE: Build a memory-efficient MoE model from anything, in seconds
- Various sampling and penalty methods
- Tool calling: docs
- Prompt chunking: process large prompts in a more manageable way
Advanced features:
- PagedAttention and continuous batching
- Prefix caching
- Topology: Configure ISQ and device mapping easily
- UQFF: Quantized file format for easy mixing of quants, see some models which have already been converted.
- Speculative Decoding: Mix supported models as the draft model or the target model
- Dynamic LoRA adapter activation with adapter preloading: examples and docs
Documentation for mistral.rs can be found here.
This is a demo of interactive mode with streaming running Phi 3 128k mini with quantization via ISQ to Q4K.
phi3_isq_demo.mp4
Note: See supported models for more information
| Model | Supports quantization | Supports adapters | Supports device mapping | Supported by AnyMoE |
|---|---|---|---|---|
| Mistral v0.1/v0.2/v0.3 | ✅ | ✅ | ✅ | ✅ |
| Gemma | ✅ | ✅ | ✅ | ✅ |
| Llama 3.1/3.2 | ✅ | ✅ | ✅ | ✅ |
| Mixtral | ✅ | ✅ | ✅ | |
| Phi 2 | ✅ | ✅ | ✅ | ✅ |
| Phi 3 | ✅ | ✅ | ✅ | ✅ |
| Phi 3.5 MoE | ✅ | ✅ | ||
| Qwen 2.5 | ✅ | ✅ | ✅ | |
| Phi 3 Vision | ✅ | ✅ | ✅ | |
| Idefics 2 | ✅ | ✅ | ✅ | |
| Gemma 2 | ✅ | ✅ | ✅ | ✅ |
| Starcoder 2 | ✅ | ✅ | ✅ | ✅ |
| LLaVa Next | ✅ | ✅ | ✅ | |
| LLaVa | ✅ | ✅ | ✅ | |
| Llama 3.2 Vision | ✅ | ✅ |
Rust multithreaded/async API for easy integration into any application.
- Docs
- Examples
- To install: Add
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }
Python API for mistral.rs.
OpenAI API compatible API server
- CUDA:
- Compile with the
cudafeature:--features cuda - FlashAttention support: compile with the
flash-attnfeature - cuDNN support: compile with the
cudnnfeature:--features cudnn
- Compile with the
- Metal:
- Compile with the
metalfeature:--features metal
- Compile with the
- CPU:
- Intel MKL: compile with the
mklfeature:--features mkl - Apple Accelerate: compile with the
acceleratefeature:--features accelerate - ARM NEON and AVX are used automatically
- Intel MKL: compile with the
Enabling features is done by passing --features ... to the build system. When using cargo run or maturin develop, pass the --features flag before the -- separating build flags from runtime flags.
- To enable a single feature like
metal:cargo build --release --features metal. - To enable multiple features, specify them in quotes:
cargo build --release --features "cuda flash-attn cudnn".
Note: You can use our Docker containers here. Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/
Note: You can use pre-built
mistralrs-serverbinaries here
- Install the Python package here.
-
Install required packages
OpenSSL(Example on Ubuntu:sudo apt install libssl-dev)- Linux only:
pkg-config(Example on Ubuntu:sudo apt install pkg-config)
-
Install Rust: https://rustup.rs/
Example on Ubuntu:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh source $HOME/.cargo/env
-
Optional: Set HF token correctly (skip if already set or your model is not gated, or if you want to use the
token_sourceparameters in Python or the command line.)- Note: you can install
huggingface-clias documented here.
huggingface-cli login
- Note: you can install
-
Download the code
git clone https://github.com/EricLBuehler/mistral.rs.git cd mistral.rs -
Build or install
-
Base build command
cargo build --release
-
Build with CUDA support
cargo build --release --features cuda
-
Build with CUDA and Flash Attention V2 support
cargo build --release --features "cuda flash-attn" -
Build with Metal support
cargo build --release --features metal
-
Build with Accelerate support
cargo build --release --features accelerate
-
Build with MKL support
cargo build --release --features mkl
-
Install with
cargo installfor easy command line usagePass the same values to
--featuresas you would forcargo buildcargo install --path mistralrs-server --features cuda
-
-
The build process will output a binary
misralrs-serverat./target/release/mistralrs-serverwhich may be copied into the working directory with the following command:Example on Ubuntu:
cp ./target/release/mistralrs-server ./mistralrs-server -
Use our APIs and integrations
There are 2 ways to get models with mistral.rs:
- From Hugging Face Hub (easiest)
- From local files
- Running a GGUF model
- Specify local paths
Mistral.rs can automatically download models from HF Hub. To access gated models, you should provide a token source. They may be one of:
literal:<value>: Load from a specified literalenv:<value>: Load from a specified environment variablepath:<value>: Load from a specified filecache: default: Load from the HF token at ~/.cache/huggingface/token or equivalent.none: Use no HF token
This is passed in the following ways:
- Command line:
./mistralrs-server --token-source none -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3- Python:
Here is an example of setting the token source.
If token cannot be loaded, no token will be used (i.e. effectively using none).
You can also instruct mistral.rs to load models fully locally by modifying the *_model_id arguments or options:
./mistralrs-server --port 1234 plain -m . -a mistralThroughout mistral.rs, any model ID argument or option may be a local path and should contain the following files for each model ID option:
--model-id(server) ormodel_id(python/rust) or--tok-model-id(server) ortok_model_id(python/rust):config.jsontokenizer_config.jsontokenizer.json(if not specified separately).safetensors/.bin/.pth/.ptfiles (defaults to.safetensors)preprocessor_config.json(required for vision models).processor_config.json(optional for vision models).
--quantized-model-id(server) orquantized_model_id(python/rust):- Specified
.ggufor.ggmlfile.
- Specified
--x-lora-model-id(server) orxlora_model_id(python/rust):xlora_classifier.safetensorsxlora_config.json- Adapters
.safetensorsandadapter_config.jsonfiles in their respective directories
--adapters-model-id(server) oradapters_model_id(python/rust):- Adapters
.safetensorsandadapter_config.jsonfiles in their respective directories
- Adapters
To run GGUF models, the only mandatory arguments are the quantized model ID and the quantized filename. The quantized model ID can be a HF model ID.
GGUF models contain a tokenizer. However, mistral.rs allows you to run the model with a tokenizer from a specified model, typically the official one. This means there are two options:
Running with a tokenizer model ID enables you to specify the model ID to source the tokenizer from:
./mistralrs-server gguf -m bartowski/Phi-3.5-mini-instruct-GGUF -f Phi-3.5-mini-instruct-Q4_K_M.gguf -t microsoft/Phi-3.5-mini-instructIf the specified tokenizer model ID contains a tokenizer.json, then it will be used over the GGUF tokenizer.
Using the builtin tokenizer:
./mistralrs-server gguf -m bartowski/Phi-3.5-mini-instruct-GGUF -f Phi-3.5-mini-instruct-Q4_K_M.gguf(or using a local file):
./mistralrs-server gguf -m path/to/files -f Phi-3.5-mini-instruct-Q4_K_M.ggufThere are a few more ways to configure:
Chat template:
The chat template can be automatically detected and loaded from the GGUF file if no other chat template source is specified including the tokenizer model ID.
If that does not work, you can either provide a tokenizer (recommended), or specify a custom chat template.
./mistralrs-server --chat-template <chat_template> gguf -m . -f Phi-3.5-mini-instruct-Q4_K_M.ggufTokenizer
The following tokenizer model types are currently supported. If you would like one to be added, please raise an issue. Otherwise, please consider using the method demonstrated in examples below, where the tokenizer is sourced from Hugging Face.
Supported GGUF tokenizer types
llama(sentencepiece)gpt2(BPE)
Mistral.rs uses subcommands to control the model type. They are generally of format <XLORA/LORA>-<QUANTIZATION>. Please run ./mistralrs-server --help to see the subcommands.
Additionally, for models without quantization, the model architecture should be provided as the --arch or -a argument in contrast to GGUF models which encode the architecture in the file.
Note: for plain models, you can specify the data type to load and run in. This must be one of
f32,f16,bf16orautoto choose based on the device. This is specified in the--dype/-dparameter after the model architecture (plain).
If you do not specify the architecture, an attempt will be made to use the model's config. If this fails, please raise an issue.
mistralgemmamixtralllamaphi2phi3phi3.5moeqwen2gemma2starcoder2
Note: for vision models, you can specify the data type to load and run in. This must be one of
f32,f16,bf16orautoto choose based on the device. This is specified in the--dype/-dparameter after the model architecture (vision-plain).
phi3videfics2llava_nextllavavllama
Plain:
llamaphi2phi3starcoder2
With adapters:
llamaphi3
You can launch interactive mode, a simple chat application running in the terminal, by passing -i:
./mistralrs-server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3Vision models work too:
./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllamaAnd even diffusion models:
./mistralrs-server -i diffusion-plain -m black-forest-labs/FLUX.1-schnell -a fluxYou can an HTTP server
./mistralrs-server --port 1234 plain -m microsoft/Phi-3.5-MoE-instruct -a phi3.5moeWe provide a method to select models with a .toml file. The keys are the same as the command line, with no_kv_cache and tokenizer_json being "global" keys.
Example:
./mistralrs-server --port 1234 toml -f toml-selectors/gguf.toml| Device | Mistral.rs Completion T/s | Llama.cpp Completion T/s | Model | Quant |
|---|---|---|---|---|
| A10 GPU, CUDA | 86 | 83 | mistral-7b | 4_K_M |
| Intel Xeon 8358 CPU, AVX | 11 | 23 | mistral-7b | 4_K_M |
| Raspberry Pi 5 (8GB), Neon | 2 | 3 | mistral-7b | 2_K |
| A100 GPU, CUDA | 131 | 134 | mistral-7b | 4_K_M |
| RTX 6000 GPU, CUDA | 103 | 96 | mistral-7b | 4_K_M |
Note: All CUDA tests for mistral.rs conducted with PagedAttention enabled, block size = 32
Please submit more benchmarks via raising an issue!
Quantization support
| Model | GGUF | GGML | ISQ |
|---|---|---|---|
| Mistral | ✅ | ✅ | |
| Gemma | ✅ | ||
| Llama | ✅ | ✅ | ✅ |
| Mixtral | ✅ | ✅ | |
| Phi 2 | ✅ | ✅ | |
| Phi 3 | ✅ | ✅ | |
| Phi 3.5 MoE | ✅ | ||
| Qwen 2.5 | ✅ | ||
| Phi 3 Vision | ✅ | ||
| Idefics 2 | ✅ | ||
| Gemma 2 | ✅ | ||
| Starcoder 2 | ✅ | ✅ | |
| LLaVa Next | ✅ | ||
| LLaVa | ✅ | ||
| Llama 3.2 Vision | ✅ |
Device mapping support
| Model category | Supported |
|---|---|
| Plain | ✅ |
| GGUF | ✅ |
| GGML | |
| Vision Plain | ✅ |
X-LoRA and LoRA support
| Model | X-LoRA | X-LoRA+GGUF | X-LoRA+GGML |
|---|---|---|---|
| Mistral | ✅ | ✅ | |
| Gemma | ✅ | ||
| Llama | ✅ | ✅ | ✅ |
| Mixtral | ✅ | ✅ | |
| Phi 2 | ✅ | ||
| Phi 3 | ✅ | ✅ | |
| Phi 3.5 MoE | |||
| Qwen 2.5 | |||
| Phi 3 Vision | |||
| Idefics 2 | |||
| Gemma 2 | ✅ | ||
| Starcoder 2 | ✅ | ||
| LLaVa Next | |||
| LLaVa | |||
| Llama 3.2 Vision |
AnyMoE support
| Model | AnyMoE |
|---|---|
| Mistral 7B | ✅ |
| Gemma | ✅ |
| Llama | ✅ |
| Mixtral | |
| Phi 2 | ✅ |
| Phi 3 | ✅ |
| Phi 3.5 MoE | |
| Qwen 2.5 | ✅ |
| Phi 3 Vision | |
| Idefics 2 | |
| Gemma 2 | ✅ |
| Starcoder 2 | ✅ |
| LLaVa Next | ✅ |
| LLaVa | ✅ |
| Llama 3.2 Vision |
To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass --help after the subcommand. For example, when using a different model than the default, specify the following for the following types of models:
- Plain: Model id
- Quantized: Quantized model id, quantized filename, and tokenizer id
- X-LoRA: Model id, X-LoRA ordering
- X-LoRA quantized: Quantized model id, quantized filename, tokenizer id, and X-LoRA ordering
- LoRA: Model id, LoRA ordering
- LoRA quantized: Quantized model id, quantized filename, tokenizer id, and LoRA ordering
- Vision Plain: Model id
See this section to determine if it is necessary to prepare an X-LoRA/LoRA ordering file, it is always necessary if the target modules or architecture changed, or if the adapter order changed.
It is also important to check the chat template style of the model. If the HF hub repo has a tokenizer_config.json file, it is not necessary to specify. Otherwise, templates can be found in chat_templates and should be passed before the subcommand. If the model is not instruction tuned, no chat template will be found and the APIs will only accept a prompt, no messages.
For example, when using a Zephyr model:
./mistralrs-server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf
An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the x-lora-* architecture, and LoRA support by selecting the lora-* architecture. Please find docs for adapter models here. Examples may be found here.
Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation here.
Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request. If you want to add a new model, please contact us via an issue and we can coordinate how to do this.
- Debugging with the environment variable
MISTRALRS_DEBUG=1causes the following things- If loading a GGUF or GGML model, this will output a file containing the names, shapes, and types of each tensor.
mistralrs_gguf_tensors.txtormistralrs_ggml_tensors.txt
- More logging.
- If loading a GGUF or GGML model, this will output a file containing the names, shapes, and types of each tensor.
- Setting the CUDA compiler path:
- Set the
NVCC_CCBINenvironment variable during build.
- Set the
- Error:
recompile with -fPIE:- Some Linux distributions require compiling with
-fPIE. - Set the
CUDA_NVCC_FLAGSenvironment variable to-fPIEduring build:CUDA_NVCC_FLAGS=-fPIE
- Some Linux distributions require compiling with
- Error
CUDA_ERROR_NOT_FOUNDor symbol not found when using a normal or vison model:- For non-quantized models, you can specify the data type to load and run in. This must be one of
f32,f16,bf16orautoto choose based on the device.
- For non-quantized models, you can specify the data type to load and run in. This must be one of
This project would not be possible without the excellent work at candle. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.
