Commit fa8a413

Merge pull request #424 from are-ces/main
LCORE-169: Provide initial set of opinionated & tested llama-stack co…
2 parents 56099e8 + d14b373 commit fa8a413

9 files changed: 846 additions & 1 deletion

README.md

Lines changed: 15 additions & 0 deletions
@@ -35,6 +35,7 @@ The service includes comprehensive user data collection capabilities for various
* [K8s based authentication](#k8s-based-authentication)
* [JSON Web Keyset based authentication](#json-web-keyset-based-authentication)
* [No-op authentication](#no-op-authentication)
* [RAG Configuration](#rag-configuration)
* [Usage](#usage)
* [Make targets](#make-targets)
* [Running Linux container image](#running-linux-container-image)
@@ -451,7 +452,21 @@ service:
Credentials are not allowed with wildcard origins per CORS/Fetch spec.
See https://fastapi.tiangolo.com/tutorial/cors/

# RAG Configuration

The [RAG setup guide](docs/rag_guide.md) explains how to set up RAG and includes tested examples for both inference and vector store integration.

## Example configurations for inference

The following configurations are llama-stack config examples from production deployments:

- [Granite on vLLM example](examples/vllm-granite-run.yaml)
- [Qwen3 on vLLM example](examples/vllm-qwen3-run.yaml)
- [Gemini example](examples/gemini-run.yaml)
- [VertexAI example](examples/vertexai-run.yaml)

> [!NOTE]
> RAG functionality is **not tested** for these configurations.

# Usage

docs/rag_guide.md

Lines changed: 122 additions & 1 deletion
@@ -61,7 +61,7 @@ Update the `run.yaml` file used by Llama Stack to point to:
* Your downloaded **embedding model**
* Your generated **vector database**

### FAISS example

```yaml
models:
@@ -100,10 +100,113 @@ Where:
- `db_path` is the path to the vector index (.db file in this case)
- `vector_db_id` is the index ID used to generate the db

See the full working [config example](examples/openai-faiss-run.yaml) for more details.
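
Once the stack is running with this configuration, the index can be exercised directly from Python. The snippet below is a rough sketch, not part of the shipped docs: it assumes a llama-stack server on `localhost:8321`, the `openshift-index` ID used in the full example config, and the `llama_stack_client` package; the RAG tool query API may differ slightly between llama-stack-client versions.

```python
# Rough sketch: query the FAISS-backed index through the llama-stack RAG tool.
# Assumes a llama-stack server on localhost:8321 and the 'openshift-index'
# vector_db_id from the example config; adjust both to your setup.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Confirm the vector DB registered in run.yaml is visible to the server.
print(client.vector_dbs.list())

# Retrieve chunks relevant to a test question from the index.
result = client.tool_runtime.rag_tool.query(
    vector_db_ids=["openshift-index"],
    content="How do I configure authentication?",
)
print(result.content)
```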

### pgvector example

This example shows how to configure a remote PostgreSQL database with the [pgvector](https://github.com/pgvector/pgvector) extension for storing embeddings.

> You will need to install a PostgreSQL version compatible with pgvector, then log in with `psql` and enable the extension with:
> ```sql
> CREATE EXTENSION IF NOT EXISTS vector;
> ```

Update the connection details (`host`, `port`, `db`, `user`, `password`) to match your PostgreSQL setup.

Each pgvector-backed table follows this schema:

- `id` (`text`): UUID identifier of the chunk
- `document` (`jsonb`): JSON containing the content and metadata associated with the embedding
- `embedding` (`vector(n)`): the embedding vector, where `n` is the embedding dimension and matches the model's output size (e.g. 768 for `all-mpnet-base-v2`)

> [!NOTE]
> The `vector_db_id` (e.g. `rhdocs`) points to the table named `vector_store_rhdocs` in the specified database, which stores the vector embeddings.

```yaml
[...]
providers:
  [...]
  vector_io:
  - provider_id: pgvector-example
    provider_type: remote::pgvector
    config:
      host: localhost
      port: 5432
      db: pgvector_example # PostgreSQL database (psql -d pgvector_example)
      user: lightspeed # PostgreSQL user
      password: password123
      kvstore:
        type: sqlite
        db_path: .llama/distributions/pgvector/pgvector_registry.db

vector_dbs:
- embedding_dimension: 768
  embedding_model: sentence-transformers/all-mpnet-base-v2
  provider_id: pgvector-example
  # A unique ID that becomes the PostgreSQL table name, prefixed with 'vector_store_'.
  # e.g., 'rhdocs' will create the table 'vector_store_rhdocs'.
  # If the table was already created, this value must match the ID used at creation.
  vector_db_id: rhdocs
```

See the full working [config example](examples/openai-pgvector-run.yaml) for more details.
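
A quick way to confirm that ingestion actually populated the table is to inspect it directly with a PostgreSQL client. The following is a minimal sketch using `psycopg2`, assuming the connection details and the `rhdocs` vector_db_id from the config above (so the table is `vector_store_rhdocs`).

```python
# Minimal sanity check of the pgvector-backed table (sketch only; assumes the
# connection details and the 'rhdocs' vector_db_id from the example config).
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="pgvector_example",
    user="lightspeed",
    password="password123",
)
with conn, conn.cursor() as cur:
    # Count stored chunks and peek at one row to confirm the expected schema
    # (id: text, document: jsonb, embedding: vector(768)).
    cur.execute("SELECT count(*) FROM vector_store_rhdocs;")
    print("chunks stored:", cur.fetchone()[0])

    cur.execute("SELECT id, document FROM vector_store_rhdocs LIMIT 1;")
    row = cur.fetchone()
    if row is not None:
        print("sample chunk id:", row[0])
conn.close()
```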

---

## Add an Inference Model (LLM)

### vLLM on RHEL AI (Llama 3.1) example

> [!NOTE]
> The following example assumes that podman's CDI has been properly configured to [enable GPU support](https://podman-desktop.io/docs/podman/gpu).

The [`vllm-openai`](https://hub.docker.com/r/vllm/vllm-openai) Docker image is used to serve the Llama-3.1-8B-Instruct model.
The following example shows how to run it on **RHEL AI** with `podman`:

```bash
podman run \
  --device "${CONTAINER_DEVICE}" \
  --gpus ${GPUS} \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  -p ${EXPORTED_PORT}:8000 \
  --ipc=host \
  docker.io/vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json --chat-template examples/tool_chat_template_llama3.1_json.jinja
```

> The example command above enables tool calling for Llama 3.1 models.
> For other supported models and configuration options, see the vLLM documentation:
> [vLLM: Tool Calling](https://docs.vllm.ai/en/stable/features/tool_calling.html)

After starting the container, edit your `run.yaml` file so that `model_id` matches the model passed in the `podman run` command.

```yaml
[...]
models:
[...]
- model_id: meta-llama/Llama-3.1-8B-Instruct # Same as the model name in the 'podman run' command
  provider_id: vllm
  model_type: llm
  provider_model_id: null

providers:
  [...]
  inference:
  - provider_id: vllm
    provider_type: remote::vllm
    config:
      url: http://localhost:${env.EXPORTED_PORT:=8000}/v1/ # Replace localhost with the url of the vLLM instance
      api_token: <your-key-here> # if any
```

See the full working [config example](examples/vllm-llama-faiss-run.yaml) for more details.
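
Before pointing `run.yaml` at the container, it can help to confirm that vLLM is answering on its OpenAI-compatible API. This is a small sketch, assuming the server is reachable on `localhost:8000` (substitute your host and `EXPORTED_PORT`).

```python
# Quick check that the vLLM container is serving (sketch; assumes localhost:8000).
import requests

BASE_URL = "http://localhost:8000/v1"  # replace with your host and EXPORTED_PORT

# List the served models; the output should include meta-llama/Llama-3.1-8B-Instruct.
models = requests.get(f"{BASE_URL}/models", timeout=30).json()
print([m["id"] for m in models["data"]])

# Send a minimal request through the OpenAI-compatible chat completions endpoint.
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```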

### OpenAI example

Add a provider for your language model (e.g., OpenAI):

```yaml
@@ -133,6 +236,24 @@ export OPENAI_API_KEY=<your-key-here>
> When experimenting with different `models`, `providers` and `vector_dbs`, you might need to manually unregister the old ones with the Llama Stack client CLI (e.g. `llama-stack-client vector_dbs list`)

See the full working [config example](examples/openai-faiss-run.yaml) for more details.
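
The same cleanup can also be scripted with the `llama_stack_client` Python package instead of the CLI. This is a rough sketch, assuming a llama-stack server on `localhost:8321`; double-check the `unregister` call against the client version you have installed.

```python
# List current registrations and optionally remove a stale vector DB before
# re-registering it (sketch; assumes a llama-stack server on localhost:8321).
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Inspect what the running stack currently knows about.
for model in client.models.list():
    print("model:", model)
for vector_db in client.vector_dbs.list():
    print("vector_db:", vector_db)

# Remove an old vector DB registration so a new config can reuse the ID
# (verify this method exists in your llama-stack-client version).
# client.vector_dbs.unregister(vector_db_id="openshift-index")
```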

### Azure OpenAI

Not yet supported.

### Ollama

The `remote::ollama` provider can be used for inference. However, it does not support tool calling, including RAG.
While Ollama also exposes an OpenAI-compatible endpoint that supports tool calling, it cannot be used with `llama-stack` due to current limitations in the `remote::openai` provider.

There is an [ongoing discussion](https://github.com/meta-llama/llama-stack/discussions/3034) about enabling tool calling with Ollama.
Currently, tool calling is not supported out of the box. Some experimental patches exist (including internal workarounds), but these are not officially released.

### vLLM Mistral

The RAG tool calls were not working properly when experimenting with `mistralai/Mistral-7B-Instruct-v0.3` on vLLM.

---

# Complete Configuration Reference

examples/gemini-run.yaml

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
@@ -0,0 +1,112 @@
# Example llama-stack configuration for Google Gemini inference
#
# Contributed by @eranco74 (2025-08). See https://github.com/rh-ecosystem-edge/assisted-chat/blob/main/template.yaml#L282-L386
# This file shows how to integrate Gemini with LCS.
#
# Notes:
# - You will need valid Gemini API credentials to run this.
# - You will need a postgres instance to run this config.
#
version: 2
image_name: gemini-config
apis:
- agents
- datasetio
- eval
- files
- inference
- safety
- scoring
- telemetry
- tool_runtime
- vector_io
providers:
  inference:
  - provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
    provider_type: remote::gemini
    config:
      api_key: ${env.GEMINI_API_KEY}
  vector_io: []
  files: []
  safety: []
  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: postgres
        host: ${env.LLAMA_STACK_POSTGRES_HOST}
        port: ${env.LLAMA_STACK_POSTGRES_PORT}
        db: ${env.LLAMA_STACK_POSTGRES_NAME}
        user: ${env.LLAMA_STACK_POSTGRES_USER}
        password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
      responses_store:
        type: postgres
        host: ${env.LLAMA_STACK_POSTGRES_HOST}
        port: ${env.LLAMA_STACK_POSTGRES_PORT}
        db: ${env.LLAMA_STACK_POSTGRES_NAME}
        user: ${env.LLAMA_STACK_POSTGRES_USER}
        password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: "${LLAMA_STACK_OTEL_SERVICE_NAME}"
      sinks: ${LLAMA_STACK_TELEMETRY_SINKS}
      sqlite_db_path: ${STORAGE_MOUNT_PATH}/sqlite/trace_store.db
  eval: []
  datasetio: []
  scoring:
  - provider_id: basic
    provider_type: inline::basic
    config: {}
  - provider_id: llm-as-judge
    provider_type: inline::llm-as-judge
    config: {}
  tool_runtime:
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}
  - provider_id: model-context-protocol
    provider_type: remote::model-context-protocol
    config: {}
metadata_store:
  type: sqlite
  db_path: ${STORAGE_MOUNT_PATH}/sqlite/registry.db
inference_store:
  type: postgres
  host: ${env.LLAMA_STACK_POSTGRES_HOST}
  port: ${env.LLAMA_STACK_POSTGRES_PORT}
  db: ${env.LLAMA_STACK_POSTGRES_NAME}
  user: ${env.LLAMA_STACK_POSTGRES_USER}
  password: ${env.LLAMA_STACK_POSTGRES_PASSWORD}
models:
- metadata: {}
  model_id: ${LLAMA_STACK_2_0_FLASH_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_0_FLASH_MODEL}
  model_type: llm
- metadata: {}
  model_id: ${LLAMA_STACK_2_5_PRO_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_5_PRO_MODEL}
  model_type: llm
- metadata: {}
  model_id: ${LLAMA_STACK_2_5_FLASH_MODEL}
  provider_id: ${LLAMA_STACK_INFERENCE_PROVIDER}
  provider_model_id: ${LLAMA_STACK_2_5_FLASH_MODEL}
  model_type: llm
shields: []
vector_dbs: []
datasets: []
scoring_fns: []
benchmarks: []
tool_groups:
- toolgroup_id: builtin::rag
  provider_id: rag-runtime
- toolgroup_id: mcp::assisted
  provider_id: model-context-protocol
  mcp_endpoint:
    uri: "${MCP_SERVER_URL}"
server:
  port: ${LLAMA_STACK_SERVER_PORT}

examples/openai-faiss-run.yaml

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
@@ -0,0 +1,83 @@
# Example llama-stack configuration for OpenAI inference + FAISS (RAG)
#
# Notes:
# - You will need an OpenAI API key
# - You can generate the vector index with the rag-content tool (https://github.com/lightspeed-core/rag-content)
#
version: 2
image_name: openai-faiss-config

apis:
- agents
- inference
- vector_io
- tool_runtime
- safety

models:
- model_id: gpt-test
  provider_id: openai # This ID is a reference to 'providers.inference'
  model_type: llm
  provider_model_id: gpt-4o-mini

- model_id: sentence-transformers/all-mpnet-base-v2
  metadata:
    embedding_dimension: 768
  model_type: embedding
  provider_id: sentence-transformers # This ID is a reference to 'providers.inference'
  provider_model_id: /home/USER/lightspeed-stack/embedding_models/all-mpnet-base-v2

providers:
  inference:
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config: {}

  - provider_id: openai
    provider_type: remote::openai
    config:
      api_key: ${env.OPENAI_API_KEY}

  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: sqlite
        db_path: .llama/distributions/ollama/agents_store.db
      responses_store:
        type: sqlite
        db_path: .llama/distributions/ollama/responses_store.db

  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config:
      excluded_categories: []

  vector_io:
  - provider_id: ocp-docs
    provider_type: inline::faiss
    config:
      kvstore:
        type: sqlite
        db_path: /home/USER/lightspeed-stack/vector_dbs/ocp_docs/faiss_store.db
        namespace: null

  tool_runtime:
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}

# Enable the RAG tool
tool_groups:
- provider_id: rag-runtime
  toolgroup_id: builtin::rag
  args: null
  mcp_endpoint: null

vector_dbs:
- embedding_dimension: 768
  embedding_model: sentence-transformers/all-mpnet-base-v2
  provider_id: ocp-docs # This ID is a reference to 'providers.vector_io'
  vector_db_id: openshift-index # This ID was defined during index generation
