llm-rag-assistant is a fully local, retrieval-augmented chatbot powered by llama-cpp-python, designed to answer questions in Spanish using your own Q&A dataset. It uses semantic search via FAISS + multilingual sentence-transformers to retrieve relevant answers and combines them with a local instruction-tuned LLM (e.g., Mistral-7B-Instruct in GGUF format) for contextual response generation.
- 🔍 Semantic Search with multilingual embeddings (sentence-transformers)
- 🧠 Local LLM inference without a GPU using optimized GGUF models + llama-cpp-python
- 💻 Runs on standard laptops and desktops — no CUDA, no GPU, no special hardware required
- 🔒 No API keys, no cloud dependency — fully private and offline
- 🌐 Instant web interface with Streamlit
- 🐳 Docker & Docker Compose ready for easy deployment
- 🗂️ Plug-and-play with any Q&A dataset in JSON format
This package lets you run a console chatbot with semantic retrieval (RAG) on your machine, with no need for a GPU or external connection.
This version works in the console. For a UI version, see the Streamlit version.
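To make the flow concrete, below is a minimal sketch of the retrieve-then-generate step, assuming the scibot_index.faiss index and the aligned qa.json list produced by prepare_embeddings.py. The prompt wording, file paths, and the answer() helper are illustrative assumptions, not the exact code in this repository.

```python
# Minimal retrieve-then-generate sketch (illustrative; paths, prompt and helper name are assumptions).
import json
import faiss
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama

embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
index = faiss.read_index("scibot_index.faiss")      # built by prepare_embeddings.py
with open("qa.json", encoding="utf-8") as f:
    qa = json.load(f)                               # same order as the indexed vectors
llm = Llama(model_path="models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048, verbose=False)

def answer(question: str, k: int = 3) -> str:
    # 1) Semantic retrieval: embed the question and fetch the k closest Q&A pairs.
    query_vec = embedder.encode([question], convert_to_numpy=True)
    _, idx = index.search(query_vec, k)
    context = "\n".join(f"P: {qa[i]['pregunta']}\nR: {qa[i]['respuesta']}" for i in idx[0])
    # 2) Generation: the instruction-tuned model composes an answer from the retrieved context.
    prompt = (
        "[INST] Responde en español usando únicamente el contexto.\n\n"
        f"Contexto:\n{context}\n\nPregunta: {question} [/INST]"
    )
    out = llm(prompt, max_tokens=256, temperature=0.2)
    return out["choices"][0]["text"].strip()

print(answer("¿Cuál es el horario de atención?"))
```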
- Python 3.9+
- Install dependencies: pip install llama-cpp-python faiss-cpu sentence-transformers
Tested with Python 3.13.5; specific versions are listed in environment.yml.
# On macOS, if the build fails, try:
conda install -c conda-forge llama-cpp-python
pip install faiss-cpu sentence-transformers
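Optionally, a quick sanity check that all three libraries import correctly (a convenience snippet, not part of the project):

```python
# Sanity check: confirm the three dependencies import and report their versions.
import faiss
import llama_cpp
import sentence_transformers

print("faiss:", faiss.__version__)
print("llama-cpp-python:", llama_cpp.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
```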
- Download the GGUF model:
For example
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf -O mistral-7b-instruct.Q4_K_M.gguf
Mistral-7B-Instruct-v0.1 is an open-source model released under the Apache 2.0 license: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
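Once downloaded, you can check that the GGUF file loads with llama-cpp-python before wiring up the rest of the pipeline (the path and prompt below are only examples; adjust the path to wherever you saved the model):

```python
# Smoke test: load the quantized model on CPU and generate a short completion.
from llama_cpp import Llama

llm = Llama(model_path="models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048, verbose=False)
out = llm("[INST] Saluda en una frase. [/INST]", max_tokens=32)
print(out["choices"][0]["text"].strip())
```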
- Build a question-and-answer dataset
Important: save it as qa_dataset.json
It should have the following structure (example):
[
  {
    "pregunta": "¿Cuál es el horario de atención?",
    "respuesta": "Nuestro horario de atención es de lunes a viernes de 9:00 a 18:00 horas y sábados de 9:00 a 14:00."
  },
  {
    "pregunta": "¿Cómo puedo contactar con soporte técnico?",
    "respuesta": "Puede contactar con soporte técnico a través del email [email protected], llamando al 900-123-456 o mediante el chat en vivo de nuestra web."
  },
  ...
]
- Create the config.yaml file for the RAG system configuration
For example
models:
  embeddings:
    model_name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
  generation:
    llama_cpp_model_path: "models/mistral-7b-instruct.Q4_K_M.gguf"
    max_tokens: 256
Note: To work with this type of Q&A dataset, you need an instruction-tuned model.
- Optionally add a temperature setting to the generation section (see the loading sketch below)
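A minimal sketch of how such a config could be read with PyYAML, falling back to a default when temperature is not set; the default values and any keys beyond those shown above are assumptions, not the project's actual loader:

```python
# Load config.yaml and read generation settings, with a fallback temperature.
import yaml

with open("config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

gen_cfg = config["models"]["generation"]
model_path = gen_cfg["llama_cpp_model_path"]
max_tokens = gen_cfg.get("max_tokens", 256)
temperature = gen_cfg.get("temperature", 0.2)   # assumed key; add it to config.yaml as needed
print(model_path, max_tokens, temperature)
```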
- prepare_embeddings.py → generates scibot_index.faiss and qa.json from your dataset (sketched after the run steps below)
- app.py → runs the Streamlit app
- qa_dataset.json → your knowledge base
Use Docker Compose (see below) or run manually:
- Run: python prepare_embeddings.py
- Run: streamlit run app.py
- Chat with your knowledge base using a Spanish bot :)
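For reference, the indexing step performed by prepare_embeddings.py roughly amounts to the following (a simplified sketch under the assumptions above, not the script itself):

```python
# Simplified sketch of the indexing step: embed the questions and persist a FAISS index.
import json
import faiss
from sentence_transformers import SentenceTransformer

with open("qa_dataset.json", encoding="utf-8") as f:
    qa = json.load(f)

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode([item["pregunta"] for item in qa], convert_to_numpy=True)

index = faiss.IndexFlatL2(embeddings.shape[1])   # exact (brute-force) L2 index
index.add(embeddings)
faiss.write_index(index, "scibot_index.faiss")

# Keep the Q&A pairs alongside the index, in the same order as the vectors.
with open("qa.json", "w", encoding="utf-8") as f:
    json.dump(qa, f, ensure_ascii=False, indent=2)
```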
- 8GB RAM minimum (16GB recommended)
- ~5GB of space for the models
docker-compose build
docker-compose up -d
docker-compose down
docker-compose logs -f
Open your browser at: http://localhost:8501
# Rebuild from scratch
docker-compose build --no-cache
# Execute inside the container
docker-compose exec rag-app python prepare_embeddings.py