Retrieval Augmented Generation (RAG) is a powerful method that enhances Large Language Models (LLMs) by combining their generative capabilities with relevant information retrieved from external databases. RAG enables chatbots and similar applications to produce contextually accurate and up-to-date responses by fetching pertinent information from a knowledge base or documents at runtime.
In RAG, documents are embedded into vector representations and stored in a Vector Database (VectorDB). When a user poses a query, the system retrieves the most relevant document embeddings and provides them as context to the LLM, resulting in more informed and precise answers.
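As a concrete illustration of this embed-store-retrieve loop, here is a minimal sketch using FAISS and a sentence-transformers embedding model; the model name, documents, and query are illustrative placeholders, not part of this tutorial's dataset.

# Minimal embed-store-retrieve sketch; the embedding model, documents,
# and query below are illustrative placeholders.
import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "The Raspberry Pi 5 is built around Arm Cortex-A76 CPU cores.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_vectors = embedder.encode(documents).astype("float32")

# Store the document vectors in a flat L2 index (the VectorDB).
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

# At query time, embed the question and fetch the closest document as context.
query_vector = embedder.encode(["Which CPU cores does the Raspberry Pi 5 use?"]).astype("float32")
_, nearest = index.search(query_vector, 1)
context = documents[nearest[0][0]]
print(context)  # this text is passed to the LLM alongside the user's question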
This tutorial demonstrates how to build a RAG-enabled chatbot optimized for the Arm architecture using open-source technologies such as llama-cpp-python and FAISS. Designed for a Raspberry Pi 5 (8GB RAM, at least 32GB of disk), the chatbot uses FAISS for document retrieval and the Llama-3.1-8B model for response generation, leveraging llama-cpp-python's optimized backend for high-performance inference on Arm CPUs.
First, clone this repository to your Raspberry Pi:
cd ~
git clone https://github.com/jc2409/RAG_Raspberry_Pi5.git
cd RAG_Raspberry_Pi5
Run the following commands to install necessary packages:
sudo apt update
sudo apt install python3-pip python3-venv cmake -y
Set up the virtual environment:
python3 -m venv rag-env
source rag-env/bin/activate
pip install -r requirements.txt
Install llama-cpp-python optimized for Arm CPUs in the RAG_Raspberry_Pi5 folder:
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
Create a models directory in the RAG_Raspberry_Pi5 folder and download the model:
mkdir models
cd models
wget https://huggingface.co/chatpdflocal/llama3.1-8b-gguf/resolve/main/ggml-model-Q4_K_M.gguf
Clone and build llama.cpp:
cd ~/RAG_Raspberry_Pi5
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DCMAKE_CXX_FLAGS="-mcpu=native" -DCMAKE_C_FLAGS="-mcpu=native" -DLLAMA_CURL=OFF
cmake --build . -v --config Release -j $(nproc)
Quantize the model:
cd bin
./llama-quantize --allow-requantize ~/RAG_Raspberry_Pi5/models/ggml-model-Q4_K_M.gguf ~/RAG_Raspberry_Pi5/models/llama3.1-8b-instruct.Q4_0_arm.gguf Q4_0
With your virtual environment active, run llm.py to verify basic inference:
cd ~/RAG_Raspberry_Pi5
source rag-env/bin/activate
python llm.py
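For reference, a basic llama-cpp-python inference call looks like the sketch below; the actual llm.py in the repository may differ, and the prompt and generation parameters here are illustrative.

# Minimal llama-cpp-python inference sketch; llm.py in the repository may differ.
from llama_cpp import Llama

# Load the Q4_0 model requantized for Arm in the previous step.
llm = Llama(
    model_path="models/llama3.1-8b-instruct.Q4_0_arm.gguf",
    n_ctx=2048,    # context window; adjust to fit the Pi's 8GB of RAM
    n_threads=4,   # the Raspberry Pi 5 has 4 Cortex-A76 cores
)

output = llm(
    "Q: What is Retrieval Augmented Generation? A:",
    max_tokens=128,
    stop=["Q:"],
)
print(output["choices"][0]["text"].strip())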
To set up RAG with your dataset:
- Import Sample Data:
Ensure you have a Kaggle API token configured (see the Kaggle API documentation), then import the data:
python import_data.py
- Embed and Store Data in VectorDB:
Run the embedding script:
python vector_embedding.py
⚠️ Note: This step can take several hours, depending on the size of your dataset and your hardware.
💡 Tip: You can reduce the time by:
- Using a smaller dataset
- Reducing the number of documents to embed
- Lowering the embedding model size (e.g., switching to a smaller transformer model)
- Run the RAG Application:
Test the chatbot with retrieval capabilities (a minimal sketch of this retrieve-then-generate flow follows these steps):
python rag.py
Your chatbot is now configured to generate informed responses using a combination of embedded documents and the LLM's generative strengths.
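To make the retrieve-then-generate flow concrete, here is a minimal, self-contained sketch of what a script like rag.py does; the embedding model, documents, prompt template, and generation parameters are assumptions for illustration and may differ from the repository's actual code.

# Minimal retrieve-then-generate sketch; the embedding model, documents,
# prompt template, and parameters are illustrative and may differ from rag.py.
import faiss
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer

documents = [
    "Abraham Lincoln's formal schooling was brief and intermittent.",
    "The Raspberry Pi 5 is built around Arm Cortex-A76 CPU cores.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
index = faiss.IndexFlatL2(embedder.get_sentence_embedding_dimension())
index.add(embedder.encode(documents).astype("float32"))

llm = Llama(
    model_path="models/llama3.1-8b-instruct.Q4_0_arm.gguf",  # quantized earlier
    n_ctx=4096,
    n_threads=4,
)

def answer(question, k=1):
    # Retrieve the k closest documents and ground the prompt in them.
    query_vector = embedder.encode([question]).astype("float32")
    _, ids = index.search(query_vector, k)
    context = "\n".join(documents[i] for i in ids[0])
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    out = llm(prompt, max_tokens=256, stop=["Question:"])
    return out["choices"][0]["text"].strip()

print(answer("How long was Lincoln's formal education?"))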
We evaluated the performance of the RAG-enabled chatbot by comparing responses from two versions of the LLM—one without context (basic LLM) and one utilizing context (RAG-enabled LLM).
When the user asked the question "How long was Lincoln's formal education?", the basic LLM gave an incorrect answer of 12 years because it lacked accurate contextual information.
In contrast, the RAG-enabled LLM successfully retrieved relevant information from the VectorDB and provided an accurate response based on the retrieved context.
The data stored in the vector database contained the information about Lincoln's education that the model retrieved to ground its answer.
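To reproduce this kind of side-by-side comparison, you can ask the same question with and without retrieval; the snippet below continues the retrieve-then-generate sketch above and reuses its llm object and answer() helper, which are illustrative rather than part of the repository.

# Side-by-side comparison; continues the retrieve-then-generate sketch above.
question = "How long was Lincoln's formal education?"

# Basic LLM: the model answers from its weights alone, with no retrieved context.
plain = llm(f"Question: {question}\nAnswer:", max_tokens=64, stop=["Question:"])
print("Basic LLM:", plain["choices"][0]["text"].strip())

# RAG-enabled LLM: the prompt is grounded in documents retrieved from the FAISS index.
print("RAG-enabled LLM:", answer(question))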