This project provides an end-to-end interface for RAG (retrieval-augmented generation) over documents.
- LLM: llama3.2, self-hosted with ollama
- Python 3.11.9
- LangChain 0.3.26
- instructor 1.9.0
- chainlit 2.6.0
Document-RAGx aims to provide a universal interface for hybrid retrieval.
- Preprocessing Pipeline
1. A list of documents is parsed with PyMuPDF via LangChain.
2. Parsed documents are split using RecursiveCharacterTextSplitter.
3. Keywords are extracted from each chunk via structured output generation with instructor, powered by llama3.2.
4. The extracted keywords and chunk embeddings are stored in a Chroma vector DB.
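The four steps above map roughly onto the sketch below. It is illustrative, not the project's actual `preprocess.py`: the PDF path, chunk sizes, `KeywordList` schema, prompt, embedding model (`nomic-embed-text`), and persist directory are all assumptions.

```python
# Illustrative sketch of the preprocessing pipeline (not the actual preprocess.py).
import instructor
from openai import OpenAI
from pydantic import BaseModel
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings


class KeywordList(BaseModel):
    keywords: list[str]


# 1. Parse a document with PyMuPDF via LangChain.
docs = PyMuPDFLoader("./data/example.pdf").load()

# 2. Split parsed documents into chunks.
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# 3. Extract keywords per chunk as structured output from llama3.2,
#    served through Ollama's OpenAI-compatible endpoint.
client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,
)
for chunk in chunks:
    result = client.chat.completions.create(
        model="llama3.2",
        response_model=KeywordList,
        messages=[{
            "role": "user",
            "content": f"Extract the key terms from this text:\n{chunk.page_content}",
        }],
    )
    # Chroma metadata values must be scalars, so join the keywords into one string.
    chunk.metadata["keywords"] = ", ".join(result.keywords)

# 4. Store chunks (keywords in metadata) and their embeddings in Chroma.
Chroma.from_documents(
    chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),  # embedding model is an assumption
    persist_directory="./chroma_db",
)
```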
- Retrieval & Generation Pipeline
1. Keywords are extracted from the user's query.
2. The extracted keywords and the query embedding are used together for hybrid retrieval.
3. The retrieved chunks are passed to the LLM for answer generation.
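A minimal sketch of how these three steps could fit together, assuming keywords are stored as a comma-separated metadata string as in the preprocessing sketch; the blend weight `alpha`, the keyword stub, and the prompt are assumptions, not the project's actual `naive` strategy.

```python
# Illustrative sketch of "naive" hybrid retrieval + generation: blend vector
# similarity with keyword overlap, then answer from the retrieved context.
from langchain_ollama import ChatOllama


def extract_keywords(text: str) -> set[str]:
    # Stand-in: the project extracts query keywords with instructor + llama3.2.
    return {w.lower().strip(",.?!") for w in text.split() if len(w) > 3}


def hybrid_retrieve(vectordb, query: str, k: int = 4, alpha: float = 0.7):
    query_kw = extract_keywords(query)
    # Over-fetch by vector similarity, then rerank with a keyword-overlap bonus.
    candidates = vectordb.similarity_search_with_relevance_scores(query, k=4 * k)
    scored = []
    for doc, vec_score in candidates:
        doc_kw = set(doc.metadata.get("keywords", "").lower().split(", "))
        kw_score = len(query_kw & doc_kw) / max(len(query_kw), 1)
        scored.append((alpha * vec_score + (1 - alpha) * kw_score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def answer(vectordb, query: str) -> str:
    context = "\n\n".join(d.page_content for d in hybrid_retrieve(vectordb, query))
    llm = ChatOllama(model="llama3.2")
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt).content
```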
The chat interface is implemented using chainlit.
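For orientation, here is a minimal sketch of what the chainlit entry point (`app.py`) might look like. The persisted Chroma path and model names are assumptions carried over from the earlier sketches, and plain vector search stands in for the hybrid retriever.

```python
# Minimal sketch of a chainlit entry point (app.py); paths and models are assumptions.
import chainlit as cl
from langchain_chroma import Chroma
from langchain_ollama import ChatOllama, OllamaEmbeddings

vectordb = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
)
llm = ChatOllama(model="llama3.2")


@cl.on_message
async def on_message(message: cl.Message):
    # Retrieve context (plain vector search here; the project uses hybrid retrieval).
    docs = await vectordb.asimilarity_search(message.content, k=4)
    context = "\n\n".join(d.page_content for d in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {message.content}"
    reply = await llm.ainvoke(prompt)
    await cl.Message(content=reply.content).send()
```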
Run the preprocessing pipeline:

```
python preprocess.py --data_dir ./data --strategy naive --mode new
```
- `data_dir` can be an arbitrary directory containing PDF files.
- `strategy` should be `naive`. Currently only the `naive` strategy (simple hybrid retrieval & generation) is implemented.
- `mode` can be `new` or `add`. In `new` mode, the script generates a new `snapshot.json` that represents all the documents inside the data directory and creates the knowledge base. In `add` mode, it determines which files are new based on the original `snapshot.json` and adds only those documents to the knowledge base.
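For illustration, the snapshot logic could look like the sketch below. The actual layout of `snapshot.json` is unknown; a flat JSON list of file paths is assumed here.

```python
# Hypothetical illustration of the snapshot logic; the real snapshot.json
# layout may differ.
import json
from pathlib import Path


def write_snapshot(data_dir: str, snapshot_path: str = "snapshot.json") -> None:
    # `new` mode: record every PDF currently in the data directory.
    current = sorted(str(p) for p in Path(data_dir).glob("*.pdf"))
    Path(snapshot_path).write_text(json.dumps(current, indent=2))


def new_files(data_dir: str, snapshot_path: str = "snapshot.json") -> list[Path]:
    # `add` mode: index only files missing from the previous snapshot.
    seen = set(json.loads(Path(snapshot_path).read_text()))
    current = {str(p) for p in Path(data_dir).glob("*.pdf")}
    return [Path(p) for p in sorted(current - seen)]
```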
Then run the chat interface:

- Activate the virtual environment: `.venv\Scripts\activate`
- Make sure you have installed the required packages.
- Run `chainlit run app.py`.
- Visit http://localhost:8000.
If you have any issues, please visit here and leave me a message! :-)