Commit 3daa75c: Update README.md (1 parent: 7eba311)

1 file changed: README.md (+117 additions, -19 deletions)

# RAG Pipelines

[![code checks](https://github.com/avnlp/rag-pipelines/actions/workflows/code_checks.yml/badge.svg)](https://github.com/avnlp/rag-pipelines/actions/workflows/code_checks.yml)
[![tests](https://github.com/avnlp/rag-pipelines/actions/workflows/tests.yml/badge.svg)](https://github.com/avnlp/rag-pipelines/actions/workflows/tests.yml)
[![codecov](https://codecov.io/github/avnlp/rag-pipelines/graph/badge.svg?token=83MYFZ3UPA)](https://codecov.io/github/avnlp/rag-pipelines)
![GitHub License](https://img.shields.io/github/license/avnlp/rag-pipelines)

This repository contains advanced Retrieval-Augmented Generation (RAG) pipelines designed for domain-specific tasks.

The RAG pipelines follow a standardized architecture:

- [**LangGraph**](https://www.langchain.com/langgraph) for workflow orchestration.
- [**Unstructured**](https://unstructured.io/) for document processing.
- [**Milvus**](https://milvus.io/) vector database for hybrid search and retrieval.
- [**DeepEval**](https://deepeval.com/) for comprehensive evaluation metrics.
- [**Confident AI**](https://www.confident-ai.com/) for tracing and debugging.

Each pipeline is configured through YAML files that allow flexible customization of document processing, retrieval strategies, and generation parameters.
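
For illustration, a configuration file might be loaded along these lines; the file path and keys shown below are hypothetical, not the repository's actual schema:

```python
# Minimal sketch of reading a pipeline configuration with PyYAML.
# The path and keys are placeholders; the real YAML schema may differ.
import yaml

with open("configs/healthbench.yaml") as f:
    config = yaml.safe_load(f)

chunking = config.get("chunking", {})      # e.g. strategy, chunk size, overlap
retrieval = config.get("retrieval", {})    # e.g. top_k, hybrid-search weights
generation = config.get("generation", {})  # e.g. model name, temperature
```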

## Datasets

The project includes several domain-specific datasets on which the pipelines are evaluated:

- [**HealthBench**](https://openai.com/index/healthbench/): A comprehensive benchmark for evaluating medical AI systems with multi-turn conversations and expert rubric evaluations.
- [**MedCaseReasoning**](https://github.com/kevinwu23/Stanford-MedCaseReasoning): Dataset containing medical case studies with detailed reasoning processes.
- [**MetaMedQA**](https://github.com/maximegmd/MetaMedQA-benchmark): Medical question answering dataset based on USMLE textbook content.
- [**PubMedQA**](https://pubmedqa.github.io/): Biomedical question answering dataset based on PubMed articles.
- [**FinanceBench**](https://github.com/patronus-ai/financebench): Question answering dataset comprising questions about public filings, including 10-Ks, 10-Qs, 8-Ks, and earnings calls.
- [**Earnings Calls**](https://huggingface.co/datasets/lamini/earnings-calls-qa): Financial question answering dataset based on earnings call transcripts from over 2,800 companies.

## Pipeline Architecture

Each pipeline follows a consistent architecture with the following nodes (a minimal orchestration sketch follows the list):

- **Indexing**: Processes raw documents from the dataset, chunks them using unstructured.io data processors, extracts metadata with an LLM-powered extractor, and stores them in a Milvus vector database with BM25 hybrid search capability. The indexing process also applies metadata schemas to ensure consistent metadata across documents.
- **Metadata Extraction**: Uses an LLM-powered extractor to parse the input question and produce a structured filter dictionary based on a predefined JSON schema. This output is used to constrain document retrieval to relevant subsets (e.g., by publication year or study type), improving retrieval precision.
- **Document Retrieval**: Retrieves relevant documents using a configured retriever (typically backed by the Milvus vector store) based on the input question and the optional metadata filter from the previous step. The retrieved documents are converted into both raw Document objects and plain text for downstream use.
- **Document Reranking**: Reranks the retrieved documents based on their relevance to the query using a specialized contextual reranker model. This step improves the relevance of documents used for answer generation by reordering them according to their contextual similarity to the query.
- **Answer Generation**: Generates an answer using an LLM conditioned on the retrieved context. This node uses a pre-defined prompt template that injects the context and question into the LLM call, producing a raw string response that forms the final answer.
- **Evaluation**: Evaluates the generated response against the ground truth using DeepEval metrics. This node constructs an LLMTestCase from the ground truth, generated answer, and retrieved context, then runs a suite of pre-configured metrics including contextual recall, contextual precision, contextual relevancy, answer relevancy, and faithfulness.
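
The sketch below shows how such a node sequence could be assembled with LangGraph's `StateGraph`. The state fields and node bodies are placeholders rather than the repository's actual implementations:

```python
# Minimal LangGraph sketch of the node flow described above.
# State fields and node bodies are placeholders, not the repository's code.
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class RAGState(TypedDict, total=False):
    question: str
    metadata_filter: dict
    documents: list
    answer: str


def extract_metadata(state: RAGState) -> RAGState:
    # LLM-powered extraction of a structured filter from the question.
    return {"metadata_filter": {}}


def retrieve(state: RAGState) -> RAGState:
    # Hybrid search against Milvus, constrained by the metadata filter.
    return {"documents": []}


def rerank(state: RAGState) -> RAGState:
    # Contextual reranking of the retrieved documents.
    return {"documents": state.get("documents", [])}


def generate(state: RAGState) -> RAGState:
    # LLM answer generation conditioned on the reranked context.
    return {"answer": ""}


def evaluate(state: RAGState) -> RAGState:
    # DeepEval metrics over the generated answer and retrieved context.
    return {}


builder = StateGraph(RAGState)
for name, node in [
    ("extract_metadata", extract_metadata),
    ("retrieve", retrieve),
    ("rerank", rerank),
    ("generate", generate),
    ("evaluate", evaluate),
]:
    builder.add_node(name, node)

builder.add_edge(START, "extract_metadata")
builder.add_edge("extract_metadata", "retrieve")
builder.add_edge("retrieve", "rerank")
builder.add_edge("rerank", "generate")
builder.add_edge("generate", "evaluate")
builder.add_edge("evaluate", END)

graph = builder.compile()
result = graph.invoke({"question": "What is the first-line treatment for hypertension?"})
```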

## Components

### Contextual Ranker

The ContextualReranker uses reranker models from [Contextual AI](https://contextual.ai/blog/introducing-instruction-following-reranker) to reorder documents based on their relevance to a given query (see the usage sketch after this list).

- Uses the contextual-rerank models from Hugging Face for reranking.
- Supports custom instructions to refine query context during reranking.
- Uses model logits for scoring document relevance.
- Preserves document metadata during reranking.
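
A hypothetical usage sketch is shown below. The import path, constructor arguments, and `rerank` method are illustrative assumptions about this repository's class, not its verified interface, and the model id is a placeholder:

```python
# Hypothetical usage of the ContextualReranker; the import path, constructor
# arguments, method name, and model id are assumptions, not verified APIs.
from langchain_core.documents import Document

from rag_pipelines.rerankers import ContextualReranker  # hypothetical import path

retrieved_docs = [
    Document(page_content="ACE inhibitors are recommended ...", metadata={"year": 2021}),
    Document(page_content="Thiazide diuretics are an option ...", metadata={"year": 2019}),
]

reranker = ContextualReranker(
    model="contextual-rerank-model-id",                     # placeholder model id
    instruction="Prioritize recent clinical guidelines.",   # optional custom instruction
)

# Documents keep their metadata; only their order (and relevance scores) change.
reranked_docs = reranker.rerank(
    query="What is the first-line treatment for hypertension?",
    documents=retrieved_docs,
)
```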

### Metadata Extractor

The MetadataExtractor extracts structured metadata from text using a language model and a user-specified JSON schema (a minimal sketch of the schema-to-model conversion follows the list).

- Uses LLMs with structured-output generation for metadata extraction.
- Dynamically converts the JSON schema into Pydantic models for type safety and validation.
- Only includes successfully extracted (non-null) fields in the results.
- Supports string, number, and boolean field types with optional enums.
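
The schema-to-model step can be illustrated with `pydantic.create_model`, which is the general technique described above; the schema here is made up for illustration and enum constraints are omitted:

```python
# Minimal sketch: turning a user-specified JSON schema into a Pydantic model
# that can back structured-output generation. The schema is illustrative only.
from typing import Optional

from pydantic import create_model

json_schema = {
    "properties": {
        "publication_year": {"type": "number"},
        "study_type": {"type": "string", "enum": ["RCT", "cohort", "case_report"]},
        "peer_reviewed": {"type": "boolean"},
    }
}

TYPE_MAP = {"string": str, "number": float, "boolean": bool}

# Every field is optional so that missing (null) values can simply be dropped.
fields = {
    name: (Optional[TYPE_MAP[spec["type"]]], None)
    for name, spec in json_schema["properties"].items()
}
MetadataFilter = create_model("MetadataFilter", **fields)

extracted = MetadataFilter(publication_year=2021, study_type="RCT")
# Keep only successfully extracted (non-null) fields, as the extractor does.
print(extracted.model_dump(exclude_none=True))
# {'publication_year': 2021.0, 'study_type': 'RCT'}
```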

### Unstructured Document Loaders and Chunker

**UnstructuredAPIDocumentLoader**: Loads and transforms documents using the Unstructured API. It supports extracting text, tables, and images from various document formats.

**UnstructuredDocumentLoader**: Loads and transforms PDF documents using the Unstructured API with various processing strategies.

**UnstructuredChunker**: Chunks documents using different strategies from the `unstructured` library, supporting "basic" and "by_title" chunking approaches (see the sketch after this list).

- Support for multiple document formats (PDF, DOCX, PPTX, etc.).
- Various processing strategies (hi_res, auto, fast).
- Configurable chunking with overlap and size parameters.
- Metadata preservation during document processing.
- Recursive directory processing for batch document loading.
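
As a rough sketch of the underlying `unstructured` calls, using the open-source library directly (the repository's wrapper classes layer metadata handling and batch processing on top, and their exact interfaces may differ; the file path is a placeholder):

```python
# Partition a PDF locally and chunk it "by_title" with the `unstructured` library.
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# "hi_res" is the slower, layout-aware strategy; "fast" and "auto" also exist.
elements = partition_pdf(filename="example_filing.pdf", strategy="hi_res")

# Chunk by section titles with configurable size and overlap.
chunks = chunk_by_title(elements, max_characters=1000, overlap=100)

for chunk in chunks:
    print(chunk.metadata.filename, len(chunk.text))
```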

## Installation
The project uses [uv](https://github.com/astral-sh/uv) for dependency management. First, ensure uv is installed:

```bash
# Install uv (if not already installed)
pip install uv
```

Then install the project dependencies:

```bash
# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate
```

## Usage

### Environment Setup

Create a `.env` file in the project root with the required environment variables:

```env
GROQ_API_KEY=your_groq_api_key
MILVUS_URI=your_milvus_uri
MILVUS_TOKEN=your_milvus_token
UNSTRUCTURED_API_KEY=your_unstructured_api_key
```
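
These variables can be loaded at runtime, for example with python-dotenv; whether the scripts in this repository actually use python-dotenv is an assumption:

```python
# Load the .env file and read a variable (python-dotenv usage is an assumption).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory / project root
milvus_uri = os.environ["MILVUS_URI"]
```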
### Indexing

Each dataset module includes an indexing script to process and store documents in the vector database.

Example for HealthBench:

```bash
cd src/rag_pipelines/healthbench
python healthbench_indexing.py
```

### RAG Evaluation

Each dataset module includes a RAG evaluation script to test pipeline performance.

Example for HealthBench:

```bash
cd src/rag_pipelines/healthbench
python healthbench_rag.py
```

## Contributing

Please see the [CONTRIBUTING.md](CONTRIBUTING.md) file for detailed contribution guidelines.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
