# RAG Pipelines

[Code Checks](https://github.com/avnlp/rag-pipelines/actions/workflows/code_checks.yml)
[Tests](https://github.com/avnlp/rag-pipelines/actions/workflows/tests.yml)
[Coverage](https://codecov.io/github/avnlp/rag-pipelines)
This repository contains advanced Retrieval-Augmented Generation (RAG) pipelines designed for domain-specific tasks.

The RAG pipelines follow a standardized architecture:

- [**LangGraph**](https://www.langchain.com/langgraph) for workflow orchestration.
- [**Unstructured**](https://unstructured.io/) for document processing.
- [**Milvus**](https://milvus.io/) vector database for hybrid search and retrieval.
- [**DeepEval**](https://deepeval.com/) for comprehensive evaluation metrics.
- [**Confident AI**](https://www.confident-ai.com/) for tracing and debugging.

Each pipeline is configured through a YAML file that allows flexible customization of document processing, retrieval strategies, and generation parameters.
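
As an illustration, a pipeline script might load such a configuration as follows. This is a minimal sketch only: the file name and section keys are hypothetical, not the repository's actual configuration schema.

```python
# Minimal sketch of loading a pipeline's YAML configuration.
# The file path and section keys below are hypothetical examples.
import yaml

with open("configs/healthbench.yaml") as f:  # hypothetical path
    config = yaml.safe_load(f)

# Hypothetical sections a pipeline config might expose.
chunking_params = config.get("chunking", {})      # e.g. chunk size, overlap
retrieval_params = config.get("retrieval", {})    # e.g. top_k, hybrid weights
generation_params = config.get("generation", {})  # e.g. model name, temperature

print(chunking_params, retrieval_params, generation_params)
```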
## Datasets

The RAG pipelines are evaluated on the following domain-specific datasets (a loading example follows the list):

- [**HealthBench**](https://openai.com/index/healthbench/): A comprehensive benchmark for evaluating medical AI systems, featuring multi-turn conversations and expert rubric evaluations.
- [**MedCaseReasoning**](https://github.com/kevinwu23/Stanford-MedCaseReasoning): Medical case studies with detailed reasoning processes.
- [**MetaMedQA**](https://github.com/maximegmd/MetaMedQA-benchmark): Medical question answering dataset based on USMLE textbook content.
- [**PubMedQA**](https://pubmedqa.github.io/): Biomedical question answering dataset based on PubMed articles.
- [**FinanceBench**](https://github.com/patronus-ai/financebench): Question answering dataset comprising questions about public filings, including 10-Ks, 10-Qs, 8-Ks, and earnings calls.
- [**Earnings Calls**](https://huggingface.co/datasets/lamini/earnings-calls-qa): Financial question answering dataset based on earnings call transcripts from over 2,800 companies.
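
For example, the Earnings Calls dataset is hosted on the Hugging Face Hub and could be loaded with the `datasets` library; the split name below is an assumption about the dataset layout.

```python
# Minimal sketch of loading one of the datasets above from the Hugging Face Hub.
# The split name is an assumption; inspect the dataset card for the real layout.
from datasets import load_dataset

earnings_calls = load_dataset("lamini/earnings-calls-qa", split="train")
print(earnings_calls[0])  # inspect one question-answer record
```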
## Pipeline Architecture

Each pipeline follows a consistent architecture with the following nodes (a minimal orchestration sketch follows this list):

- **Indexing**: Processes raw documents from the dataset, chunks them using unstructured.io data processors, extracts metadata with LLM-powered extraction, and stores them in a Milvus vector database with BM25 hybrid search. The indexing step also applies metadata schemas to keep metadata consistent across documents.

- **Metadata Extraction**: Uses an LLM-powered extractor to parse the input question and produce a structured filter dictionary based on a predefined JSON schema. The filter constrains retrieval to relevant subsets of documents (e.g., by publication year or study type) and improves retrieval precision.

- **Document Retrieval**: Retrieves relevant documents with the configured retriever (typically the Milvus vector store) using the input question and the optional metadata filter from the previous step. Retrieved documents are kept both as raw Document objects and as plain text for downstream use.

- **Document Reranking**: Reorders the retrieved documents by their contextual relevance to the query using a contextual reranker model, improving the quality of the context passed to answer generation.

- **Answer Generation**: Generates an answer with an LLM conditioned on the retrieved context. A predefined prompt template injects the context and question into the LLM call, and the raw string response forms the final answer.

- **Evaluation**: Evaluates the generated response against the ground truth using DeepEval metrics. This node builds an LLMTestCase from the ground truth, the generated answer, and the retrieved context, then runs a suite of pre-configured metrics: contextual recall, contextual precision, contextual relevancy, answer relevancy, and faithfulness.
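
The node flow above can be pictured as a LangGraph workflow. The following is an illustrative sketch only: the state fields, node names, and placeholder node bodies are assumptions, not the repository's actual implementation.

```python
# Illustrative LangGraph wiring for the node flow described above.
# State fields, node names, and node bodies are hypothetical placeholders.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END


class RAGState(TypedDict, total=False):
    question: str
    metadata_filter: dict
    documents: list
    context: str
    answer: str


def extract_metadata(state: RAGState) -> dict:
    # Placeholder: an LLM would produce a structured filter here.
    return {"metadata_filter": {}}


def retrieve(state: RAGState) -> dict:
    # Placeholder: a Milvus hybrid-search retriever would run here.
    return {"documents": [], "context": ""}


def rerank(state: RAGState) -> dict:
    # Placeholder: a contextual reranker would reorder documents here.
    return {"documents": state.get("documents", [])}


def generate(state: RAGState) -> dict:
    # Placeholder: an LLM call conditioned on the retrieved context.
    return {"answer": "..."}


graph = StateGraph(RAGState)
graph.add_node("extract_metadata", extract_metadata)
graph.add_node("retrieve", retrieve)
graph.add_node("rerank", rerank)
graph.add_node("generate", generate)

graph.add_edge(START, "extract_metadata")
graph.add_edge("extract_metadata", "retrieve")
graph.add_edge("retrieve", "rerank")
graph.add_edge("rerank", "generate")
graph.add_edge("generate", END)

app = graph.compile()
result = app.invoke({"question": "What did the 10-K report about revenue?"})
```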
## Components

### Contextual Ranker

The ContextualReranker uses reranker models from [Contextual AI](https://contextual.ai/blog/introducing-instruction-following-reranker) to reorder documents by their relevance to a given query (a usage sketch follows this list).

- Uses the contextual-rerank models from Hugging Face for reranking.
- Supports custom instructions to refine query context during reranking.
- Uses model logits to score document relevance.
- Preserves document metadata during reranking.
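
A hypothetical usage sketch follows; the import path, constructor arguments, `rerank` signature, and model identifier are assumptions for illustration rather than the component's confirmed interface.

```python
# Hypothetical usage of the ContextualReranker component.
# The import path, constructor arguments, and model identifier are
# assumptions; consult the source for the real interface.
from langchain_core.documents import Document

from rag_pipelines.components.rerankers import ContextualReranker  # assumed path

reranker = ContextualReranker(
    model="ContextualAI/ctxl-rerank-en-v1-instruct",  # assumed model id
    instruction="Prioritize recent, peer-reviewed clinical evidence.",
    top_k=5,
)

docs = [
    Document(page_content="Randomized trial of statins...", metadata={"year": 2021}),
    Document(page_content="Blog post about cholesterol...", metadata={"year": 2015}),
]

reranked = reranker.rerank(
    query="Do statins reduce cardiovascular risk?",
    documents=docs,
)
for doc in reranked:
    print(doc.metadata, doc.page_content[:60])
```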
### Metadata Extractor

The MetadataExtractor extracts structured metadata from text using a language model and a user-specified JSON schema (see the sketch after this list).

- Uses LLMs with structured-output generation for metadata extraction.
- Dynamically converts the JSON schema into Pydantic models for type safety and validation.
- Includes only successfully extracted (non-null) fields in the results.
- Supports string, number, and boolean field types, with optional enums.
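
The sketch below illustrates the general pattern of schema-driven extraction with structured output; the field names, model choice, and input text are example assumptions, not the extractor's actual implementation.

```python
# Illustrative sketch of schema-driven metadata extraction: field definitions
# are turned into a Pydantic model, and a chat model with structured output
# fills it in. Field names and the model name are example assumptions.
from typing import Optional

from langchain_groq import ChatGroq
from pydantic import create_model

# Example field definitions derived from a JSON schema (hypothetical).
schema_fields = {
    "publication_year": (Optional[int], None),
    "study_type": (Optional[str], None),
}

# Dynamically build a Pydantic model from the field definitions.
MetadataFilter = create_model("MetadataFilter", **schema_fields)

llm = ChatGroq(model="llama-3.3-70b-versatile")  # assumed model name
extractor = llm.with_structured_output(MetadataFilter)

result = extractor.invoke(
    "Which randomized controlled trials published after 2020 studied statins?"
)

# Keep only fields the model actually extracted (non-null values).
metadata_filter = {k: v for k, v in result.model_dump().items() if v is not None}
print(metadata_filter)
```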
### Unstructured Document Loaders and Chunker
**UnstructuredAPIDocumentLoader**: Loads and transforms documents using the Unstructured API. It supports extracting text, tables, and images from various document formats.

**UnstructuredDocumentLoader**: Loads and transforms PDF documents using the Unstructured API with various processing strategies.
**UnstructuredChunker**: Chunks documents using strategies from the `unstructured` library, supporting the "basic" and "by_title" chunking approaches (a partitioning and chunking sketch follows the feature list below).
- Support for multiple document formats (PDF, DOCX, PPTX, etc.).
- Multiple processing strategies (hi_res, auto, fast).
- Configurable chunking with overlap and size parameters.
- Metadata preservation during document processing.
- Recursive directory processing for batch document loading.
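
A minimal sketch of partitioning and chunking a PDF with the `unstructured` library, illustrating the "by_title" strategy; the file path and parameter values are example assumptions.

```python
# Minimal sketch of partitioning and chunking a PDF with `unstructured`.
# The file path and parameter values are example assumptions.
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# Partition the document into elements (text, titles, tables, ...).
elements = partition_pdf(filename="docs/example_10k.pdf", strategy="fast")

# Chunk elements with the "by_title" strategy, keeping sections together.
chunks = chunk_by_title(elements, max_characters=1000, overlap=100)

for chunk in chunks:
    # Each chunk carries text plus metadata (filename, page number, ...).
    print(chunk.metadata.filename, chunk.text[:80])
```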
## Installation

The project uses [uv](https://github.com/astral-sh/uv) for dependency management. First, ensure uv is installed:

```bash
# Install uv (if not already installed)
pip install uv
```

Then install the project dependencies:
```bash
# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate
```
## Usage

### Environment Setup

Create a `.env` file in the project root with the required environment variables:

```env
GROQ_API_KEY=your_groq_api_key
MILVUS_URI=your_milvus_uri
MILVUS_TOKEN=your_milvus_token
UNSTRUCTURED_API_KEY=your_unstructured_api_key
```
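
Assuming the scripts read these variables with python-dotenv (an assumption about the setup), loading them looks like this:

```python
# Sketch of reading the .env variables, assuming python-dotenv is used.
import os

from dotenv import load_dotenv

load_dotenv()  # loads variables from .env into the process environment

milvus_uri = os.environ["MILVUS_URI"]
groq_api_key = os.environ["GROQ_API_KEY"]
```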
### Indexing

Each dataset module includes an indexing script that processes documents and stores them in the vector database.

Example for HealthBench:
```bash
cd src/rag_pipelines/healthbench
python healthbench_indexing.py
```
### RAG Evaluation

Each dataset module includes a RAG evaluation script to test pipeline performance.

Example for HealthBench:
```bash
cd src/rag_pipelines/healthbench
python healthbench_rag.py
```
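
The evaluation scripts score responses with DeepEval, as described in the Evaluation node above. A rough sketch of that scoring step follows; the example strings, thresholds, and metric selection are assumptions.

```python
# Illustrative sketch of DeepEval scoring as described in the Evaluation node.
# The example strings, thresholds, and metric configuration are assumptions.
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Do statins reduce cardiovascular risk?",
    actual_output="Yes, statins lower LDL cholesterol and reduce cardiovascular events.",
    expected_output="Statins reduce the risk of major cardiovascular events.",
    retrieval_context=[
        "Large randomized trials show statins reduce cardiovascular events."
    ],
)

metrics = [
    ContextualRecallMetric(threshold=0.7),
    ContextualPrecisionMetric(threshold=0.7),
    ContextualRelevancyMetric(threshold=0.7),
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
]

# Runs every metric against the test case and prints a score report.
evaluate(test_cases=[test_case], metrics=metrics)
```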
## Contributing

Please see the [CONTRIBUTING.md](CONTRIBUTING.md) file for detailed contribution guidelines.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.