# Analytical Queries over Multimodal Data

A comprehensive framework for evaluating analytical queries over multimodal data, containing the experiments for the paper "Analytical queries over multimodal data".
This repository provides a modular evaluation framework for comparing different reasoning systems on multimodal question-answering tasks. The framework supports various Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) systems, and evaluation metrics to analyze performance on analytical queries across different data modalities.
## Installation

### Option 1: Conda

```bash
# Create conda environment from environment.yml
conda env create -f environment.yml
conda activate multimodal-analytics
```

### Option 2: pip

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
## Configuration

Create a `.env` file in the root directory with your API keys:

```
OPENAI_API_KEY=your_openai_api_key_here
TOGETHER_API_KEY=your_together_api_key_here
# Add other API keys as needed
```
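If you want to verify that the keys load before running an evaluation, a quick check with `python-dotenv` (a common library for reading `.env` files; whether this repository loads them the same way is an assumption) looks like this:

```python
# check_env.py: sanity-check that API keys are readable from .env.
# Assumes the python-dotenv package (pip install python-dotenv);
# how this repository actually loads .env may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for key in ("OPENAI_API_KEY", "TOGETHER_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```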
## Reproducing Paper Results

To reproduce the results from the paper, you can use the pre-computed results in the `output/results/` directory. This allows you to verify the evaluation metrics without rerunning the entire evaluation process. The `plots` directory contains a notebook, `rag_cardinality.ipynb`, that generates the plots from the pre-computed results; edit the notebook to point to the correct results directory if needed.
## Usage

The main entry point is `evaluate.py`. Here are some example commands:
```bash
# Run vanilla RAG evaluation on the stirpot-300 dataset
python evaluate.py --method vanilla_rag --usecase stirpot-300 --evaluator f1 --topk 50

# Evaluate GPT-4 with verbose output
python evaluate.py --method gpt4 --usecase stirpot-300 --evaluator llm --verbose

# Reuse cached system outputs so that only the evaluation step reruns
python evaluate.py --method llama3_1_8b --usecase stirpot-300 --system-cache --verbose

# Limit the number of queries for quick testing
python evaluate.py --method mixtral --usecase stirpot-300 --query-limit 10
```
## Supported Systems

| System | Description |
|---|---|
| `vanilla_rag` | Basic Retrieval-Augmented Generation |
| `canvas_retrieve_context` | Contextual RAG system |
| `gpt3` | GPT-3.5 Turbo |
| `gpt4` | GPT-4 |
| `llama3_1_8b` | Llama 3.1 8B parameter model |
| `llama3_1_70b` | Llama 3.1 70B parameter model |
| `llama3_3` | Llama 3.3 model |
| `gemma3` | Google Gemma 3 |
| `mixtral` | Mixtral model |
## Evaluators

| Evaluator | Description |
|---|---|
| `f1` | F1 score evaluation |
| `precision` | Precision metric |
| `recall` | Recall metric |
| `breakdown` | Detailed breakdown analysis |
| `llm` | LLM-based evaluation |
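For reference, `precision`, `recall`, and `f1` relate in the usual way (F1 is the harmonic mean of the other two). A minimal token-overlap version is sketched below; the framework's evaluators may tokenize and aggregate differently:

```python
# Token-level precision/recall/F1 for a predicted vs. gold answer.
# Illustrative only: the repository's metric implementations may differ.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the answer is 42", "42"))  # 0.4
```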
## Datasets

- `stirpot-300`: A dataset with 300 questions for multimodal analytics
## Command-Line Options

- `--method`: Choose the reasoning system to evaluate
- `--usecase`: Select the dataset/workload
- `--evaluator`: Choose the evaluation metric (default: `llm`)
- `--topk`: Number of top-k documents for RAG retrieval (default: 50)
- `--query-limit`: Limit the number of queries to process
- `--system-cache`: Enable system-level caching
- `--result-cache`: Use cached evaluation results if available
- `--verbose`: Enable detailed output
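Because all configuration goes through these flags, parameter sweeps are easy to script. A minimal sketch that reruns the RAG evaluation across several top-k values, using only the flags documented above:

```python
# sweep_topk.py: rerun the vanilla RAG evaluation for several top-k values.
import subprocess

for k in (10, 25, 50, 100):
    subprocess.run(
        ["python", "evaluate.py",
         "--method", "vanilla_rag",
         "--usecase", "stirpot-300",
         "--evaluator", "f1",
         "--topk", str(k)],
        check=True,  # stop the sweep if any run fails
    )
```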
## Output

Evaluation results are automatically saved in the `output/results/{system}/` directory. Each result file contains:

- Question-answer pairs with predictions
- Evaluation metrics and scores
- Timestamp and configuration details

Results are saved in JSON format with filenames like:

```
{usecase}_{system}_{timestamp}.json
```

For RAG systems, the top-k parameter is included in the filename:

```
{usecase}_{system}_topk{k}_{timestamp}.json
```
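Since results are plain JSON, they are easy to post-process. A sketch that loads the newest result file for a given system and use case (the filename pattern is documented above; the fields inside the file are not specified here, so inspect them yourself):

```python
# load_latest.py: load the most recent result file for a system.
import json
from pathlib import Path

def load_latest(system: str, usecase: str) -> dict:
    results_dir = Path("output/results") / system
    # Lexicographic sort; assumes the timestamps sort chronologically.
    candidates = sorted(results_dir.glob(f"{usecase}_{system}_*.json"))
    if not candidates:
        raise FileNotFoundError(f"no results for {system} on {usecase}")
    return json.loads(candidates[-1].read_text())

result = load_latest("gpt4", "stirpot-300")
print(list(result.keys()))  # inspect the stored fields
```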
## Architecture

The framework is organized into several key components:
- Reasoners: Core reasoning systems (LLMs, RAG)
- Evaluators: Evaluation metrics and methods
- Retrievers: Document retrieval systems
- Chunkers: Text chunking strategies
- Featurizers: Feature extraction methods
- Storage: Vector storage backends (FAISS, etc.)
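These components compose roughly in pipeline order: chunkers split documents, featurizers embed the chunks, storage indexes the features, retrievers pull relevant chunks for the reasoners, and evaluators score the answers. The interfaces below are hypothetical and only sketch how the pieces fit together; the repository's actual class names and signatures may differ:

```python
# Hypothetical component interfaces; the repository's actual classes,
# methods, and signatures are likely different.
from abc import ABC, abstractmethod

class Chunker(ABC):
    @abstractmethod
    def chunk(self, document: str) -> list[str]: ...

class Retriever(ABC):
    @abstractmethod
    def retrieve(self, query: str, topk: int) -> list[str]: ...

class Reasoner(ABC):
    @abstractmethod
    def answer(self, query: str, context: list[str]) -> str: ...

class Evaluator(ABC):
    @abstractmethod
    def score(self, prediction: str, gold: str) -> float: ...
```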
## Supported Data Formats

The framework supports various data formats:

- PDF documents (via `pdf_chunker.py`)
- CSV files (via `csv_chunker.py`)
- Markdown documents (via `markdown_chunker.py`)
- Contextual chunking to improve retrieval (sketched below)
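Contextual chunking generally means prepending a short document-level description to each chunk before it is embedded, so that retrieval can match chunks whose local text lacks the surrounding context. A minimal sketch of the general technique, not this repository's implementation:

```python
# Minimal illustration of contextual chunking: prefix each chunk with a
# short document-level context string before embedding/indexing.
# Sketches the general technique, not this repository's implementation.
def contextual_chunks(text: str, doc_context: str, size: int = 500) -> list[str]:
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    return [f"[Context: {doc_context}]\n{chunk}" for chunk in chunks]

chunks = contextual_chunks(
    "Revenue grew 12% year over year...",
    doc_context="ACME Corp 2024 annual report, financial results",  # hypothetical
)
print(chunks[0])
```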
## Caching

- System Cache: Caches intermediate results from reasoning systems
- Result Cache: Stores final evaluation results for reuse
- Automatic cache management with timestamp-based organization
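Conceptually, the result cache amounts to checking for an existing file under the documented naming scheme before rerunning an evaluation. A hedged sketch of that idea; the framework's actual cache management is not shown here:

```python
# Sketch of a result-cache lookup based on the documented filename
# pattern {usecase}_{system}_{timestamp}.json; the framework's real
# cache logic may work differently.
from pathlib import Path

def cached_result(system: str, usecase: str) -> Path | None:
    hits = sorted((Path("output/results") / system).glob(f"{usecase}_{system}_*.json"))
    return hits[-1] if hits else None  # newest cached result, if any
```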
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Citation

If you use this framework in your research, please cite:

```bibtex
@article{multimodal_analytics,
  title={Analytical queries over multimodal data},
  author={MIT DB Group},
  year={2025}
}
```
## Contact

For questions or issues, please open a GitHub issue or contact [email protected].