
mmRAG: A Modular Benchmark for Retrieval-Augmented Generation over Text, Tables, and Knowledge Graphs

This repository provides the code for the paper "mmRAG: A Modular Benchmark for Retrieval-Augmented Generation over Text, Tables, and Knowledge Graphs".


🔗 Data

The mmRAG dataset is available on Hugging Face.

mmRAG Overview

Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models. However, existing RAG evaluation predominantly focuses on text retrieval and relies on opaque, end-to-end assessments of generated outputs. To address these limitations, we introduce mmRAG, a modular benchmark designed for evaluating multi-modal RAG systems. Our benchmark integrates queries from six diverse question-answering datasets spanning text, tables, and knowledge graphs, which we uniformly convert into retrievable documents. To enable direct, granular evaluation of individual RAG components—such as the accuracy of retrieval and query routing—beyond end-to-end generation quality, we follow standard information retrieval procedures to annotate document relevance and derive dataset relevance. We establish baseline performance by evaluating a wide range of RAG implementations on mmRAG.

In this repository, we provide the source code for constructing the mmRAG dataset and for running the experiments reported in our paper.

🧪 Experiments

We provide our dataset and annotation results in the data/ folder. The source code for the experiments is in the mmrag_experiments/ folder.

Retrieval:

Setup:

We use BM25, Contriever, DPR, bge-large-en-v1.5, and gte-large-en-v1.5 as example retrievers. For BM25, a separate Python environment with a JDK and Pyserini is required.

For the other retrievers, install all required packages with pip install -r retrieval_requirements.txt.

Build Index and Retrieve

For each dense retriever, an index must be built over the document chunks. In mmrag_experiments/retrievers/{your_retriever} there are two Python scripts, build_index.py and retrieve.py. Run build_index.py to generate the embeddings and a FAISS index under ./cache/ in each folder, then run retrieve.py to produce the retrieval result result_{your_retriever}.json.
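The scripts differ per retriever, but the overall flow is the same. Here is a minimal sketch of that flow, assuming a sentence-transformers-compatible encoder and an inner-product FAISS index; the file name chunks.jsonl and the exact model identifier are illustrative, not the repository's actual configuration.

```python
# Minimal sketch: embed chunks, build a FAISS index, and retrieve top-k chunks per query.
# Assumptions (not the repository's exact setup): chunks.jsonl with {"id", "text"} per line,
# a sentence-transformers-compatible encoder, normalized embeddings + inner product.
import json
import os
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # example retriever

# 1) Build the index (analogous to build_index.py).
with open("chunks.jsonl", encoding="utf-8") as f:
    chunks = [json.loads(line) for line in f]
embeddings = model.encode([c["text"] for c in chunks],
                          batch_size=64, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])          # cosine similarity via inner product
index.add(np.asarray(embeddings, dtype="float32"))
os.makedirs("cache", exist_ok=True)
faiss.write_index(index, "cache/chunks.faiss")

# 2) Retrieve (analogous to retrieve.py).
query_emb = model.encode(["who founded the company?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_emb, dtype="float32"), k=10)
results = [{"chunk_id": chunks[i]["id"], "score": float(s)}
           for i, s in zip(ids[0], scores[0])]
print(json.dumps(results, indent=2))
```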

Fine-tune models

Fine-tuned retrieval models can be trained using FlagEmbedding.

Evaluation

You can run mmrag_experiments/eval.py to evaluate the retrievers. Our retrieval results are shown in the evaluation figure included in the repository.
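eval.py implements the actual metrics; as an illustration of how graded relevance labels turn into a retrieval score, here is a small nDCG@k sketch. The JSON layout assumed here (query id mapped to ranked chunk ids and to graded labels) is for the example only, not necessarily the repository's file format.

```python
# Sketch: nDCG@k over retrieval results, assuming
#   results[qid] = ranked list of chunk ids
#   qrels[qid]   = {chunk_id: graded relevance label}
# This mirrors standard IR evaluation, not necessarily eval.py's exact metrics.
import math

def ndcg_at_k(ranked_ids, labels, k=10):
    gains = [labels.get(cid, 0) for cid in ranked_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(labels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def mean_ndcg(results, qrels, k=10):
    scores = [ndcg_at_k(ranked, qrels.get(qid, {}), k) for qid, ranked in results.items()]
    return sum(scores) / len(scores) if scores else 0.0
```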

Generation:

You can run mmrag_experiments/exp/generation_exp.py to run the generation experiment. In load_test_data(), set the LLM, e.g. glm_model = llm.GLM("glm-4-plus"), and remember to fill in information such as the API key in llm.py.
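Whichever LLM wrapper you plug in, the generation step boils down to packing the retrieved chunks and the query into a prompt. The sketch below is only illustrative; build_rag_prompt and the call_llm placeholder are hypothetical names, not functions from llm.py or generation_exp.py.

```python
# Sketch of the RAG generation step: retrieved chunks + query -> prompt -> answer.
# build_rag_prompt and call_llm are hypothetical; the repository's llm.py defines
# the real wrappers (e.g. llm.GLM) and holds the API key.
def build_rag_prompt(query, retrieved_chunks, max_chunks=5):
    context = "\n\n".join(
        f"[{i + 1}] {chunk['text']}" for i, chunk in enumerate(retrieved_chunks[:max_chunks])
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def answer_query(query, retrieved_chunks, call_llm):
    """call_llm is any function mapping a prompt string to a completion string."""
    return call_llm(build_rag_prompt(query, retrieved_chunks))
```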

Generation with specific-dataset chunks:

Also in mmrag_experiments/exp/generation_exp.py, in load_test_data(), you can choose whether to limit the provided chunks to a specific source dataset.
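Conceptually, restricting generation to one source dataset just means filtering the retrieved chunks before prompting. A minimal sketch, assuming each chunk record carries its source dataset name (the field name "source" is hypothetical):

```python
# Sketch: keep only retrieved chunks that originate from one source dataset.
# The "source" field is a hypothetical name for wherever a chunk's dataset is stored.
def limit_to_dataset(retrieved_chunks, dataset_name):
    return [c for c in retrieved_chunks if c.get("source") == dataset_name]

# e.g. only use OTT chunks when answering:
# ott_chunks = limit_to_dataset(chunks, "ott")
```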

Generation with query router:

You can run query-routing tests in mmrag_experiments/exp/routing. We provide three basic routers as examples: mmrag_experiments/exp/routing/oracle_router.py, mmrag_experiments/exp/routing/llm_router.py, and mmrag_experiments/exp/routing/semantic_router.py, where oracle_router uses the dataset-level labels in mmRAG. You can implement a custom router and save its routing results in the corresponding JSON files. After the routing results are generated, the generation experiments can be run with generation_exp.py.
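As an illustration of the routing idea (not the repository's semantic_router.py), the sketch below routes a query to the source dataset whose description it is most similar to in embedding space; the one-line dataset descriptions are made up for the example.

```python
# Sketch of a simple embedding-based query router (illustrative, not semantic_router.py).
# It routes each query to the source dataset whose short description is most similar.
import numpy as np
from sentence_transformers import SentenceTransformer

DATASET_DESCRIPTIONS = {  # made-up one-line descriptions, for illustration only
    "nq": "open-domain questions answered by Wikipedia text passages",
    "triviaqa": "trivia questions answered by web and Wikipedia text",
    "ott": "questions requiring both tables and linked text",
    "tat": "numerical questions over financial tables and text",
    "cwq": "complex multi-hop questions over the Freebase knowledge graph",
    "webqsp": "simple questions over the Freebase knowledge graph",
}

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
names = list(DATASET_DESCRIPTIONS)
desc_emb = model.encode([DATASET_DESCRIPTIONS[n] for n in names], normalize_embeddings=True)

def route(query, top_k=2):
    """Return the top_k most plausible source datasets for a query."""
    q = model.encode([query], normalize_embeddings=True)
    scores = (q @ desc_emb.T)[0]
    order = np.argsort(-scores)[:top_k]
    return [(names[i], float(scores[i])) for i in order]
```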

🧾 mmRAG Construction

mmRAG Dataset Construction

Note: Due to the randomness of LLM generation, the selected queries may not be exactly the same as those in our released dataset.

Source Dataset download

In mmRAG, we use NQ, TriviaQA, OTT, TAT, CWQ, WebQSP as data sources.

NQ and TriviaQA can be accessed from Hugging Face, using load_dataset() from the datasets package.
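For example, a minimal sketch of loading the two text QA sources with the datasets library; the dataset and configuration names shown (e.g. "rc.nocontext" for TriviaQA) are standard Hub identifiers and may differ from what dataset_loader.py actually uses.

```python
# Sketch: load NQ and TriviaQA via the Hugging Face datasets library.
# The dataset/config names are standard Hub identifiers; dataset_loader.py may use
# different configurations or preprocessed versions. Note that the full NQ download is large.
from datasets import load_dataset

nq = load_dataset("natural_questions", split="validation")
triviaqa = load_dataset("trivia_qa", "rc.nocontext", split="validation")

print(nq[0]["question"]["text"])   # NQ stores the question as a nested dict
print(triviaqa[0]["question"])     # TriviaQA stores the question as a plain string
```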

The OTT and TAT datasets can be downloaded from their websites.

CWQ, WebQSP, and the knowledge graph (Freebase) can be downloaded and set up following ChatKBQA.

dataset_loader.py provides the basic functions to load and process the datasets. After downloading all of these datasets and starting the Freebase SPARQL dump, modify the dataset_path and path values in the load_dataset function of the corresponding class. For CWQ_reader and WebQSP_reader, the kb_dump path also needs to be filled in. You can run dataset_loader.py to test whether all datasets are loaded correctly.
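The KGQA sources require the Freebase SPARQL endpoint to be reachable. Below is a minimal connectivity check, assuming the Virtuoso endpoint that the ChatKBQA-style Freebase setup typically exposes on localhost; the port 3001 is an assumption, so use whatever your dump is configured with.

```python
# Sketch: sanity-check the local Freebase SPARQL endpoint used by CWQ_reader / WebQSP_reader.
# The endpoint URL (port 3001) follows a common ChatKBQA/Freebase setup and is an assumption;
# adjust it to your own deployment.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:3001/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX ns: <http://rdf.freebase.com/ns/>
    SELECT ?name WHERE {
        ns:m.0d05w3 ns:type.object.name ?name .   # m.0d05w3 is the Freebase MID for "China"
        FILTER (lang(?name) = 'en')
    } LIMIT 1
""")
results = sparql.query().convert()
print(results["results"]["bindings"])
```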

Query Selection, Document Representation and Collection

Run filter.py to generate the queries for each source dataset, together with the corresponding documents and random "chaos" (distractor) documents.

Run add_chaos_and_process_kg_documents.py to chunk documents.

Pooling

Note: BM25 requires a different Python environment from BGE.

BM25

In the Pyserini/JDK environment, run the scripts in mmrag_construction/pooling/bm25: first mmrag_construction/pooling/bm25/file_perpare.py, then bash generate_index.sh, and finally mmrag_construction/pooling/bm25/bm25query.py.
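For reference, querying a prebuilt Lucene index with Pyserini looks roughly like the sketch below; the index path indexes/mmrag_chunks is a placeholder, not the path generate_index.sh actually produces.

```python
# Sketch: BM25 retrieval with Pyserini over a Lucene index built by generate_index.sh.
# The index path and pool depth are placeholders; bm25query.py defines the real parameters.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/mmrag_chunks")   # placeholder index directory
hits = searcher.search("who founded the company?", k=100)
pool = [(hit.docid, hit.score) for hit in hits]     # candidate pool for annotation
```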

BGE

Run mmrag_construction/pooling/bge/bge_query.py with BM25 results.
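Pooling then unions the top-ranked candidates from both retrievers for each query before annotation. A minimal sketch of that merge, assuming per-query result dicts (the layout is an assumption, not the scripts' actual format):

```python
# Sketch: pool the top-k BM25 and BGE candidates per query for annotation.
# Assumes results are dicts mapping query id -> ranked list of chunk ids.
def build_pool(bm25_results, bge_results, depth=20):
    pool = {}
    for qid in set(bm25_results) | set(bge_results):
        candidates = bm25_results.get(qid, [])[:depth] + bge_results.get(qid, [])[:depth]
        pool[qid] = list(dict.fromkeys(candidates))   # de-duplicate, keep order
    return pool
```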

LLM Annotation

Run mmrag_construction/llm_marker/mark.py to generate the chunk-level annotations.
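The annotation step asks an LLM to grade how relevant each pooled chunk is to its query. The sketch below shows one way such a grading prompt could look; the scale and wording here are illustrative, and mark.py defines the actual instructions.

```python
# Sketch of an LLM relevance-grading prompt (illustrative; mark.py defines the real one).
# grade_chunk expects any call_llm(prompt) -> str function, e.g. a wrapper from llm.py.
def grade_chunk(query, chunk_text, call_llm):
    prompt = (
        "Grade how relevant the document is to the query on a 0-3 scale:\n"
        "0 = irrelevant, 1 = marginally related, 2 = relevant, 3 = fully answers the query.\n\n"
        f"Query: {query}\n\nDocument:\n{chunk_text}\n\n"
        "Reply with a single digit."
    )
    reply = call_llm(prompt).strip()
    return int(reply[0]) if reply and reply[0].isdigit() else 0
```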

Final processing

After the LLM annotation, run mmrag_construction/generate_dataset/add_dataset_score.py to generate the dataset-level relevance labels. Then run mmrag_construction/generate_dataset/gen_dataset.py to generate the JSON file of the dataset, and use partition.py to make a train/dev/test split.
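Conceptually, a dataset-level label aggregates a query's chunk-level labels over each source dataset. The aggregation below (maximum chunk label per dataset) is one simple choice for illustration; add_dataset_score.py may use a different rule.

```python
# Sketch: derive dataset-level relevance from chunk-level labels.
# Uses the maximum chunk label per source dataset; add_dataset_score.py may aggregate differently.
from collections import defaultdict

def dataset_relevance(chunk_labels, chunk_to_dataset):
    """chunk_labels: {chunk_id: graded label}; chunk_to_dataset: {chunk_id: dataset name}."""
    scores = defaultdict(int)
    for chunk_id, label in chunk_labels.items():
        ds = chunk_to_dataset[chunk_id]
        scores[ds] = max(scores[ds], label)
    return dict(scores)
```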
