ReQAP

Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data

Description
Code
Feedback
License
Acknowledgements

Description

This repository contains the code for our ACL 2025 (Findings) paper on "Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data".

Question answering over mixed sources, like text and tables, has been advanced by verbalizing all contents and encoding them with a language model. A prominent case of such heterogeneous data is personal information: user devices log vast amounts of data every day, such as calendar entries, workout statistics, shopping records, streaming history, and more. Information needs range from simple look-ups to analytical queries. The challenge is to provide humans with convenient access with a small footprint, so that all personal data stays on the user devices. We present ReQAP, a novel method that creates an executable operator tree for a given question via recursive decomposition. Operators are designed to enable seamless integration of structured and unstructured sources, and the execution of the operator tree yields a traceable answer. We further release the PerQA benchmark, with persona-based data and questions, covering a diverse spectrum of realistic user needs.

If you use this code, please cite:

@inproceedings{christmann2025recursive,
  title={Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data},
  author={Christmann, Philipp and Weikum, Gerhard},
  booktitle={ACL 2025 Findings},
  year={2025}
}

Code

System requirements

All code was tested on Linux only.

Conda
PyTorch
GPU (for training)
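
A quick pre-flight check for these requirements could look like the sketch below. It is not part of the repository: the helper names `have` and `check_requirements` are hypothetical, and using `nvidia-smi` as a proxy for an available GPU is an assumption that only holds for NVIDIA setups. PyTorch is not checked here, since it is expected to come from the conda environment.

```shell
# Hypothetical helper: succeeds if a command is available on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Report which of the listed requirements appear to be available.
check_requirements() {
    if have conda; then
        echo "conda: found"
    else
        echo "conda: missing"
    fi
    # Assumption: an NVIDIA GPU setup exposes the nvidia-smi tool.
    if have nvidia-smi; then
        echo "gpu: found"
    else
        echo "gpu: not detected (needed for training)"
    fi
}

check_requirements
```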

Installation

We recommend the installation via conda, and provide the corresponding environment file in conda-reqap.yml:

    git clone https://github.com/PhilippChr/ReQAP.git
    cd ReQAP/
    conda env create --file conda-reqap.yml
    conda activate reqap
    pip install -e .

Alternatively, you can install the requirements via pip, using the requirements.txt or requirements-cpu.txt file. In this case, further packages might be required to run the code on a GPU.

To initialize the repo (download data, benchmark, models), run:

bash scripts/initialize.sh

ReQAP - Inference

ReQAP SFT

Run QUD stage

bash scripts/pipeline.sh --qud-test config/perqa/reqap_sft.yml  # much faster with GPU

Run OTX stage

bash scripts/pipeline.sh --otx-test config/perqa/reqap_sft.yml  # much faster with GPU

ReQAP with LLaMA

Run QUD stage

bash scripts/pipeline.sh --qud-test config/perqa/reqap_llama.yml  # much faster with GPU

Run OTX stage

bash scripts/pipeline.sh --otx-test config/perqa/reqap_llama.yml  # much faster with GPU

ReQAP with GPT

Add your OpenAI credentials in config/perqa/reqap_openai.yml

Run QUD stage

bash scripts/pipeline.sh --qud-test config/perqa/reqap_openai.yml

Run OTX stage

bash scripts/pipeline.sh --otx-test config/perqa/reqap_openai.yml  # much faster with GPU
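
All three variants follow the same two-step pattern (QUD stage, then OTX stage) and differ only in the config file. The sketch below wraps that pattern in one function. The names `run` and `reqap_inference` and the `DRY_RUN` variable are hypothetical and not part of the repository. By default the commands are only printed, so the sketch can be tried without the repository being initialized.

```shell
# Print a command in dry-run mode (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run the QUD stage, then the OTX stage, for one variant: sft, llama, or openai.
reqap_inference() {
    variant="$1"
    case "$variant" in
        sft|llama|openai) ;;
        *) echo "unknown variant: $variant" >&2; return 1 ;;
    esac
    config="config/perqa/reqap_${variant}.yml"
    # Stop before the OTX stage if the QUD stage fails.
    run bash scripts/pipeline.sh --qud-test "$config" || return 1
    run bash scripts/pipeline.sh --otx-test "$config"
}

reqap_inference sft
```

Setting `DRY_RUN=0` before calling `reqap_inference` executes the two commands instead of printing them.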

ReQAP - Full training procedure

This requires adding your OpenAI credentials in config/perqa/reqap_openai.yml. Alternatively, replace config/perqa/reqap_openai.yml with config/perqa/reqap_llama.yml in the commands below.

Run QUD stage via ICL on all train + dev questions

Run QUD-ICL on train set

bash scripts/pipeline.sh --create_qu_plans-train config/perqa/reqap_openai.yml  # requires GPU/API

Run QUD-ICL on dev set

bash scripts/pipeline.sh --create_qu_plans-dev config/perqa/reqap_openai.yml  # requires GPU/API

Train RETRIEVE and EXTRACT operators

RETRIEVE

Detect all RETRIEVE calls

bash scripts/run_retrieval.sh --derive_retrieve_calls config/perqa/reqap_openai.yml

Construct SPLADE indices

bash scripts/run_retrieval.sh --construct_index config/perqa/reqap_openai.yml  # much faster with GPU

Prepare data for training RETRIEVE models

bash scripts/prepare_retrieval_data.sh config/perqa/reqap_openai.yml  # CPU: runs 14 parallel scripts

Merge RETRIEVE training data for all personas

bash scripts/merge_retrieval_data.sh data/training_data/perqa

Train RETRIEVE models of size L (default)

bash scripts/run_retrieval.sh --train_ce_events config/perqa/training/reqap_ce_events-ms-marco-MiniLM-L-12.yml  # requires GPU
bash scripts/run_retrieval.sh --train_ce_patterns config/perqa/training/reqap_ce_patterns-ms-marco-MiniLM-L-12.yml  # requires GPU

[OPTIONAL] Train RETRIEVE models of size M

bash scripts/run_retrieval.sh --train_ce_events config/perqa/training/reqap_ce_events-ms-marco-MiniLM-L-6.yml  # requires GPU
bash scripts/run_retrieval.sh --train_ce_patterns config/perqa/training/reqap_ce_patterns-ms-marco-MiniLM-L-6.yml  # requires GPU

[OPTIONAL] Train RETRIEVE models of size S

bash scripts/run_retrieval.sh --train_ce_events config/perqa/training/reqap_ce_events-ms-marco-MiniLM-L-2.yml  # requires GPU
bash scripts/run_retrieval.sh --train_ce_patterns config/perqa/training/reqap_ce_patterns-ms-marco-MiniLM-L-2.yml  # requires GPU

[OPTIONAL] Train RETRIEVE models of size XS

bash scripts/run_retrieval.sh --train_ce_events config/perqa/training/reqap_ce_events-ms-marco-TinyBERT-L-2.yml  # requires GPU
bash scripts/run_retrieval.sh --train_ce_patterns config/perqa/training/reqap_ce_patterns-ms-marco-TinyBERT-L-2.yml  # requires GPU
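
The four sizes differ only in the cross-encoder named in the config file. As a sketch, the mapping from size to training commands can be written as below. The function names are hypothetical, and the model names are taken from the config paths listed above. The commands are printed rather than executed.

```shell
# Map a RETRIEVE model size (L, M, S, XS) to the cross-encoder name
# that appears in the config file names above.
retrieve_model_for_size() {
    case "$1" in
        L)  echo "ms-marco-MiniLM-L-12" ;;
        M)  echo "ms-marco-MiniLM-L-6" ;;
        S)  echo "ms-marco-MiniLM-L-2" ;;
        XS) echo "ms-marco-TinyBERT-L-2" ;;
        *)  echo "unknown size: $1" >&2; return 1 ;;
    esac
}

# Print the two training commands (events and patterns) for a given size.
retrieve_train_commands() {
    model="$(retrieve_model_for_size "$1")" || return 1
    for kind in events patterns; do
        echo "bash scripts/run_retrieval.sh --train_ce_${kind} config/perqa/training/reqap_ce_${kind}-${model}.yml"
    done
}

retrieve_train_commands L
```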

EXTRACT

Derive EXTRACT calls with related attributes

bash scripts/run_extract.sh --derive_attributes config/perqa/reqap_openai.yml

Identify aliases for keys in EXTRACT calls

bash scripts/run_extract.sh --derive_attribute_mappings config/perqa/reqap_openai.yml  # requires GPU/API

Derive training data for EXTRACT model

bash scripts/run_extract.sh --derive_data config/perqa/reqap_openai.yml

Train EXTRACT model of size L (default)

bash scripts/run_extract.sh --train config/perqa/training/reqap_extract-bart-base.yml  # requires GPU

[OPTIONAL] Train EXTRACT model of size M

bash scripts/run_extract.sh --train config/perqa/training/reqap_extract-bart-small.yml  # requires GPU

[OPTIONAL] Train EXTRACT model of size S

bash scripts/run_extract.sh --train config/perqa/training/reqap_extract-t5-efficient-mini.yml  # requires GPU

[OPTIONAL] Train EXTRACT model of size XS

bash scripts/run_extract.sh --train config/perqa/training/reqap_extract-t5-efficient-tiny.yml  # requires GPU

Derive data for model distillation

[OPTION 1] Identify correct operator trees (in single runs)

bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml  # requires GPU
bash scripts/pipeline.sh --loop-dev config/perqa/reqap_openai.yml  # requires GPU

[OPTION 2] Identify correct operator trees (run individually per persona)

bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_0  # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_1  # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_2  # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_3  # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_4  # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_5  # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_6  # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_7  # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_8  # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_9  # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_10  # requires GPU
bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_11  # requires GPU
bash scripts/pipeline.sh --loop-dev config/perqa/reqap_openai.yml dev_persona_0  # requires GPU
bash scripts/pipeline.sh --loop-dev config/perqa/reqap_openai.yml dev_persona_1  # requires GPU
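
The per-persona commands above can also be generated with a loop. The sketch below assumes exactly the 12 train personas and 2 dev personas listed above. `persona_loop_commands` is a hypothetical name, not a script in the repository. It only prints the commands, so that each line can be launched separately.

```shell
CONFIG="config/perqa/reqap_openai.yml"

# Print one --loop-train command per train persona (0-11)
# and one --loop-dev command per dev persona (0-1).
persona_loop_commands() {
    for i in 0 1 2 3 4 5 6 7 8 9 10 11; do
        echo "bash scripts/pipeline.sh --loop-train ${CONFIG} train_persona_${i}"
    done
    for i in 0 1; do
        echo "bash scripts/pipeline.sh --loop-dev ${CONFIG} dev_persona_${i}"
    done
}

persona_loop_commands
```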

[OPTION 3] Identify correct operator trees for single train persona (much faster, less training data)

bash scripts/pipeline.sh --loop-train config/perqa/reqap_openai.yml train_persona_0  # requires GPU
bash scripts/pipeline.sh --loop-dev config/perqa/reqap_openai.yml dev_persona_0  # requires GPU
bash scripts/pipeline.sh --loop-dev config/perqa/reqap_openai.yml dev_persona_1  # requires GPU

Merge data for model distillation
```
bash scripts/merge_qu_data.sh
```

Train QUD stage

Derive training data for QUD stage

bash scripts/run_qu.sh --derive_data config/perqa/reqap_sft.yml

Train QUD model of size M (default)

bash scripts/run_qu.sh --train config/perqa/training/reqap_qu-causal-llama1b.yml  # requires GPU

[OPTIONAL] Train QUD model of size L

bash scripts/run_qu.sh --train config/perqa/training/reqap_qu-causal-llama3b.yml  # requires GPU

[OPTIONAL] Train QUD model of size S

bash scripts/run_qu.sh --train config/perqa/training/reqap_qu-causal-hf-smollm2-360M.yml  # requires GPU

[OPTIONAL] Train QUD model of size XS

bash scripts/run_qu.sh --train config/perqa/training/reqap_qu-causal-hf-smollm2-135M.yml  # requires GPU

Run ReQAP (SFT) inference

Run QUD stage

bash scripts/pipeline.sh --qud-test config/perqa/reqap_sft.yml  # much faster with GPU

Run OTX stage

bash scripts/pipeline.sh --otx-test config/perqa/reqap_sft.yml  # much faster with GPU

Baselines

RAG - Inference

RAG SFT (requires following training below)

Run retrieval

bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml test  # much faster with GPU

Run generation

bash scripts/rag.sh --test config/perqa/rag_sft.yml  # requires GPU

RAG with LLaMA

Run retrieval

bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml test  # much faster with GPU

Run generation

bash scripts/rag.sh --test config/perqa/rag_llama.yml  # requires GPU

RAG with GPT

Add your OpenAI credentials in config/perqa/rag_openai.yml

Run retrieval

bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml test  # much faster with GPU

Run generation

bash scripts/rag.sh --test config/perqa/rag_openai.yml  # requires GPU

RAG - Full training procedure

Train retrieval for RAG baseline
- Derive training data (uses the ReQAP retrieval training data, which is assumed to exist already)
```
bash scripts/rag.sh --ce_derive_data config/perqa/rag_openai.yml
```
- Train the cross-encoder
```
bash scripts/rag.sh --ce_train config/perqa/rag_openai.yml  # requires GPU
```

Run retrieval inference

bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml train  # much faster with GPU
bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml dev  # much faster with GPU
bash scripts/rag.sh --retrieve config/perqa/rag_openai.yml test  # much faster with GPU

Train answering model

Derive training data

bash scripts/rag.sh --derive_data config/perqa/rag_sft.yml

Train model

bash scripts/rag.sh --train config/perqa/rag_sft.yml  # requires GPU

Inference

bash scripts/rag.sh --test config/perqa/rag_sft.yml  # requires GPU

Query Generation - Inference

CodeGen SFT (requires following training below)

bash scripts/query_generation.sh --test config/perqa/query_generation_sft.yml # requires GPU

CodeGen with LLaMA

bash scripts/query_generation.sh --test config/perqa/query_generation_llama.yml # requires GPU

CodeGen with GPT

Add your OpenAI credentials in config/perqa/query_generation_openai.yml

bash scripts/query_generation.sh --test config/perqa/query_generation_openai.yml # requires GPU

Query Generation - Full training procedure

Prepare training data

bash scripts/query_generation.sh --derive_data config/perqa/query_generation_sft.yml

Train translation model

bash scripts/query_generation.sh --train config/perqa/query_generation_sft.yml # requires GPU

Inference

bash scripts/query_generation.sh --test config/perqa/query_generation_sft.yml # requires GPU

Feedback

We tried our best to document the code of this project and make it easy to use. If you feel that some parts of the documentation or code could be improved, or have other feedback, please do not hesitate to let us know!

You can contact us via mail: [email protected]. Any feedback (also positive ;) ) is much appreciated!

License

The ReQAP project by Philipp Christmann and Gerhard Weikum is licensed under the MIT license.

Acknowledgements

Our retrieval utilizes SPLADE (https://github.com/naver/splade). We adapt parts of their code in this repository.
