Retrieval prototype

Task 📖

Given a question, find the Wikipedia article that best answers it.

Approach 🔍

For this task, we will primarily explore semantic search with the following models:

Baseline: sentence-transformers/all-mpnet-base-v2 (109M)

Recommended by the SBERT documentation to provide the best quality.
Its small size (109M parameters) makes it computationally efficient, so it serves as the baseline.

Large-Scale Model: sentence-transformers/gtr-t5-xxl (4.86B)

A powerful dual-encoder model (GTR, the Generalizable T5-based Retriever from Google Research).
Used to test the hypothesis that a larger model is better able to capture document chunk semantics, leading to improved retrieval.
Cons: Extremely large and requires significant resources; index creation takes nearly 2 hours on an A100 GPU.

Mid-Sized Model: BAAI/bge-large-en-v1.5 (335M)

Strong performance on the MTEB leaderboard.
Smaller than gtr-t5-xxl but expected to be competitive in retrieval.
Aims to provide an efficient alternative to the large-scale model.

In this project, we use a mixture of LangChain and FAISS to perform retrieval. The index type is IndexFlatL2, so the similarity metric is Euclidean (L2) distance.
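To make the index choice concrete: a flat L2 index compares the query against every stored vector and keeps the closest ones, and FAISS reports squared L2 distances for this index type. The sketch below is a plain-Python stand-in for that behaviour, not FAISS or LangChain code; the helper names and the toy 3-dimensional vectors are illustrative assumptions.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(index_vectors, query, k):
    """Exhaustively score every indexed vector and return the k nearest
    as (squared_distance, position) pairs, closest first."""
    scored = ((squared_l2(vec, query), pos) for pos, vec in enumerate(index_vectors))
    return heapq.nsmallest(k, scored)


# Made-up "embeddings" for three document chunks (assumption for illustration).
chunks = ["Bucharest", "Romania", "Paris"]
embeddings = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 0.0, 1.0]]
query_embedding = [1.0, 0.1, 0.0]

results = flat_l2_search(embeddings, query_embedding, k=2)
top_titles = [chunks[pos] for _, pos in results]
print(top_titles)
```

Because the search is exhaustive, it is exact but scales linearly with the number of chunks; smaller distances mean closer matches.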

Evaluation metrics 📐

We use top-1 accuracy as the evaluation metric, because the train and test sets specify only one best Wikipedia article per question. We also visually inspect a few samples of the retrieved results.
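As a minimal sketch of the metric, assuming each evaluation record carries a `gold_article` field and a `retrieved_docs` list with `rank` and `title` fields (the field names follow the sample record shown later in this README; the function name and the toy records are hypothetical):

```python
def top1_accuracy(records):
    """Fraction of records whose rank-1 retrieved title equals the gold article."""
    if not records:
        return 0.0
    correct = 0
    for record in records:
        docs = record["retrieved_docs"]
        if not docs:
            continue  # nothing retrieved counts as a miss
        top_doc = min(docs, key=lambda d: d["rank"])
        correct += top_doc["title"] == record["gold_article"]
    return correct / len(records)


# Toy records (made up for illustration).
records = [
    {"gold_article": "Bucharest",
     "retrieved_docs": [{"rank": 1, "title": "Romania"}, {"rank": 2, "title": "Bucharest"}]},
    {"gold_article": "Paris",
     "retrieved_docs": [{"rank": 1, "title": "Paris"}]},
    {"gold_article": "Moon",
     "retrieved_docs": [{"rank": 1, "title": "Moon"}]},
]
accuracy = top1_accuracy(records)
print(f"{accuracy:.2%}")
```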

Optimizations ⏳

Reranking

Motivated by the observation from the single-embedding experiments that the gold article is often somewhere in the top 5 retrieved documents.
Uses a second-stage reranker (BAAI/bge-reranker-large) to refine the ordering of retrieved documents and improve the likelihood of the best article ranking at position #1.
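The two-stage idea can be sketched as follows. The scoring function here is a toy token-overlap measure standing in for a cross-encoder such as bge-reranker-large; it is an assumption for illustration only and does not reproduce the real model or this project's code.

```python
def toy_relevance(query, passage):
    """Hypothetical stand-in for a cross-encoder score:
    the fraction of query tokens that also appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rerank(query, candidates, score_fn, top_n=None):
    """Reorder first-stage candidates, given as (title, passage) pairs,
    by descending second-stage score."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c[1]), reverse=True)
    return ranked[:top_n] if top_n is not None else ranked


query = "largest city in romania"
# First-stage order as it might come back from the embedding retriever (made up).
first_stage = [
    ("Romania", "romania is a secular state with no national religion"),
    ("Bucharest", "bucharest is the capital and largest city in romania"),
    ("Cluj-Napoca", "cluj-napoca is a city in north-western romania"),
]
reranked = rerank(query, first_stage, toy_relevance)
best_title = reranked[0][0]
print(best_title)
```

The reranker only reorders what the first stage returns, so it can help only when the gold article is already among the retrieved candidates.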

Hybrid search

Helps address cases where semantic models alone may not retrieve the most relevant article.
Combines semantic search with BM25 to leverage both lexical and semantic signals.
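This README does not state how the two rankings are merged. One common option is reciprocal rank fusion (RRF), sketched below purely as an assumption; the function name, the constant `k=60`, and the toy rankings are illustrative and not taken from this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids.
    Each document receives 1 / (k + rank) from every list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up rankings from the two retrievers.
semantic_ranking = ["Romania", "Bucharest", "Cluj-Napoca"]
bm25_ranking = ["Bucharest", "Craiova", "Romania"]

fused = reciprocal_rank_fusion([semantic_ranking, bm25_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever is pushed down.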

Results ✏️

Table 1: Evaluation results from train.jsonl.

Method	Accuracy
Single Retriever
`all-mpnet-base-v2`	48.12%
`gtr-t5-xxl`	52.57%
`bge-large-en-v1.5`	53.98%
Hybrid Search
`gtr-t5-xxl + BM25`	47.74%
`bge-large-en-v1.5 + BM25`	50.01%
Reranking
`gtr-t5-xxl + bge-reranker-large`	56.47%
`bge-large-en-v1.5 + bge-reranker-large`	57.96%

From the results, we see that using bge-large-en-v1.5 to retrieve the top 15 documents and then reranking them with bge-reranker-large yields the best result: 57.96% accuracy.

Findings

Reranking boosts retrieval performance substantially. Adding a reranker improves gtr-t5-xxl accuracy by approximately 4 percentage points (52.57% to 56.47%) and bge-large-en-v1.5 accuracy by approximately 4 percentage points (53.98% to 57.96%).
Using a bigger model does not necessarily yield better results. At 4.86B parameters, gtr-t5-xxl performs worse than bge-large-en-v1.5, which has only 335M. Index build times are: all-mpnet-base-v2 (~10 min) < bge-large-en-v1.5 (~30 min) < gtr-t5-xxl (~2 h). The time-versus-accuracy tradeoff for gtr-t5-xxl is therefore poor.
Hybrid search does not improve search performance; it degrades it by roughly 4 to 5 percentage points. Adding BM25 to gtr-t5-xxl lowers accuracy from 52.57% to 47.74%. Similarly, when BM25 is added to bge-large-en-v1.5, accuracy drops from 53.98% to 50.01%.
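The changes quoted in these findings can be recomputed from the reported accuracies. The short sketch below does only that arithmetic; the 57.96% figure for bge-large-en-v1.5 with reranking is the one quoted in the results summary, and the dictionary layout is an illustrative assumption.

```python
# Reported top-1 accuracies (percent) on train.jsonl.
reported = {
    "gtr-t5-xxl": {"single": 52.57, "hybrid": 47.74, "rerank": 56.47},
    "bge-large-en-v1.5": {"single": 53.98, "hybrid": 50.01, "rerank": 57.96},
}


def deltas(row):
    """Change in percentage points relative to the single-retriever run."""
    return {method: round(row[method] - row["single"], 2)
            for method in ("hybrid", "rerank")}


changes = {model: deltas(row) for model, row in reported.items()}
for model, change in changes.items():
    print(model, change)
```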

Other observations

There seems to be some noise in the dataset: for some questions, more than one article can provide the answer. For example:

"query": "what is the name of the largest city in romania?",
"gold_article": "Bucharest",
"points": 52,
"retrieved_docs": [
    {
        "rank": 1,
        "title": "Romania",
        "content": "Religion\nRomania is a secular state. This means Romania has no national religion. The biggest religious group in Romania is the Romanian Orthodox Church. It is an autocephalous church inside of the Eastern Orthodox communion. In 2002, this religion made up 86.7% of the population. Other religions in Romania include Roman Catholicism (4.7%), Protestantism (3.7%), Pentecostalism (1.5%) and the Romanian Greek-Catholicism (0.9%).\n\nCities\n\nBucharest is the capital of Romania. It also is the biggest city in Romania, with a population of over 2 millions peoples.\n\nThere are 5 other cities in Romania that have a population of more than 300,000 people. These are Iaşi, Cluj-Napoca, Timişoara, Constanţa, and Craiova. Romania also has 5 cities that have more than 200,000 people living in them: Galaţi, Braşov, Ploieşti, Brăila, and Oradea.\n\nThirteen other cities in Romania have a population of more than 100,000 people.\n\nEconomy",
        "distance": 0.49112021923065186
    },

The retrieved article ("Romania") contains the answer, but it isn't the exact expected document.
This suggests that Top-5 accuracy or LLM-based evaluation might better capture retrieval effectiveness. However, for the sake of this task, we will stick to top-1 accuracy.
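To illustrate how a top-k metric would treat such cases, here is a minimal sketch that counts a query as a hit when the gold article appears anywhere in the first k retrieved titles. The function name and the toy rankings are assumptions; this evaluation was not run as part of the reported results.

```python
def hit_rate_at_k(ranked_titles_per_query, gold_titles, k=5):
    """Fraction of queries whose gold title appears in the top k retrieved titles."""
    if not gold_titles:
        return 0.0
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_titles_per_query, gold_titles))
    return hits / len(gold_titles)


# Made-up rankings; the first query mirrors the Romania/Bucharest example.
ranked = [
    ["Romania", "Bucharest", "Cluj-Napoca"],
    ["Paris", "France"],
    ["Sun", "Star"],
]
gold = ["Bucharest", "Paris", "Moon"]

top1 = hit_rate_at_k(ranked, gold, k=1)
top5 = hit_rate_at_k(ranked, gold, k=5)
print(f"top-1: {top1:.2%}, top-5: {top5:.2%}")
```

With k=1 the Bucharest query is a miss, while with k=5 it becomes a hit because the gold article sits at rank 2.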

Installation & Usage 🛠️

Installation

conda create --name faiss_1.8.0 python=3.10
conda activate faiss_1.8.0
conda install -c pytorch -c nvidia faiss-gpu=1.8.0 pytorch=*=*cuda* pytorch-cuda=12 numpy

Note: please adjust based on your CUDA version.

pip install -r requirements.txt

Running experiments

CUDA_VISIBLE_DEVICES=0 python main_experiments.py --config config_file_path
