
BM25 Benchmarks

Benchmarking

To run a benchmark on one of the BM25 implementations, run:

# For bm25_pt
python -m benchmark.on_bm25_pt -d "<dataset>"

# For rank-bm25
python -m benchmark.on_rank_bm25 -d "<dataset>"

# For Pyserini
python -m benchmark.on_pyserini -d "<dataset>"

# For Elasticsearch (after starting the server; see below)
python -m benchmark.on_elastic -d "<dataset>"

# For PISA
python -m benchmark.on_pisa -d "<dataset>"

where <dataset> is the name of the dataset to use.
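
For example, to benchmark rank-bm25 on scifact (one of the smallest datasets, per the stats below):

python -m benchmark.on_rank_bm25 -d "scifact"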

Available datasets

The available datasets are the public BEIR datasets: trec-covid, nfcorpus, fiqa, arguana, webis-touche2020, quora, scidocs, scifact, cqadupstack, nq, msmarco, hotpotqa, dbpedia-entity, fever, and climate-fever.

Sampling during benchmarking

For rank-bm25, due to its long runtime, you can sample a subset of the queries:

python -m benchmark.on_rank_bm25 -d "<dataset>" --samples <num_samples>

Rank-bm25 variants

For rank-bm25, you can also choose the scoring variant with --method (see the example after this list):

  • rank (default)
  • bm25l
  • bm25+
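
For example, to benchmark the bm25l variant on 500 sampled queries of nq (the sample count is illustrative, and this assumes --samples and --method can be combined in one run):

python -m benchmark.on_rank_bm25 -d "nq" --samples 500 --method bm25l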

Results will be saved in the results/ directory.

Elasticsearch server

If you want to use Elasticsearch, you need to start the server first.

First, download Elasticsearch from the official downloads page (the commands below fetch version 8.14.0 for Linux x86_64). You will get an archive, e.g. elasticsearch-8.14.0-linux-x86_64.tar.gz. Extract it and make sure the extracted folder is in the same directory as the bm25-benchmarks directory.

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.14.0-linux-x86_64.tar.gz
tar -xzf elasticsearch-8.14.0-linux-x86_64.tar.gz
# remove the tar file
rm elasticsearch-8.14.0-linux-x86_64.tar.gz

Then, start the server with the following command:

./elasticsearch-8.14.0/bin/elasticsearch -E xpack.security.enabled=false -E thread_pool.search.size=1 -E thread_pool.write.size=1
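
Once the server is up, you can sanity-check that it is reachable before launching the benchmark. With security disabled as above, no credentials are needed; 9200 is Elasticsearch's default HTTP port:

# Should print a JSON summary with the cluster name and version
curl http://localhost:9200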

Results

The benchmarks are run in Kaggle notebooks to ensure reproducibility. Each run uses a single core of an Intel Xeon CPU @ 2.20GHz with 30GB of RAM.

The shorthands used are:

  • PT (BM25PT) for bm25_pt
  • PSRN for pyserini
  • R-BM25 (Rank) for rank-bm25
  • BM25S for bm25s, and BM25S+J for the Numba JIT version of bm25s (v0.2.0+)
  • ES for elasticsearch
  • PISA for the Pisa Engine (via the pyterrier_pisa Python bindings)
  • OOM for out-of-memory error
  • DNT for did not terminate (i.e. went over 12 hours)

Queries per second

dataset PISA BM25S+J BM25S ES PSRN PT R-BM25
arguana 270.53 869.95 573.91 13.67 11.95 110.51 2
climate-fever 35.95 38.49 13.09 4.02 8.06 OOM 0.03
cqadupstack 362.39 396.5 170.91 13.38 DNT OOM 0.77
dbpedia-entity 197.45 71.8 13.44 10.68 12.69 OOM 0.11
fever 81.42 53.84 20.19 7.45 10.52 OOM 0.06
fiqa 714.35 1237.39 717.78 16.96 12.51 20.52 4.46
hotpotqa 54.98 47.16 20.88 7.11 10.41 OOM 0.04
msmarco 178.65 39.18 12.2 11.88 11.01 OOM 0.07
nfcorpus 5111.72 5696.21 1196.16 45.84 32.94 256.67 224.66
nq 168.12 109.47 41.85 12.16 11.04 OOM 0.1
quora 735.20 479.71 272.04 21.8 15.58 6.49 1.18
scidocs 818.97 1448.32 767.05 17.93 14.1 41.34 9.01
scifact 1463.73 2787.84 1317.12 20.81 15.02 184.3 47.6
trec-covid 282.94 483.84 85.64 7.34 8.53 3.73 1.48
webis-touche2020 431.12 390.03 60.59 13.53 12.36 OOM 1.1

Notes:

  • For Rank-BM25, larger datasets are run with 1,000 sampled queries rather than the full query set, to ensure the benchmark finishes within 12h (the limit for Kaggle notebooks).
  • For ES and BM25S, you can set the number of threads to use. However, you might not see an improvement; in the case of BM25S you might even see a decrease in throughput, due to how its multi-threading is implemented. The multi-threaded results are shown below.
PISA, BM25S & ES multi-threaded (4T) performance (Q/s):
dataset PISA BM25S ES
arguana 590.93 211 33.37
climate-fever 91.68 22.06 8.13
cqadupstack 945.66 248.87 27.76
dbpedia-entity 478.26 26.18 15.49
fever 222.08 47.03 14.07
fiqa 1382.32 449.82 36.33
hotpotqa 134.60 45.02 10.35
msmarco 393.16 21.64 18.19
nfcorpus 6706.53 784.24 81.07
nq 423.54 77.49 19.18
quora 1892.98 308.58 43.02
scidocs 1757.44 614.23 46.36
scifact 2480.86 645.88 50.93
trec-covid 676.40 100.88 13.5
webis-touche2020 938.57 202.39 26.55
Throughput normalized with respect to Rank-BM25 (each value is the engine's queries/s divided by Rank-BM25's on the same dataset; e.g., for arguana, PISA's 270.53 / 2 ≈ 135.27):
dataset PISA BM25S ES PSRN PT Rank
arguana 135.27 286.96 6.84 5.98 55.26 1
climate-fever 1198.33 436.33 134 268.67 nan 1
cqadupstack 470.64 221.96 17.38 nan nan 1
dbpedia-entity 1795.00 122.18 97.09 115.36 nan 1
fever 1357.00 336.5 124.17 175.33 nan 1
fiqa 160.17 160.94 3.8 2.8 4.6 1
hotpotqa 1374.50 522 177.75 260.25 nan 1
msmarco 2552.14 174.29 169.71 157.29 nan 1
nfcorpus 22.75 5.32 0.2 0.15 1.14 1
nq 1681.20 418.5 121.6 110.4 nan 1
quora 623.05 230.54 18.47 13.2 5.5 1
scidocs 90.90 85.13 1.99 1.56 4.59 1
scifact 30.75 27.67 0.44 0.32 3.87 1
trec-covid 191.18 57.86 4.96 5.76 2.52 1
webis-touche2020 391.93 55.08 12.3 11.24 nan 1

Stats

# Docs # Queries # Tokens
dataset # Docs # Queries # Tokens
msmarco 8,841,823 6,980 340,859,891
hotpotqa 5,233,329 7,405 169,530,287
trec-covid 171,332 50 20,231,412
webis-touche2020 382,545 49 74,180,340
arguana 8,674 1,406 947,470
fiqa 57,638 648 5,189,035
nfcorpus 3,633 323 614,081
climate-fever 5,416,593 1,535 318,190,120
nq 2,681,468 3,452 148,249,808
scidocs 25,657 1,000 3,211,248
quora 522,931 10,000 4,202,123
dbpedia-entity 4,635,922 400 162,336,256
cqadupstack 457,199 13,145 44,857,487
fever 5,416,568 6,666 318,184,321
scifact 5,183 300 812,074

Indexing throughput (docs/s)

The following results follow the same setup as the queries/s benchmarks described above (single-core).

dataset PISA BM25S ES PSRN PT Rank
arguana 3432.50 4314.79 3591.63 1225.18 638.1 5021.3
climate-fever 5462.73 4364.43 3825.89 6880.42 nan 7085.51
cqadupstack 3963.76 4800.89 3725.43 nan nan 5370.32
dbpedia-entity 9019.62 7576.28 6333.82 8501.7 nan 9110.36
fever 4903.06 4921.88 3879.63 7007.5 nan 5482.64
fiqa 4426.92 5959.25 4035.11 3735.38 421.51 6455.53
hotpotqa 9883.85 7420.39 5455.6 10342.5 nan 9407.9
msmarco 10205.53 7480.71 5391.29 9686.07 nan 12455.9
nfcorpus 2381.11 3169.4 1688.15 692.05 442.2 3579.47
nq 7122.05 6083.86 5742.13 6652.33 nan 6048.85
quora 38512.02 28002.4 8189.75 22818.5 6251.26 47609.2
scidocs 3085.13 4107.46 3008.45 2137.64 312.72 4232.15
scifact 2449.91 3253.63 2649.57 880.53 442.61 3792.84
trec-covid 4642.59 4600.14 2966.98 3768.1 406.37 4672.62
webis-touche2020 2228.10 2971.96 2484.87 2718.41 nan 3115.96

NDCG@10

We use abbreviations for the BEIR benchmark datasets.

Dataset abbreviations:
  • AG for arguana
  • CD for cqadupstack
  • CF for climate-fever
  • DB for dbpedia-entity
  • FQ for fiqa
  • FV for fever
  • HP for hotpotqa
  • MS for msmarco
  • NF for nfcorpus
  • NQ for nq
  • QR for quora
  • SD for scidocs
  • SF for scifact
  • TC for trec-covid
  • WT for webis-touche2020
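
In the tables below, k1 and b are the standard BM25 free parameters: k1 controls term-frequency saturation and b controls document-length normalization. For reference, the classic Robertson-style scoring function, which the benchmarked variants (Lucene, ATIRE, BM25+, BM25L) each modify slightly, is:

score(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t,d) \, (k_1 + 1)}{f(t,d) + k_1 (1 - b + b \, |d| / \mathrm{avgdl})}

where f(t,d) is the count of term t in document d, |d| is the length of d in tokens, and avgdl is the average document length in the corpus.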
k1 b method Avg. AG CD CF DB FQ FV HP MS NF NQ QR SD SF TC WT
0.9 0.4 Lucene 41.1 40.8 28.2 16.2 31.9 23.8 63.8 62.9 22.8 31.8 30.5 78.7 15.0 67.6 58.9 44.2
1.2 0.75 ATIRE 39.9 48.7 30.1 13.7 30.3 25.3 50.3 58.5 22.6 31.8 29.1 80.5 15.6 68.1 61.0 33.2
1.2 0.75 BM25+ 39.9 48.7 30.1 13.7 30.3 25.3 50.3 58.5 22.6 31.8 29.1 80.5 15.6 68.1 61.0 33.2
1.2 0.75 BM25L 39.5 49.6 29.8 13.5 29.4 25.0 46.6 55.9 21.4 32.2 28.1 80.3 15.8 68.7 62.9 33.0
1.2 0.75 Lucene 39.9 48.7 30.1 13.7 30.3 25.3 50.3 58.5 22.6 31.8 29.1 80.5 15.6 68.0 61.0 33.2
1.2 0.75 Robertson 39.9 49.2 29.9 13.7 30.3 25.4 50.3 58.5 22.6 31.9 29.2 80.4 15.5 68.3 59.0 33.8
1.5 0.75 ES 42.0 47.7 29.8 17.8 31.1 25.3 62.0 58.6 22.1 34.4 31.6 80.6 16.3 69.0 68.0 35.4
1.5 0.75 Lucene 39.7 49.3 29.9 13.6 29.9 25.1 48.1 56.9 21.9 32.1 28.5 80.4 15.8 68.7 62.3 33.1
1.5 0.75 PSRN 40.0 48.4 29.8 14.2 30.0 25.3 50.0 57.6 22.1 32.6 28.6 80.6 15.6 68.8 63.4 33.5
1.5 0.75 PT 45.0 44.9 -- -- -- 22.5 -- -- -- 31.9 -- 75.1 14.7 67.8 58.0 --
1.5 0.75 Rank 39.6 49.5 29.6 13.6 29.9 25.3 49.3 58.1 21.1 32.1 28.5 80.3 15.8 68.5 60.1 32.9
1.2 0.75 PISA 38.8 41.1 27.8 13.9 30.5 24.5 49.2 58.2 22.8 34.3 28.2 72.0 15.7 68.9 64.2 30.9

Recall@1000

k1 b method Avg. AG CD CF DB FQ FV HP MS NF NQ QR SD SF TC WT
0.9 0.4 Lucene 77.3 98.8 71.1 63.3 67.5 74.3 95.7 88.0 85.3 47.7 89.6 99.5 56.5 97.0 39.2 86.0
1.2 0.75 ATIRE 77.4 99.3 73.0 59.0 67.0 76.5 94.2 86.8 85.7 47.8 89.8 99.5 57.3 97.0 40.3 87.2
1.2 0.75 BM25+ 77.4 99.3 73.0 59.0 67.0 76.5 94.2 86.8 85.7 47.8 89.8 99.5 57.3 97.0 40.3 87.2
1.2 0.75 BM25L 77.2 99.4 73.4 57.3 66.1 77.3 93.7 85.7 85.0 47.7 89.3 99.5 57.7 97.0 40.8 87.5
1.2 0.75 Lucene 77.4 99.3 73.0 59.0 67.0 76.5 94.2 86.8 85.6 47.8 89.8 99.5 57.3 97.0 40.3 87.2
1.2 0.75 Robertson 77.4 99.3 73.2 59.1 66.7 76.8 94.2 86.8 85.9 47.5 89.8 99.5 57.3 96.7 40.2 87.4
1.5 0.75 ES 76.9 99.2 74.2 58.8 63.6 76.7 95.9 85.2 85.1 39.0 90.8 99.6 57.9 98.0 41.3 88.0
1.5 0.75 Lucene 77.2 99.3 73.3 57.8 66.3 77.2 93.8 86.1 85.2 47.7 89.5 99.6 57.5 97.0 40.6 87.4
1.5 0.75 PSRN 76.7 99.2 74.2 58.7 66.2 76.7 94.2 86.4 85.1 37.1 89.4 99.6 57.4 97.7 41.1 87.2
1.5 0.75 PT 73.0 98.3 -- -- -- 72.5 -- -- -- 51.0 -- 98.9 56.0 97.8 36.3 --
1.5 0.75 Rank 77.1 99.4 73.4 57.5 66.4 77.4 93.6 87.7 82.6 47.6 89.5 99.5 57.4 96.7 40.5 87.5
1.2 0.75 PISA 77.1 98.7 72.2 60.2 67.7 76.5 93.7 86.8 86.9 38.4 89.1 98.9 56.9 97.0 45.9 87.4
