An end-to-end neural ad-hoc ranking pipeline.
OpenNIR requires Python 3.6 (not tested with other versions). Java 11 is required (for Anserini).
- OpenNIR can also be run in Docker; you can find instructions here.
Install dependencies
pip install -r requirements.txt
Train and validate a model (here, ConvKNRM on ANTIQUE):
scripts/pipeline.sh config/conv_knrm config/antique
(Performance on the test set can be obtained by adding pipeline.test=True)
Grid search for BM25 over ANTIQUE for comparison with neural model performance:
scripts/pipeline.sh config/grid_search config/antique
(Performance on the test set can be obtained by adding pipeline.test=True)
Models, datasets, and vocabularies will be saved in ~/data/onir/. This can be overridden by
setting data_dir=~/some/other/place/ as a command line argument, in a configuration file, or in
the ONIR_ARGS environment variable.
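The commands shown earlier share one shape: the pipeline script, one or more configuration files, then optional key=value settings such as pipeline.test=True or data_dir=... A minimal Python sketch of assembling such a command follows. The helper name build_pipeline_command is hypothetical (it is not part of OpenNIR), and placing the key=value settings after the configuration files is an assumption based on the examples above.

```python
from typing import Dict, List


def build_pipeline_command(configs: List[str], overrides: Dict[str, str]) -> List[str]:
    """Assemble an argument list for scripts/pipeline.sh.

    Hypothetical helper for illustration only; it is not part of OpenNIR.
    """
    command = ["scripts/pipeline.sh", *configs]
    # Each override becomes a single key=value argument, e.g. pipeline.test=True.
    command.extend(f"{key}={value}" for key, value in overrides.items())
    return command


command = build_pipeline_command(
    ["config/conv_knrm", "config/antique"],
    {"pipeline.test": "True", "data_dir": "~/some/other/place/"},
)
print(" ".join(command))
```

The resulting list could then be handed to subprocess.run(command, check=True) from the repository root.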
- DRMM: ranker=drmm (paper)
- Duet (local model): ranker=duetl (paper)
- MatchPyramid: ranker=matchpyramid (paper)
- KNRM: ranker=knrm (paper)
- PACRR: ranker=pacrr (paper)
- ConvKNRM: ranker=conv_knrm (paper)
- Vanilla BERT: config/vanilla_bert (paper)
- CEDR models: config/cedr/[model] (paper)
- MatchZoo models (source)
  - MatchZoo's KNRM: ranker=mz_knrm
  - MatchZoo's ConvKNRM: ranker=mz_conv_knrm
- TREC Robust 2004: config/robust/fold[x]
- MS-MARCO: config/msmarco
- ANTIQUE: config/antique
- TREC CAR: config/car
- New York Times: config/nyt -- for content-based weak supervision
- TREC Arabic, Mandarin, and Spanish: config/multiling/* -- for zero-shot multilingual transfer learning (instructions)
New: Any measure from the ir-measures package.
- map (from trec_eval)
- ndcg (from trec_eval)
- ndcg@X (from trec_eval, gdeval)
- p@X (from trec_eval)
- err@X (from gdeval)
- mrr (from trec_eval)
- rprec (from trec_eval)
- judged@X (implemented in python)
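To make a few of these measures concrete, here is a minimal pure-Python sketch of p@X, judged@X, and the per-query reciprocal rank that mrr averages over queries. It follows the common textbook definitions and is not OpenNIR's code: the function names, the toy ranking, and the toy judgments are all made up for illustration, and trec_eval or gdeval may treat edge cases (ties, short rankings) differently.

```python
from typing import Dict, List


def precision_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of the top-k documents judged relevant (label > 0)."""
    top = ranking[:k]
    return sum(1 for doc in top if qrels.get(doc, 0) > 0) / k


def judged_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of the top-k documents that have any judgment at all."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in qrels) / k


def reciprocal_rank(ranking: List[str], qrels: Dict[str, int]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) > 0:
            return 1.0 / rank
    return 0.0


# Toy data: d7 was never judged, d1 is the only relevant document.
ranking = ["d3", "d1", "d7", "d2"]
qrels = {"d1": 1, "d2": 0, "d3": 0}

print(precision_at_k(ranking, qrels, 2))
print(judged_at_k(ranking, qrels, 4))
print(reciprocal_rank(ranking, qrels))
```

A low judged@X alongside a low p@X suggests that many retrieved documents were simply never assessed, which is why the two are often reported together.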
- Binary term matching: vocab=binary (i.e., changes the interaction matrix from cosine similarity to binary indicators)
- Pretrained word vectors: vocab=wordvec
  - vocab.source=fasttext
    - vocab.variant=wiki-news-300d-1M, vocab.variant=crawl-300d-2M
    - (information about FastText variants can be found here)
  - vocab.source=glove
    - vocab.variant=cc-42b-300d, vocab.variant=cc-840b-300d
    - (information about GloVe variants can be found here)
  - vocab.source=convknrm
    - vocab.variant=knrm-bing, vocab.variant=knrm-sogou, vocab.variant=convknrm-bing, vocab.variant=convknrm-sogou
    - (information about ConvKNRM word embedding variants can be found here)
  - vocab.source=bionlp
    - vocab.variant=pubmed-pmc
    - (information about BioNLP variants can be found here)
- Pretrained word vectors w/ single UNK vector for unknown terms: vocab=wordvec_unk
  - (with above word embedding sources)
- Pretrained word vectors w/ hash-based random selection for unknown terms: vocab=wordvec_hash (default)
  - (with above word embedding sources)
- BERT contextualized embeddings: vocab=bert
  - Core models (from HuggingFace): vocab.bert_base=bert-base-uncased (default), vocab.bert_base=bert-large-uncased, vocab.bert_base=bert-base-cased, vocab.bert_base=bert-large-cased, vocab.bert_base=bert-base-multilingual-uncased, vocab.bert_base=bert-base-multilingual-cased, vocab.bert_base=bert-base-chinese, vocab.bert_base=bert-base-german-cased, vocab.bert_base=bert-large-uncased-whole-word-masking, vocab.bert_base=bert-large-cased-whole-word-masking, vocab.bert_base=bert-large-uncased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-large-cased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-base-cased-finetuned-mrpc
  - SciBERT: vocab.bert_base=scibert-scivocab-uncased, vocab.bert_base=scibert-scivocab-cased, vocab.bert_base=scibert-basevocab-uncased, vocab.bert_base=scibert-basevocab-cased
  - BioBERT: vocab.bert_base=biobert-pubmed-pmc, vocab.bert_base=biobert-pubmed, vocab.bert_base=biobert-pmc
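The vocab=binary option is described above as swapping cosine similarity for binary indicators in the interaction matrix. The sketch below illustrates that difference on a toy query and document. It is not OpenNIR's implementation: the function names and the two-dimensional word vectors are invented for the example, and a real ranker would operate on batched tensors.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cosine_interaction(query: List[str], doc: List[str],
                       vectors: Dict[str, List[float]]) -> List[List[float]]:
    """Rows are query terms, columns are document terms, cells are similarities."""
    return [[cosine(vectors[q], vectors[d]) for d in doc] for q in query]


def binary_interaction(query: List[str], doc: List[str]) -> List[List[float]]:
    """Same layout, but a cell is 1.0 only when the two terms are identical."""
    return [[1.0 if q == d else 0.0 for d in doc] for q in query]


# Invented toy embeddings: "inexpensive" points partly in the direction of "cheap".
vectors = {
    "cheap": [1.0, 0.0],
    "flights": [0.0, 1.0],
    "inexpensive": [0.6, 0.8],
}
query = ["cheap", "flights"]
doc = ["inexpensive", "flights"]

soft = cosine_interaction(query, doc, vectors)
exact = binary_interaction(query, doc)
print(soft)
print(exact)
```

With cosine similarity the pair cheap/inexpensive receives partial credit, while the binary matrix only marks the exact match on "flights".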
If you use OpenNIR, please cite the following WSDM demonstration paper:
@InProceedings{macavaney:wsdm2020-onir,
author = {MacAvaney, Sean},
title = {{OpenNIR}: A Complete Neural Ad-Hoc Ranking Pipeline},
booktitle = {{WSDM} 2020},
year = {2020}
}
I gratefully acknowledge support for this work from the ARCS Endowment Fellowship. I thank Andrew Yates, Arman Cohan, Luca Soldaini, Nazli Goharian, and Ophir Frieder for valuable feedback on the manuscript and/or code contributions to OpenNIR.