All datasets are located on `/datasets`, a volume reserved for IR test collections, document corpora, and other data used in our research.

| Dataset | Creator | Year | Size | Type | Use case |
|---|---|---|---|---|---|
| AOL | G. Pass, A. Chowdhury, C. Torgeson | 2006 | 2.1G (zipped) | IR test collection | personalization, query reformulation or other types of search research |
| semanticscholar | Waleed Ammar | 2019 | 46G (zipped) | document corpus | ad-hoc retrieval |
| iSearch | Aalborg University | 2010 | 50G (zipped) | IR test collection | integrated search and citation-based retrieval |
| Washington Post | NIST | 2018 | 1.5G (zipped) | IR test collection | ad-hoc retrieval |
| Washington Post (v4) | NIST | 2021 | 2.4G (zipped) | IR test collection | ad-hoc retrieval |
| Tipster 1/2/3 | NIST | 1994 | 1.3G (zipped) | IR test collection | ad-hoc retrieval |
| TREC Disks 4/5 | NIST | 1997 | 820M (zipped) | document corpus | ad-hoc retrieval |
| New York Times | Evan Sandhaus | 2008 | 1G (zipped) | document corpus | ad-hoc retrieval |
| AQUAINT | David Graff | 2002 | 3G (zipped) | document corpus | ad-hoc retrieval |
| GIRT4 | GESIS-IZ | 2006 | 110M (zipped) | IR test collection | ad-hoc retrieval, domain-specific, multilingual |
| TripClick | Navid Rekabsaz, Oleg Lesota, Markus Schedl, Jon Brassey, Carsten Eickhoff | 2021 | 32.7G (zipped) | Click log dataset | ad-hoc retrieval, deep learning models |
| Yahoo-L18 | Yahoo! Research | 2009/10 | 1.3G (zipped) | Click log dataset | ad-hoc retrieval, session analysis |
| Yandex - Personalized Web Search Challenge | Eugene Kharitonov, Pavel Serdyukov | 2014 | 5.9G (zipped) | Click log dataset | ad-hoc retrieval, session analysis |
| TREC-OpenSearch | TREC OpenSearch Organizers | 2016/17 | 600M (zipped) | Click log dataset | ad-hoc retrieval, session analysis |

To add a new dataset:

- Log in to `linux2`.
- Create a new folder for the dataset, copy `README.template.md` into it, and rename the copy to `README.md`.
- Describe the dataset in `README.md`, following the template.
- Copy all files for the dataset to the folder and add all binary files and folders to `.gitignore`.
- Commit `README.md` and any additional files you would like to see on GitHub.
- Update this page to include a brief description of the dataset.
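
The file-system part of these steps (creating the folder, seeding `README.md` from the template, and extending `.gitignore`) could be scripted roughly as follows. This is a minimal sketch, not an official tool. It assumes that `README.template.md` and `.gitignore` both sit in the repository root, and the function name `scaffold_dataset` and its parameters are hypothetical.

```python
from pathlib import Path
import shutil


def scaffold_dataset(repo_root: Path, name: str, ignore_patterns: list[str]) -> Path:
    """Create <repo_root>/<name>/, seed its README.md from the template,
    and append the given patterns to <repo_root>/.gitignore."""
    repo_root = Path(repo_root)
    # Assumption: the template lives in the repository root.
    template = repo_root / "README.template.md"

    # Fail loudly if the dataset folder already exists.
    dataset_dir = repo_root / name
    dataset_dir.mkdir(parents=True, exist_ok=False)

    # Copy the template and rename it to README.md in one step.
    shutil.copyfile(template, dataset_dir / "README.md")

    # Append ignore rules, skipping rules that are already listed.
    gitignore = repo_root / ".gitignore"
    text = gitignore.read_text() if gitignore.exists() else ""
    existing = set(text.splitlines())
    new_rules = [p for p in ignore_patterns if p not in existing]
    if new_rules:
        if text and not text.endswith("\n"):
            text += "\n"
        gitignore.write_text(text + "\n".join(new_rules) + "\n")

    return dataset_dir
```

A call such as `scaffold_dataset(Path("/datasets"), "my-collection", ["my-collection/*.gz"])` (dataset name and pattern are placeholders) would leave the describing, committing, and page-update steps to be done by hand.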