qb2nq (QuizBowl 2 Natural Questions transformation) is a project that transforms complicated trivia questions from the Quizbowl (QB) dataset into simpler questions in the style of the Natural Questions (NQ) dataset, for better Question-Answering (QA) performance.
To run our code, clone the repository with git, change directory into it, and then run the following commands.
Running make prereqs downloads the required datasets and installs the required packages:
make prereqs
Running make clean deletes all intermediate and final results generated by the program:
make clean
The first step is to examine how different answers are referred to in the dataset.
make intermediate_results/lat_frequency.json
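As a rough illustration of what "how answers are referred to" means, here is a minimal sketch. It assumes that QB questions mention the answer with phrases such as "this river" or "these poems" and simply counts the word after "this"/"these". The function name and the regular expression are hypothetical and are not taken from the repository, whose script may work differently.

```python
import re
from collections import Counter

# Assumption: the answer is mentioned as "this <noun>" or "these <noun>".
MENTION = re.compile(r"\b(?:this|these)\s+([a-z]+)", re.IGNORECASE)

def count_answer_references(questions):
    """Count the word that follows "this"/"these" across all question texts."""
    counts = Counter()
    for text in questions:
        for noun in MENTION.findall(text):
            counts[noun.lower()] += 1
    return counts

sample = [
    "This river forms the Bujagali and Murchison Falls.",
    "For 10 points, name this river that ends in Khartoum.",
    "This author wrote about these poems.",
]
freq = count_answer_references(sample)
print(freq.most_common())
```

A regular expression like this will also pick up phrases that do not refer to the answer, so the counts are only a first approximation.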
Next, we transform questions from the QB format to look like the NQ format.
make intermediate_results/nq_like_questions.json
This step is quite slow (it could probably be parallelized), so it may be better to test it first on a small number of questions. The following command transforms only 100 questions:
python3 transform_question.py --limit=100
Example: Original QB Question (Elicitation)
Original QB elicitation:
This river forms the Bujagali and Murchison Falls in its "Victoria" incarnation, and it also contains a segment named after Lake Albert.
One of its deltas forms the Sudd wetland region, and the Jonglei canal was proposed to reroute part of it around the Sudd.
Its headstreams include the Luvironza, and the Owen Falls Dam used it for hydroelectric power until 2006.
Called the Bahr al Jabal upon entering Sudan and joining the Bahr el Ghazal at Lake No, it originates near Jinja in Lake Victoria.
For 10 points, name this river that ends in Khartoum and unites with its blue counterpart as part of the longest river in the world.
Answer: White Nile
Heuristic 1: Split via Conjunctions
This river forms the Bujagali and Murchison Falls in its "Victoria" incarnation.
It contains a segment named after Lake Albert.
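A minimal sketch of this kind of split, assuming a clause boundary looks like a comma, then "and", then a pronoun (optionally followed by "also"). The pattern and function name are hypothetical; the repository may use a parser or different rules.

```python
import re

# Assumption: a new clause starts at ", and <pronoun> [also] ...".
CLAUSE_BREAK = re.compile(r",\s+and\s+(it|he|she|they)\s+(?:also\s+)?", re.IGNORECASE)

def split_on_conjunction(sentence):
    """Split a sentence into stand-alone clauses at ', and <pronoun>'."""
    pieces = CLAUSE_BREAK.split(sentence.strip())
    clauses = [pieces[0]]
    # re.split interleaves each captured pronoun with the text that follows it.
    for pronoun, rest in zip(pieces[1::2], pieces[2::2]):
        clauses.append(pronoun.capitalize() + " " + rest)
    # Make sure every clause ends with exactly one period.
    return [clause.rstrip(".") + "." for clause in clauses]

qb_sentence = (
    'This river forms the Bujagali and Murchison Falls in its "Victoria" '
    "incarnation, and it also contains a segment named after Lake Albert."
)
for clause in split_on_conjunction(qb_sentence):
    print(clause)
```

Requiring the comma keeps "Bujagali and Murchison Falls" intact, since that "and" joins two nouns rather than two clauses.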
Heuristic 2: Conversion when there are no wh-words
Which river forms the Bujagali and Murchison Falls in its "Victoria" incarnation?
Which river contains a segment named after Lake Albert?
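A minimal sketch of this conversion, assuming the answer type (here "river") is already known and passed in by the caller. The wh-word list, the function names, and the set of pronouns handled are assumptions made for illustration.

```python
import re

WH_WORDS = {"what", "which", "who", "whom", "whose", "when", "where", "why", "how"}

def has_wh_word(sentence):
    """Return True if any token of the sentence is a wh-word."""
    return any(tok in WH_WORDS for tok in re.findall(r"[a-z]+", sentence.lower()))

def to_wh_question(sentence, answer_type):
    """Turn 'This <type> ...' or 'It ...' into 'Which <type> ...?'."""
    if has_wh_word(sentence):
        return sentence  # already contains a wh-word; leave it untouched
    body = sentence.strip().rstrip(".")
    # Assumption: the subject is "this <type>" or a simple pronoun.
    subject = r"^(?:this\s+%s|it|he|she)\b" % re.escape(answer_type)
    body, replaced = re.subn(subject, "Which " + answer_type, body,
                             count=1, flags=re.IGNORECASE)
    return body + "?" if replaced else sentence

print(to_wh_question("It contains a segment named after Lake Albert.", "river"))
```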
Heuristic 3: Conversion from Imperative to Interrogative
Which river unites with its blue counterpart as part of the longest river in the world?
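A minimal sketch of the imperative-to-interrogative step, assuming the final QB sentence has the form "For 10 points, name this <type> that ...". The pattern and function name are hypothetical. Note that this sketch keeps the whole relative clause, whereas the example above shows only one of its parts, as it would appear after the conjunction split.

```python
import re

# Assumption: imperatives look like "[For 10 points,] name this <type> that <rest>".
IMPERATIVE = re.compile(
    r"^(?:for\s+(?:10|ten)\s+points,\s*)?"
    r"name\s+(?:this|these)\s+(?P<type>\w+)\s+"
    r"(?:that|which|who)\s+(?P<rest>.+?)[.?]?$",
    re.IGNORECASE,
)

def imperative_to_question(sentence):
    """Rewrite 'name this X that ...' as 'Which X ...?'."""
    match = IMPERATIVE.match(sentence.strip())
    if match is None:
        return sentence  # not an imperative of the assumed form
    return "Which {} {}?".format(match.group("type"), match.group("rest"))

final_sentence = (
    "For 10 points, name this river that ends in Khartoum and unites with "
    "its blue counterpart as part of the longest river in the world."
)
print(imperative_to_question(final_sentence))
```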
After that, the next step is to run a classifier to distinguish QB questions from NQ questions.
make intermediate_results/logistic_regression_weight_dict_Qb_NQ.txt
This can reveal mistakes/problems in the transformation process. For instance, if you look at this feature set:
{"length": 2.1673546575675235, "ablength": 0.0, "START the": -0.1784736506085197, "START what": 0.6402255772376372, "START when": -0.5080459895220035, "START where": -1.074377737159208, "START who": -0.9162552042417396, "after the": 0.45851831536207965, "as the": 0.40412419433014296, "battle of": 0.1995002103612401, "black history": -0.24677112851309396, "by the": 0.3487353258571355, "did the": -0.6921127080342989, "filmed STOP": -0.23510293344821942, "guadeloupe in": 0.0827023037431746, "in his": 0.5877131237947829, "in the": -0.2604679092872282, "in what": 0.0827023037431746, "india STOP": -0.4548609004189222, "is the": -0.5464633523104327, "keep guadeloupe": 0.0827023037431746, "life of": 0.2261374967836842, "my heart": -0.35833224553475324, "not to": 0.0827023037431746, "of india": -0.30017948289961044, "of the": -0.36922395137105696, "of this": 0.23468026711909745, "on the": 0.11904041390054465, "part of": 0.10193534107347967, "ruler of": 0.5134096172530246, "series STOP": 0.3476276322692837, "the british": -0.07502072137405887, "the first": 0.13955619294729724, "the life": 0.2261374967836842, "the most": -0.045238416997391326, "the world": -0.2296368046047485, "this event": 0.23468026711909745, "to keep": 0.0827023037431746, "was new": -0.4769144853770957, "was partly": 0.20061171775689368, "was the": -0.5019693682834611, "what author": 0.4538406936817783, "what is": -0.13603736395825397, "what man": 0.8408500019504009, "what was": -0.6029497385045198, "what what": 0.8078417803364502, "when did": -0.5080459895220035, "where is": -0.1354608211486916, "where was": -0.23510293344821942, "which can": 0.2556495617992419, "which is": -0.01747670564064363, "who wrote": -0.3121178156168799, "BIAS": -5.441712535065695}
This suggests that our NQ-like questions start with "the" too often and still contain QB-style phrases such as "of this".
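To make this kind of inspection easier, here is a minimal sketch. It assumes the features are word bigrams with START and STOP markers (as names like "START what" and "india STOP" suggest) and that the weights are available as a plain dictionary. The function names are hypothetical, and which sign corresponds to which class depends on how the labels were encoded, which is not stated here.

```python
from collections import Counter

def bigram_features(question):
    """Count word bigrams, with START/STOP markers at the question boundaries."""
    words = question.lower().rstrip("?").split()
    tokens = ["START"] + words + ["STOP"]
    return Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))

def strongest_features(weights, k=2):
    """Return the k most negative and the k most positive features, skipping BIAS."""
    ranked = sorted(
        ((name, w) for name, w in weights.items() if name != "BIAS"),
        key=lambda item: item[1],
    )
    return ranked[:k], ranked[-k:][::-1]

# A few weights copied from the feature set above.
weights = {
    "START what": 0.6402255772376372,
    "START where": -1.074377737159208,
    "did the": -0.6921127080342989,
    "of this": 0.23468026711909745,
    "what man": 0.8408500019504009,
    "what what": 0.8078417803364502,
    "BIAS": -5.441712535065695,
}
most_negative, most_positive = strongest_features(weights)
print("most negative:", most_negative)
print("most positive:", most_positive)
print(bigram_features("Where was it filmed?"))
```

Features at either extreme are the ones the classifier relies on most, so odd entries there (for example a repeated wh-word such as "what what") are good places to look for transformation bugs.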
The original QANTA can be found at https://sites.google.com/view/qanta/resources?authuser=0. The data can be found here: https://drive.google.com/drive/u/1/folders/1mebfGC5AakYHdmRLUf718oAsfEU8tcYM
Saptarashmi Bandyopadhyay
Hao Zou
Chenqi Zhu
Shraman Pal
Abhranil Chandra
Rohith Banda