This repository contains resources for our paper "D-GEN: Automatically Generating Distractors for Reliable LLM Evaluation".
The paper is accepted to ACL 2025 Findings.
D-GEN is the first open-source Large Language Model (LLM) for distractor generation.
We evaluate the quality of distractors via Ranking Alignment Test and Entropy Analysis.
- Ranking Alignment Test
To evaluate the effectiveness of our distractor generation model, we compare two test sets:
- MMLU: The original MMLU benchmark. Its original distractors are treated as gold distractors.
- MMLU-DGEN: A modified MMLU in which the original distractors are replaced with new distractors generated by D-GEN.
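The paper's exact alignment metric is not reproduced here; as an illustration, model rankings induced by the two test sets can be compared with Spearman rank correlation. Below is a minimal sketch with hypothetical accuracy scores (the function name and the numbers are assumptions for illustration, not values from the paper):

```python
def spearman_rank_correlation(scores_a, scores_b):
    """Spearman's rho between the rankings induced by two score lists (no ties)."""
    def ranks(scores):
        # Higher score -> better rank (rank 1 is best).
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        r = [0] * len(scores)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(scores_a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical per-model accuracies on the two test sets.
mmlu_acc = [0.72, 0.65, 0.81, 0.59]       # ranking on original MMLU
mmlu_dgen_acc = [0.70, 0.63, 0.79, 0.58]  # ranking on MMLU-DGEN

print(spearman_rank_correlation(mmlu_acc, mmlu_dgen_acc))  # → 1.0 (identical ranking)
```

A correlation near 1 indicates that the generated distractors preserve the model ranking produced by the original benchmark.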
- Entropy Analysis
We compute the entropy of the predicted probability distribution over answer choices (A, B, C, and D).
Entropy quantifies the model's prediction uncertainty, allowing us to analyze how convincing the distractors are based on the model's confidence.
We observe no significant entropy differences between MMLU and MMLU-DGEN for most models, as shown in the table above.
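The entropy computation described above can be sketched in plain Python. The probability values in the example are hypothetical, not results from the paper:

```python
import math

def choice_entropy(probs):
    """Shannon entropy (in bits) of a probability distribution over answer choices."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical predicted probabilities over choices A-D.
confident = [0.85, 0.05, 0.05, 0.05]  # model strongly favors choice A -> low entropy
uniform = [0.25, 0.25, 0.25, 0.25]    # maximum uncertainty over four choices

print(choice_entropy(confident))
print(choice_entropy(uniform))  # → 2.0, i.e. log2(4), the maximum for four choices
```

Lower entropy means the model is more confident in one choice; similar entropy on MMLU and MMLU-DGEN suggests the generated distractors are about as convincing as the originals.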
Models are released in two sizes, 8B and 70B, and can be found on Hugging Face:
The MMLU-DGEN dataset is available at:
Please contact Grace Byun ([email protected]) for any inquiries.