This package evaluates VLMs' reasoning ability on IQ tests.
This package depends on FlowDesign, which can be installed with:
pip install git+https://github.com/phuvinhnguyen/FlowDesign.git
After that, clone this repo and use `cd` to get into the IQBench folder.
To keep the authors' names hidden, the files questions.json and questions_processed.json are not included in the project. Instead, we include questions_processed_anonymous.json, which can be used as questions_processed.json. Images are also included in the repo, so you don't need to download the data (the next step).
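One way to use the anonymous file in place of questions_processed.json is to copy it to the expected name. This is a minimal sketch, assuming both files live in the `data` folder (a guess based on the paths used by the commands in this README). `use_anonymous_questions` is a hypothetical helper, not part of the repo.

```python
import shutil
from pathlib import Path


def use_anonymous_questions(data_dir: str = "./data") -> Path:
    """Copy questions_processed_anonymous.json to questions_processed.json.

    Assumption: both files belong in `data_dir`; adjust if the repo differs.
    """
    data = Path(data_dir)
    src = data / "questions_processed_anonymous.json"
    dst = data / "questions_processed.json"
    if not src.exists():
        raise FileNotFoundError(f"{src} not found; adjust data_dir")
    shutil.copyfile(src, dst)  # overwrites dst if it already exists
    return dst
```

Copying rather than renaming keeps the original anonymous file in place, so the repo stays clean under version control.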
# go to folder: IQBench/data
cd data
python process.py --json_file ./questions.json --output_folder ./images
This command will evaluate gemini-1.5-flash on our 500 samples. You can rerun it on the output file to handle samples that the VLM failed to process.
# go to folder: IQBench
python -m experiments.evaluate --input_file ./data/questions_processed.json \
--output_file ./results.json \
--model_name gemini/gemini-1.5-flash \
--api_token <YOUR_TOKEN> \
--num_samples 500
The result for each sample will be saved in results.json. You can compute the accuracy of each topic from this generated file.
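The per-topic accuracy could be computed along these lines. This is only a sketch under stated assumptions: results.json is taken to be a JSON list of records with the `topic`, `answer`, and `bot_answer` fields, and a prediction such as "C. 28" is counted as correct when it starts with the label "C". The repo may use a different matching rule, and `accuracy_by_topic` is a hypothetical helper.

```python
import json
from collections import defaultdict


def accuracy_by_topic(records):
    """Return {topic: fraction of answered samples whose prediction matches the label}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        pred = str(rec.get("bot_answer") or "").strip()
        if not pred:
            continue  # assumption: samples the VLM failed on have no bot_answer
        label = str(rec["answer"]).strip()
        topic = rec["topic"]
        total[topic] += 1
        # Assumption: "C. 28" matches the label "C" by its leading option letter.
        if pred.upper().startswith(label.upper()):
            correct[topic] += 1
    return {topic: correct[topic] / total[topic] for topic in total}


def load_records(path="./results.json"):
    """Load the per-sample results (assumed to be a JSON list)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Prefix matching suits multiple-choice labels. For free-form numeric answers it can over-count (a label "2" would match a prediction "28"), so a stricter comparison may be needed there. Samples without a prediction are skipped here rather than counted as wrong.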
This command will use gemini-1.5-flash as the LLM judge on our 500 samples. You can rerun it on the output file to handle samples that the judge model failed to process.
# go to folder: IQBench
# input file is the output file from the previous step
python -m experiments.llmjudge --input_file ./results.json \
--output_file ./results_llmjudge.json \
--model_name gemini/gemini-1.5-flash \
--api_token <YOUR_TOKEN> \
--num_samples 500
The result for each sample will be saved in results_llmjudge.json. You can compute the llmjudge score of each topic from this generated file.
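A per-topic llmjudge score can be sketched as the mean of `judge_answer` within each topic. The assumptions here are that results_llmjudge.json is a JSON list of records, that `judge_answer` holds "1" or "0" (as in the sample record in this README), and that any other value marks an unjudged sample. `judge_score_by_topic` is a hypothetical helper, not part of the repo.

```python
from collections import defaultdict


def judge_score_by_topic(records):
    """Return {topic: mean of judge_answer over the judged samples}."""
    scores = defaultdict(list)
    for rec in records:
        raw = str(rec.get("judge_answer", "")).strip()
        if raw not in ("0", "1"):
            continue  # assumption: unjudged or failed samples carry no 0/1 value
        scores[rec["topic"]].append(int(raw))
    return {topic: sum(vals) / len(vals) for topic, vals in scores.items()}
```

The records can be loaded with `json.load` from ./results_llmjudge.json before calling the function.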
Refer to all files in ./scripts
- `id`: Unique sample ID
- `file_name`: Original image file name
- `online_link`: URL to online image source
- `source`: Source of the question (if collected)
- `topic`: Topic/category of the image
- `author`: Author of the sample

#### Image Info

- `link`: Renamed local image file
- `local_path` / `images`: Local path to downloaded image

- `question` / `questions`: Question related to the image
- `answer`: Ground-truth answer (label)
- `pattern`: Ground-truth reasoning/explanation

- `ready`: Whether sample is ready for evaluation (`1` or `0`)

- `think`: VLM's reasoning before answering
- `bot_answer`: VLM's final answer
- `full_answer`: Full generated response
- `model_name`: Name of evaluated VLM

- `judge_think`: Judgment on VLM's reasoning
- `judge_evidence`: Supporting or refuting evidence
- `judge_answer`: Final judgment (`1` = correct, `0` = incorrect)
- `judge_model`: LLM used for judgment
{
"id": 1,
"file_name": "C1.png",
"online_link": "https://...",
"topic": "Mechanical Deductive Reasoning Test",
"author": "anonymous",
"question": "The smaller gear makes 36 revolutions by the time the larger ones make 18.\nIf the smaller gear makes 56 revolutions, how many turns will the larger one make?\nA. 24 \nB. 26 \nC. 28 \nD. 25 \n",
"answer": "C",
"pattern": "Explanation:\n\\[\nu = \\frac{18}{36} = 0.5\n\\]\n\\[\nn_{\\text{larger}} = u \\cdot 56 = 28\n\\]\n",
"source": "https://...",
"ready": 1,
"link": "image_001.png",
"local_path": "path/to/image_001.png",
"images": "path/to/image_001.png",
"questions": "The smaller gear makes 36 revolutions by the time the larger ones make 18.\nIf the smaller gear makes 56 revolutions, how many turns will the larger one make?\nA. 24 \nB. 26 \nC. 28 \nD. 25 \n",
"bot_answer": "C. 28",
"full_answer": "<think>\nLet $n_s$ be the number of revolutions of the smaller gear and $n_l$ be the number of revolutions of the larger gear.\nWe are given that when the smaller gear makes 36 revolutions, the larger gear makes 18 revolutions.\nThis can be written as a ratio:\n$\\frac{n_s}{n_l} = \\frac{36}{18} = 2$\nThis means that the smaller gear makes twice as many revolutions as the larger gear.\nWe can write this relationship as:\n$n_s = 2n_l$\nWe are given that the smaller gear makes 56 revolutions, so $n_s = 56$.\nWe can find the number of revolutions of the larger gear by substituting $n_s = 56$ into the equation:\n$56 = 2n_l$\n$n_l = \\frac{56}{2} = 28$\nTherefore, the larger gear will make 28 turns.\n\n</think>\n<answer>\nC. 28\n</answer>",
"think": "Let $n_s$ be the number of revolutions of the smaller gear and $n_l$ be the number of revolutions of the larger gear.\nWe are given that when the smaller gear makes 36 revolutions, the larger gear makes 18 revolutions.\nThis can be written as a ratio:\n$\\frac{n_s}{n_l} = \\frac{36}{18} = 2$\nThis means that the smaller gear makes twice as many revolutions as the larger gear.\nWe can write this relationship as:\n$n_s = 2n_l$\nWe are given that the smaller gear makes 56 revolutions, so $n_s = 56$.\nWe can find the number of revolutions of the larger gear by substituting $n_s = 56$ into the equation:\n$56 = 2n_l$\n$n_l = \\frac{56}{2} = 28$\nTherefore, the larger gear will make 28 turns.",
"model_name": "gemini/gemini-1.5-flash",
"judge_think": "The VLM's reasoning and the ground truth reasoning both arrive at the same correct answer (28), but they approach the problem from slightly different perspectives. The ground truth uses a ratio directly derived from the given data (18/36) to calculate a scaling factor (0.5), then applies that factor to the new number of revolutions (56). The VLM, on the other hand, establishes a direct relationship between the number of revolutions of the smaller and larger gears ($n_s = 2n_l$), then solves for the number of revolutions of the larger gear given the number of revolutions of the smaller gear. Both methods are logically sound and arrive at the same correct conclusion. The key steps \u2013 establishing the relationship between the gears' revolutions and using this relationship to find the unknown \u2013 are present in both. There are no discrepancies in the final answer.",
"judge_evidence": "The VLM's reasoning is logically valid. It correctly identifies the inverse proportional relationship between the number of revolutions of the smaller and larger gears. The equation $n_s = 2n_l$ accurately reflects this relationship, derived from the initial data. The subsequent steps of substituting $n_s = 56$ and solving for $n_l$ are mathematically sound and lead to the correct answer. The ground truth implicitly uses the same relationship, but expresses it through a ratio and scaling factor. Both methods are equivalent and lead to the same result.",
"judge_answer": "1",
"judge_model": "gemini/gemini-1.5-flash"
}
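To find records that still need an evaluation or judging pass, one could check which of the fields listed above are missing. This is a minimal sketch: the grouping of fields into the two tuples follows the field descriptions in this README, and treating absent, `None`, or empty values as "missing" is an assumption about how failed samples are stored. `missing_fields` is a hypothetical helper.

```python
# Fields written by the evaluation step and by the judge step,
# as described in the field list of this README.
EVAL_FIELDS = ("think", "bot_answer", "full_answer", "model_name")
JUDGE_FIELDS = ("judge_think", "judge_evidence", "judge_answer", "judge_model")


def missing_fields(record, fields):
    """Return the names in `fields` that are absent or empty in `record`."""
    return [name for name in fields if record.get(name) in (None, "")]
```

For the sample record above, `missing_fields(record, EVAL_FIELDS)` and `missing_fields(record, JUDGE_FIELDS)` would both return an empty list, since every field is filled in.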
- Gemini 2.5
- Gemini 2.0
- Claude Haiku 3.5
- Claude Sonnet 3.7
- o4 mini
- gpt-4o