A comprehensive review of code-domain benchmarks for LLM research.
- 🔥 OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics from AGI-Eval
- 🔥🔥 [2025-06-14] Featured Benchmarks:
  - 🔥 DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination from Columbia University
  - 🔥 PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models from University of Texas at Dallas
  - 🔥 ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming from Sichuan University
  - 🔥 OSS-Bench: Benchmark Generator for Coding LLMs from National University of Singapore
  - 🔥 VERINA: Benchmarking Verifiable Code Generation from University of California, Berkeley
  - 🔥 ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code from Stanford University
  - 🔥 EFFIBENCH-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code from HKU
  - 🔥 Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency from The University of Chicago
  - 🔥 Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents from MIT
  - 🔥 LongCodeBench: Evaluating Coding LLMs at 1M Context Windows from Panasonic AI Research
  - 🔥 Success is in the Details: Evaluate and Enhance Details Sensitivity of Code from Harbin Institute of Technology
  - 🔥 CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning from Iowa State University
  - 🔥 Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability from East China Normal University
  - 🔥 Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation from Nanjing University of Information Science & Technology
  - 🔥 CODEMENV: Benchmarking Large Language Models on Code Migration from Provable Responsible AI and Data Analytics (PRADA) Lab
  - 🔥 DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios from Tsinghua University
- 🔥🔥 [2025-05-30] Featured Benchmarks:
  - 🔥 VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation from UC Berkeley
- 🔥🔥 [2025-05-26] Featured Benchmarks:
  - 🔥 CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming from Tsinghua University
  - 🔥 DS-Bench: A Realistic Benchmark for Data Science Code Generation from King's College London
  - 🔥 ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming from Sichuan University
- 🔥🔥 [2025-05-20] Featured Benchmarks:
  - 🔥 Rethinking Repetition Problems of LLMs in Code Generation from Peking University
  - 🔥 WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch from Chinese University of Hong Kong
  - 🔥 OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution from Sun Yat-sen University
  - 🔥 CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation from Peking University
  - 🔥 Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware from Harbin Institute of Technology, Shenzhen
  - 🔥 CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts from Chandigarh University
  - 🔥 LongCodeBench: Evaluating Coding LLMs at 1M Context Windows from Panasonic AI Research
  - 🔥 DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios from Tsinghua University
  - 🔥 Improving Assembly Code Performance with Large Language Models via Reinforcement Learning from Stanford University
- [2025-04-18] We add Github Stars for each benchmark.
- [2025-04-13] We add Code Security & Robustness benchmarks.
- [2025-04-06] We add Code Hallucination benchmarks.
- [2025-03-29] We have crawled all articles related to code benchmarks from the past five years.
- [2025-03-17] We add Code Version (version-specific code generation) benchmarks.
- [2025-03-16] A thorough review of code domain benchmarks for LLM research has been released, covering:
  - Code Completion & Code Generation
  - Code Efficiency
  - CodeFix & Bug-Fix
  - Code Reasoning & Understanding
  - Code Hallucination
  - Data Science
  - Text2SQL
  - MultiModal Code Tasks
  - Code Security & Robustness
  - Code Translation
  - Code Version
  - Multi & Other Dimension
  - Industry Code Generation
- Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents from Xi'an Jiaotong University
- Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks from Zhejiang University
| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
| --- | --- | --- | --- | --- |
| HALLUCODE | Exploring and Evaluating Hallucinations in LLM-Powered Code Generation | Arxiv 2024/04 | | |
| Collu-Bench | Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code | Arxiv 2024/10 | | 🤗Dataset |
| CodeHalu | CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification | AAAI 2025 | Github | 🤗Dataset |
| APIHulBench | Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware | FSE 2025 | Github | |
| THINK | THINK: Tackling API Hallucinations in LLMs via Injecting Knowledge | SANER 2025 | Github | 🤗Dataset |
| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
| --- | --- | --- | --- | --- |
| DS-1000 | DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation | ICML 2023 | Github | 🤗Dataset 🏠HomePage |
| ARCADE | Natural Language to Code Generation in Interactive Data Science Notebooks | ACL 2023 | Github | Dataset |
| DA-Code | DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models | EMNLP 2024 | Github | 🤗Dataset 🌐Website |
| MatPlotBench | MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization | ACL 2024 Findings | Github | 🤗Dataset |
| DataSciBench | DataSciBench: An LLM Agent Benchmark for Data Science | Arxiv 2025/02 | Github | |
| DSBench | DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? | ICLR 2025 | Github | 🤗Dataset |
| DS-Bench | DS-Bench: A Realistic Benchmark for Data Science Code Generation | Arxiv 2025/05 | Github | |