šŸ‘Øā€šŸ’» Awesome Code Benchmark


A comprehensive review of code-domain benchmarks for LLM research.


News

OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics from AGI-Eval

  • [2025-04-18] We added GitHub stars for each benchmark.
  • [2025-04-13] We added Code Security & Robustness benchmarks.
  • [2025-04-06] We added Code Hallucination benchmarks.
  • [2025-03-29] We crawled all articles related to code benchmarks from the past five years.
  • [2025-03-17] We added Code Version (version-specific code generation) benchmarks.
  • [2025-03-16] A thorough review of code-domain benchmarks for LLM research has been released.


Table of Contents

  • Survey
  • Top Code Benchmark
      • Code Completion & Code Generation
      • Code Efficiency
      • CodeFix & Bug-Fix
      • Code Reasoning & Understanding
      • Code Hallucination
      • Data Science
      • Text2SQL
      • MultiModal Code Tasks
      • Code Security & Robustness
      • Code Translation
      • Code Version
      • Multi & Other Dimension
      • Industry Code Generation

Survey

  1. Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents, from Xi’an Jiaotong University

  2. Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks, from Zhejiang University

šŸš€ Top Code Benchmark

Code Completion & Code Generation

Benchmark Paper Date Github Dataset & Website & LeaderBoard
HumanEval Evaluating Large Language Models Trained on Code Arxiv 2021/07 Github Stars šŸ¤—Dataset
MBPP Program Synthesis with Large Language Models Arxiv 2021/08 Github Stars šŸ¤—Dataset
DyCodeEval DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination ICML 2025 Github Stars šŸ¤—Dataset
PPM PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models FSE 2024 Github Stars šŸ¤—Dataset
APPS Measuring Coding Challenge Competence With APPS NeurIPS 2021 Github Stars šŸ¤—Dataset
CodeContests Competition-Level Code Generation with AlphaCode Science 2022 Github Stars Dataset
MultiPL-E MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation TSE 2023 Github Stars šŸ¤—Dataset
MCoNaLa MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages EACL 2023 Findings Github Stars šŸ¤—Dataset
LCC LongCoder: A Long-Range Pre-trained Language Model for Code Completion ICML 2023 Github Dataset
CodeClarQA Python Code Generation by Asking Clarification Questions ACL 2023 Github Stars Dataset
EvalPlus Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation NeurIPS 2023 Github Stars šŸ¤—Dataset šŸ“ŠLeaderBoard
CrossCodeEval CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion NeurIPS 2023 Github Stars Dataset
ODEX Execution-Based Evaluation for Open-Domain Code Generation EMNLP 2023 Findings Github Stars Dataset
HumanEval-X CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X SIGKDD 2023 Github Stars Dataset
ML-Bench ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code Arxiv 2023/11 Github Stars šŸ¤—Dataset 🌐Website
RepoBench RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems ICLR 2024 Github Stars šŸ¤—Dataset
CatCoder Enhancing Repository-Level Code Generation with Integrated Contextual Information Arxiv 2024/06
StudentEval StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code ACL 2024 Findings GithubStars šŸ¤—Dataset
DevEval DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories ACL 2024 Github Stars šŸ¤—Dataset
CoderEval CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models ICSE 2024 Github Stars
ConCodeEval ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages Arxiv 2024/07
CodeScope CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation ACL 2024 GithubStars šŸ“ŠLeaderBoard šŸ¤—Dataset
OOP OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models ACL 2024 Findings Github Stars šŸ¤—Dataset
L2CEval L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models TACL 2024
HumanExtension Exploring Language Model's Code Generation Ability with Auxiliary Functions NAACL 2024 Findings Github Stars šŸ¤—Dataset
LLM4Decompile LLM4Decompile: Decompiling Binary Code with Large Language Models EMNLP 2024 GithubStars šŸ¤—Dataset
PYCOMMITS Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing ICLR 2024 Github Stars Dataset
CodeAgentBench CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges ACL 2024
SAFIM Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks ICML 2024 GithubStars šŸ¤—Dataset
BigCodeBench BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions ICLR 2025 Github Stars šŸ¤—Dataset šŸ“ŠLeaderBoard
EvoCodeBench EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories NeurIPS 2024 Github Stars šŸ¤—Dataset
DynaCode DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation Arxiv 2025/03
A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs EASE 2025
LeetCodeDataset LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs Arxiv 2025/04 GithubStars šŸ¤—Dataset
CodeFlowBench CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation Arxiv 2025/04 GithubStars šŸ¤—Dataset
CodeMixBench CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts Arxiv 2025/05 šŸ¤—Dataset
CPRet CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming Arxiv 2025/05 GithubStars
ELABORATION ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming ACL 2025 GithubStars
OSS-Bench OSS-Bench: Benchmark Generator for Coding LLMs Arxiv 2025/05 Github Stars šŸ¤—Dataset šŸ“ŠLeaderBoard
VERINA VERINA: Benchmarking Verifiable Code Generation Arxiv 2025/05 Github Stars šŸ¤—Dataset
OIBench OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics Arxiv 2025/06 šŸ¤—Dataset
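
Most generation benchmarks in this table (HumanEval, MBPP, EvalPlus, BigCodeBench, ...) report pass@k: the probability that at least one of k sampled completions passes all unit tests. A minimal sketch of the unbiased estimator from the HumanEval paper, where n samples are drawn per task and c of them pass:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), for n generated samples of
    which c pass all unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples for a task, 37 passed -> estimated pass@10
print(pass_at_k(n=200, c=37, k=10))
```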

Code Efficiency

Benchmark Paper Date Github Dataset & Website & LeaderBoard
EvalPerf Evaluating Language Models for Efficient Code Generation COLM 2024 Github Stars šŸ¤—Dataset 🌐Website
EffiBench EffiBench: Benchmarking the Efficiency of Automatically Generated Code NeurIPS 2024 Github Stars
Mercury Mercury: A Code Efficiency Benchmark for Code Large Language Models NeurIPS 2024 Github Stars šŸ¤—Dataset
ECCO ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? EMNLP 2024 Github Stars šŸ¤—Dataset
PIE Learning Performance-Improving Code Edits ICLR 2024 Github Stars 🌐Website
ENAMEL How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark ICLR 2025 Github Stars šŸ¤—Dataset
Improving Assembly Code Performance with Large Language Models via Reinforcement Learning Arxiv 2025/05
EFFIBENCH-X EFFIBENCH-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code Arxiv 2025/05 Github Stars šŸ¤—Dataset
PERFFORGE Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency Arxiv 2025/05
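
The harnesses above differ in what they measure (wall-clock time, instruction counts, speedup over a reference solution), but the shared idea is comparing functionally correct solutions on efficiency rather than correctness alone. A toy illustration, not tied to any one benchmark's harness:

```python
import timeit

# Two functionally equivalent solutions to the same task; efficiency
# benchmarks only time solutions that already pass the unit tests.
def sum_squares_loop(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total

def sum_squares_formula(n: int) -> int:
    m = n - 1  # closed form of sum(i*i for i in range(n))
    return m * (m + 1) * (2 * m + 1) // 6

assert sum_squares_loop(10_000) == sum_squares_formula(10_000)
for fn in (sum_squares_loop, sum_squares_formula):
    t = timeit.timeit(lambda f=fn: f(10_000), number=100)
    print(f"{fn.__name__}: {t:.4f}s")
```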

CodeFix & Bug-Fix

Benchmark Paper Date Github Dataset & Website & LeaderBoard
Buggy-HumanEval&Buggy-FixEval Large Language Models of Code Fail at Completing Code with Potential Bugs NeurIPS 2023 GithubStars Dataset
SWT-Bench SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents NeurIPS 2024 Github Stars 🌐Website
HumanEvalPack OctoPack: Instruction Tuning Code Large Language Models ICLR 2024 GithubStars šŸ¤—Dataset
SWE-bench SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024 Github Stars 🌐Website
GitBug-Java GitBug-Java: A Reproducible Benchmark of Recent Java Bugs MSR 2024 GithubStars šŸ¤—Dataset 🌐Website
GitBug-Actions GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions ICSE 2024 Demo Github Stars ā–¶ļøVideo
RepoBugs When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done? ICSE 2024 Industry Track
RepoFixEval RepoFixEval: A Repository-Level Program Repair Benchmark From Issue Discovering to Bug Fixing OpenReview 2024 Link
DebugBench DebugBench: Evaluating Debugging Capability of Large Language Models ACL 2024 Github Stars šŸ¤—Dataset
Multi-Bug Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging EMNLP 2024 Findings Github Stars
Coffee-Gym Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code EMNLP 2024 šŸ¤—Dataset
INTERVENOR INTERVENOR: Prompt the Coding Ability of Large Language Models with the Interactive Chain of Repairing ACL 2024 Findings Github Stars
StatType-SO ZS4C: Zero-Shot Synthesis of Compilable Code for Incomplete Code Snippets using LLMs TOSEM 2024
LiveCodeBench LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code ICLR 2025 Github Stars šŸ¤—Dataset 🌐Website šŸ“ŠLeaderBoard
COAST COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis NAACL 2025 GithubStars šŸ¤—Dataset
SWE-bench Multimodal SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? ICLR 2025 Github Stars šŸ¤—Dataset 🌐Website
FeedbackEval FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks GithubStars
CVE-Bench CVE-Bench: Benchmarking LLM-based Software Engineering Agent’s Ability to Repair Real-World CVE Vulnerabilities NAACL 2025 GithubStars Dataset
OmniGIRL OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution ISSTA 2025 GithubStars šŸ¤—Dataset šŸ“ŠLeaderBoard
LongSWE-Bench LongCodeBench: Evaluating Coding LLMs at 1M Context Windows Arxiv 2025/05 šŸ¤—Dataset
VADER VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation Arxiv 2025/06 GithubStars
Breakpoint Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents Arxiv 2025/05
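
The issue-resolution benchmarks in this table (SWE-bench and its variants) share an execution-based protocol: apply the model-generated patch to the repository at a pinned commit, then rerun the tests that reproduced the bug. A simplified sketch of that check; the real harness additionally pins per-instance environments and distinguishes FAIL_TO_PASS from PASS_TO_PASS tests:

```python
import subprocess

def patch_resolves_issue(repo_dir: str, patch_file: str,
                         test_cmd: list[str]) -> bool:
    """Apply a model-generated patch, then rerun the previously
    failing tests; the issue counts as resolved only if they pass."""
    applied = subprocess.run(["git", "-C", repo_dir, "apply", patch_file],
                             capture_output=True)
    if applied.returncode != 0:
        return False  # malformed patch: automatic failure
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```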

Code Reasoning & Understanding

Benchmark Paper Date Github Dataset & Website & LeaderBoard
GenCodeSearchNet GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding EMNLP 2023 GithubStars šŸ¤—Dataset
CRUXEval CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution Arxiv 2024/01 Github Stars šŸ“ŠLeaderBoard
Poor-CodeSumEval How Effectively Do Code Language Models Understand Poor-Readability Code? ASE 2024 Github Stars šŸ¤—Dataset
CodeScope CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation ACL 2024 GithubStars šŸ“ŠLeaderBoard šŸ¤—Dataset
CodeJudge-Eval CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? COLING 2025 Github Stars
CodeMMLU CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs ICLR 2025 Github Stars šŸ¤—Dataset 🌐Website šŸ“ŠLeaderBoard
LongCodeQA LongCodeBench: Evaluating Coding LLMs at 1M Context Windows Arxiv 2025/05 šŸ¤—Dataset
CTF-Code Success is in the Details: Evaluate and Enhance Details Sensitivity of Code Arxiv 2025/05
CodeSense CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning Arxiv 2025/06 Github Stars šŸ¤—Dataset šŸ“ŠLeaderBoard
CETBench CETBench: A Novel Dataset constructed via Transformations over Arxiv 2025/06
ICPC-Eval ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests Arxiv 2025/06 GithubStars šŸ¤—Dataset
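
Benchmarks like CRUXEval above test whether a model can predict what code does rather than write it: given a function and an input, the model predicts the output, which is checked against actual execution. A minimal sketch of that check (eval on untrusted strings is for illustration only; real harnesses sandbox execution):

```python
def output_prediction_correct(code: str, call: str, predicted: str) -> bool:
    """Execute the reference function on the given input and compare
    the result with the model's predicted output."""
    scope: dict = {}
    exec(code, scope)            # define the function in a fresh namespace
    actual = eval(call, scope)   # e.g. call = "f([1, 2])"
    return actual == eval(predicted, {})

code = "def f(xs):\n    return [x * 2 for x in xs]"
print(output_prediction_correct(code, "f([1, 2])", "[2, 4]"))  # True
```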

Code Hallucination

Benchmark Paper Date Github Dataset & Website & LeaderBoard
HALLUCODE Exploring and Evaluating Hallucinations in LLM-Powered Code Generation Arxiv 2024/04
Collu-Bench Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code Arxiv 2024/10 šŸ¤—Dataset
CodeHalu CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification AAAI 2025 GithubStars šŸ¤—Dataset
APIHulBench Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware FSE 2025 GithubStars
THINK THINK: Tackling API Hallucinations in LLMs via Injecting Knowledge SANER 2025 GithubStars šŸ¤—Dataset
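
A recurring check behind these hallucination benchmarks is whether an API the model calls actually exists in the installed library. A toy version of that verification (the benchmarks above go further, e.g. execution-based verification in CodeHalu and dependency-aware detection in APIHulBench):

```python
import importlib

def api_exists(module_name: str, attr_path: str) -> bool:
    """Check that a fully qualified attribute the model invoked
    really exists in the installed library."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True

print(api_exists("os.path", "join"))    # True
print(api_exists("os.path", "concat"))  # False: a hallucinated API
```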

Data Science

Benchmark Paper Date Github Dataset & Website & LeaderBoard
DS-1000 DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation ICML 2023 GithubStars šŸ¤—Dataset 🌐HomePage
ARCADE Natural Language to Code Generation in Interactive Data Science Notebooks ACL 2023 Github Stars Dataset
DA-Code DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models EMNLP 2024 GithubStars šŸ¤—Dataset 🌐Website
MatPlotBench MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization ACL 2024 Findings GithubStars šŸ¤—Dataset
DataSciBench DataSciBench: An LLM Agent Benchmark for Data Science Arxiv 2025/02 GithubStars
DSBench DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? ICLR 2025 GithubStars šŸ¤—Dataset
DS-Bench DS-Bench: A Realistic Benchmark for Data Science Code Generation Arxiv 2025/05 GithubStars

Text2SQL

Benchmark Paper Date Github Dataset & Website & LeaderBoard
Spider Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task EMNLP 2018 GithubStars 🌐Website
SParC SParC: Cross-Domain Semantic Parsing in Context ACL 2019 Github Stars 🌐Website
CoSQL CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases EMNLP 2019 Github Stars 🌐Website
Spider-DK Exploring underexplored limitations of crossdomain text-to-sql generalization EMNLP 2021 Github Stars
Spider-Syn Towards robustness of text-to-SQL models against synonym substitution ACL 2021 Github Stars
Spider-Realistic Structure-Grounded Pretraining for Text-to-SQL NAACL 2021 Dataset
BIRD Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs NeurIPS 2023 Github Stars 🌐Website
Dr.Spider Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness ICLR 2023 GithubStars
BookSQL BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain NAACL 2024 Github Stars Dataset
Archer Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning EACL 2024 🌐Website
SecureSQL SecureSQL: Evaluating Data Leakage of Large Language Models as Natural Language Interfaces to Databases EMNLP 2024 Findings Github Stars Dataset
Spider 2.0 Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows ICLR 2025 Github Stars 🌐Website
SNAILS SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference PACMMOD 2025 GithubStars
SQL2Text Semantic Captioning: Benchmark Dataset and Graph-Aware Few-Shot In-Context Learning for SQL2Text COLING 2025 GithubStars Dataset
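
Most of the text-to-SQL benchmarks above are scored by execution accuracy: the predicted query is correct if it returns the same result set as the gold query on the target database. A minimal sqlite3 sketch; the official Spider/BIRD evaluators add value normalization, timeouts, and (for Spider) test-suite databases to guard against spurious matches:

```python
import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """Order-insensitive comparison of the two queries' result sets."""
    conn = sqlite3.connect(db_path)
    try:
        gold = conn.execute(gold_sql).fetchall()
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # prediction failed to parse or execute
    finally:
        conn.close()
    return sorted(map(repr, gold)) == sorted(map(repr, pred))
```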

MultiModal Code Tasks

Benchmark Paper Date Github Dataset & Website & LeaderBoard
MMCode MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems EMNLP 2024 GithubStars šŸ¤—Dataset
Drawing Pandas Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code Arxiv 2024/12 GithubStars šŸ¤—Dataset
Web2Code Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs NeurIPS 2024 GithubStars šŸ¤—Dataset 🌐Website
VGBench VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation EMNLP 2024 Github Stars šŸ¤—Dataset
SVGEditBench SVGEditBench: A Benchmark Dataset for Quantitative Assessment of LLM's SVG Editing Capabilities CVPR2024 workshop Github Stars šŸ¤—Dataset
Plot2Code Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots Arxiv 2024/05 GithubStars šŸ¤—Dataset
HumanEval-V HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks Arxiv 2024/10 GithubStars 🌐Website šŸ“ŠLeaderBoard šŸ¤—Dataset
WebSight-Test WAFFLE: Multi-Modal Model for Automated Front-End Development Arxiv 2024/10 GithubStars šŸ¤—Dataset
Sketch2Code Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping Arxiv 2024/10 GithubStars 🌐Website
Interaction2Code Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping Arxiv 2024/11 GithubStars šŸ¤—Dataset šŸ“ŠLeaderBoard
ScratchEval ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges Arxiv 2024/11 GithubStars šŸ¤—Dataset
MRWeb MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs Arxiv 2024/12 GithubStars šŸ¤—Dataset
Image2Struct Image2Struct: Benchmarking Structure Extraction for Vision-Language Models NeurIPS 2024 GithubStars 🌐Website šŸ¤—Dataset
BigDocs-Bench BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks ICLR 2025 šŸ¤—Dataset 🌐Website
WebCode2M WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs WWW 2025 Github 🌐Website šŸ¤—Dataset
Design2Code Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering NAACL 2025 GithubStars šŸ¤—Dataset
DiagramGenBenchmark From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing CVPR 2025 GithubStars 🌐Website šŸ¤—Dataset
ChartMimic ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation ICLR 2025 GithubStars 🌐Website šŸ¤—Dataset
SVG-Bench StarVector: Generating Scalable Vector Graphics Code from Images and Text CVPR 2025 Github Stars 🌐Website šŸ¤—Dataset
LLM4SVG Empowering LLMs to Understand and Generate Complex Vector Graphics CVPR 2025 GithubStars 🌐Website
ChartCoder ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation Arxiv 2025/01 Github Stars šŸ¤—Dataset
Code-Vision Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities Arxiv 2025/02
Flame-React-Eval Advancing vision-language models in front-end development via data synthesis Arxiv 2025/03 Github šŸ¤—Dataset
vTikZ LLM Code Customization with Visual Results: A Benchmark on TikZ EASE 2025
Plot2XML Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation Arxiv 2025/04
Flow2Code Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability Arxiv 2025/06 Github
DesignBench DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation Arxiv 2025/06 GithubStars šŸ¤—Dataset
WebUIBench WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Arxiv 2025/06 GithubStars šŸ¤—Dataset

Code Security & Robustness

Benchmark Paper Date Github Dataset & Website & LeaderBoard
COCO COCO: Testing Code Generation Systems via Concretized Instructions Arxiv 2023/08 Github Stars
ReCode ReCode: Robustness Evaluation of Code Generation Models ACL 2023 Github Stars Dataset
RedCode RedCode: Risky Code Execution and Generation Benchmark for Code Agents NeurIPS 2024 Github Stars 🌐Website šŸ“ŠLeaderBoard
CodeWMBench CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation ACM-TURC 2024 Github Stars
RMCBench RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code ASE 2024 Github Stars šŸ¤—Dataset
PyP4LLMSec Benchmarking the Security Aspect of Large Language Model-Based Code Generation ICSE 2024 Github Stars Dataset
CWE-Bench-Java IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities Arxiv 2024/05 Github
CyberSecEval 3 CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models Arxiv 2024/08 Github Dataset
CS-Eval CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity Arxiv 2024/11 Github Stars šŸ¤—Dataset
SecBench SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity Arxiv 2024/12 Dataset 🌐Website
aiXamine aiXamine: Simplified LLM Safety and Security Arxiv 2025/04 🌐Website
SafeGenBench SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code Arxiv 2025/06

Code Translation

Benchmark Paper Date Github Dataset & Website & LeaderBoard
TransCoder Unsupervised Translation of Programming Languages NeurIPS 2020 Github (deprecated) Github (new) Stars Dataset
AVATAR AVATAR: A Parallel Corpus for Java-Python Program Translation ACL 2023 Findings Github Stars Dataset
G-TransEval On the Evaluation of Neural Code Translation: Taxonomy and Benchmark ASE 2023 Github Stars šŸ¤—Dataset
CodeTransOcean CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation EMNLP 2023 Github Stars šŸ¤—Dataset
xCodeEval XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval ACL 2024 GithubStars šŸ¤—Dataset
PolyHumanEval Unraveling the Potential of Large Language Models in Code Translation: How Far Are We? APSEC 2024 GithubStars šŸ¤—Dataset
RustRepoTrans Repository-level Code Translation Benchmark Targeting Rust Arxiv 2024/11 Github Stars šŸ¤—Dataset
ClassEval-T Escalating LLM-based Code Translation Benchmarking into the Class-level Era Arxiv 2024/11 GithubStars šŸ¤—Dataset
TRANSREPO-BENCH Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation Arxiv 2025/01 Github Stars šŸ¤—Dataset
LongTrans Enhancing LLMs in Long Code Translation through Instrumentation and Program State Alignment Arxiv 2025/04
CRUST-Bench CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation Arxiv 2025/04 GithubStars Dataset
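
Execution-based translation benchmarks (from TransCoder's "computational accuracy" onward) judge a translation by behavior rather than textual similarity: the translated program must produce the same outputs as the source on a shared set of test inputs. A simplified sketch, assuming both programs are already compiled to runnable commands:

```python
import subprocess

def io_equivalent(src_cmd: list[str], dst_cmd: list[str],
                  test_inputs: list[str]) -> bool:
    """The translated program must match the source program's stdout
    on every test input."""
    for stdin in test_inputs:
        src = subprocess.run(src_cmd, input=stdin,
                             capture_output=True, text=True)
        dst = subprocess.run(dst_cmd, input=stdin,
                             capture_output=True, text=True)
        if src.stdout != dst.stdout:
            return False
    return True
```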

Code Version

Benchmark Paper Date Github Dataset & Website & LeaderBoard
CodeUpdateEval Automatically Recommend Code Updates: Are We There Yet? TOSEM 2024 Github Stars šŸ¤—Dataset
JavaVersionGenBench On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions ICPC 2024 GithubStars šŸ¤—Dataset
VersiCode VersiCode: Towards Version-controllable Code Generation Arxiv 2024/10 Github Stars 🌐Website šŸ¤—Dataset
GitChameleon GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models Arxiv 2024/11 Github Stars šŸ¤—Dataset
LLM-Deprecated-API LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion ICSE 2025 Github Stars šŸ¤—Dataset
LibEvolutionEval LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation NAACL 2025 🌐Website
CodeUpdateArena CodeUpdateArena: Benchmarking Knowledge Editing on API Updates Arxiv 2025/02 Github Stars šŸ¤—Dataset
RustEvo2 RustEvo2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation Arxiv 2025/03 Github Stars šŸ¤—Dataset
CODEMENV CODEMENV: Benchmarking Large Language Models on Code Migration Arxiv 2025/06 GithubStars šŸ¤—Dataset
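
What these benchmarks have in common is conditioning evaluation on a pinned library version: code that is idiomatic for one release may call APIs that are deprecated or absent in another. A rough sketch of such a harness, assuming installation into the current environment (real benchmarks like VersiCode and GitChameleon isolate one environment per version):

```python
import importlib.metadata
import subprocess
import sys

def runs_under_pinned_version(package: str, pin: str, snippet: str) -> bool:
    """Install the library version named in the prompt, then check
    that the generated snippet executes against exactly that version."""
    subprocess.run([sys.executable, "-m", "pip", "install",
                    f"{package}=={pin}"], capture_output=True)
    if importlib.metadata.version(package) != pin:
        return False
    result = subprocess.run([sys.executable, "-c", snippet],
                            capture_output=True)
    return result.returncode == 0
```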

Multi & Other Dimension

Benchmark Paper Date Github Dataset & Website & LeaderBoard
Stack-Repo RepoFusion: Training Code Models to Understand Your Repository Arxiv 2023/06 GithubStars šŸ¤—Dataset
MultiNL-H Improving Natural Language Capability of Code Large Language Model Arxiv 2024/01 GithubStars
HumanEvalPack OctoPack: Instruction Tuning Code Large Language Models ICLR 2024 GithubStars šŸ¤—Dataset
CodeBenchGen CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks Arxiv 2024/04 GithubStars Dataset
X-HumanEval-X Exploring Multi-Lingual Bias of Large Code Models in Code Generation Arxiv 2024/04
RACE Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models Arxiv 2024/07 GithubStars šŸ“ŠLeaderBoard
RealWorld-Bench What's Wrong with Your Code Generated by Large Language Models? An Extensive Study Arxiv 2024/07
APPS+ StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback ACL 2024 GithubStars Dataset
InfiBench InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models NeurIPS 2024 GithubStars 🌐Website
RobustAPI Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation AAAI 2024 GithubStars šŸ¤—Dataset
EvoEval Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM COLM 2024 Github Stars
CodeScope CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation ACL 2024 GithubStars šŸ“ŠLeaderBoard šŸ¤—Dataset
AssertionBench AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation NAACL 2025 GithubStars
REval Evaluating Large Language Models with Runtime Behavior of Program Execution ICSE 2025 GithubStars šŸ“ŠLeaderBoard
LiveCodeBench LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code ICLR 2025 Github Stars šŸ¤—Dataset 🌐Website šŸ“ŠLeaderBoard
SWE-PolyBench SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents Arxiv 2025/04 Github Stars 🌐Website šŸ¤—Dataset
Paper2Code Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning Arxiv 2025/04 GithubStars šŸ¤—Dataset
LiCoEval LiCoEval: Evaluating LLMs on License Compliance in Code Generation ICSE 2025 GithubStars Dataset
CoCo-Bench CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation Arxiv 2025/04
CodeRepetEval Rethinking Repetition Problems of LLMs in Code Generation ACL 2025 GithubStars
WebGen-Bench WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch Arxiv 2025/03 GithubStars šŸ¤—Dataset
DecompileBench DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios Arxiv 2025/05 GithubStars
CLEVER CLEVER: A Curated Benchmark for Formally Verified Arxiv 2025/05 GithubStars šŸ¤—Dataset
ResearchCodeBench ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code Arxiv 2025/06

Industry Code Generation

Benchmark Paper Date Github Dataset & Website & LeaderBoard
VerilogEval VerilogEval: Evaluating Large Language Models for Verilog Code Generation ICCAD 2023 GithubStars šŸ¤—Dataset
VGen Benchmarking Large Language Models for Automated Verilog RTL Code Generation DATE 2023 GithubStars šŸ¤—Dataset
RTLLM RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model ASPDAC 2024 GithubStars šŸ¤—Dataset
LLM4PLC LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in Industrial Control Systems ICSE 2024 GithubStars 🌐Website
Agents4PLC Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents Arxiv 2024/10 GithubStars šŸ¤—Dataset
A Multi-Agent Framework for Extensible Structured Text Generation in PLCs Arxiv 2024/12
OpenLLM-RTL OpenLLM-RTL: Open Dataset and Benchmark for LLM-Aided Design RTL Generation ICCAD 2024 GithubStars šŸ¤—Dataset
MG-Verilog MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation ISLAD 2024 GithubStars
RTL-Repo RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects LAD 2024 GithubStars šŸ¤—Dataset
MetRex MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs ASPDAC 2025 GithubStars šŸ¤—Dataset
ComplexVCoder ComplexVCoder: An LLM-Driven Framework for Systematic Generation of Complex Verilog Code Arxiv 2025/04
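
The RTL benchmarks in this table (VerilogEval, RTLLM, OpenLLM-RTL) ultimately judge generated hardware code by simulation against a golden testbench rather than by text match. A hedged sketch of that check using Icarus Verilog; the `iverilog`/`vvp` toolchain and the testbench's "PASS" convention are assumptions here, and the real harnesses are considerably more elaborate:

```python
import pathlib
import subprocess
import tempfile

def design_passes_testbench(design_src: str, testbench_src: str) -> bool:
    """Compile the candidate design together with a golden testbench
    and run the simulation; requires `iverilog` and `vvp` on PATH."""
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp = pathlib.Path(tmpdir)
        (tmp / "design.v").write_text(design_src)
        (tmp / "tb.v").write_text(testbench_src)
        sim = tmp / "sim.out"
        build = subprocess.run(
            ["iverilog", "-o", str(sim),
             str(tmp / "design.v"), str(tmp / "tb.v")],
            capture_output=True)
        if build.returncode != 0:
            return False  # generated RTL does not even compile
        run = subprocess.run(["vvp", str(sim)],
                             capture_output=True, text=True)
        # Assumed convention: the testbench prints "PASS" on success.
        return run.returncode == 0 and "PASS" in run.stdout
```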