The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This tutorial provides a systematic survey of the field of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along:
- Evaluation objectives (what to evaluate): agent behavior, capabilities, reliability, safety
- Evaluation process (how to evaluate): interaction modes, datasets and benchmarks, metric computation methods, and tooling
In addition, we highlight enterprise-specific challenges, such as role-based access, the need for reliability guarantees, dynamic and long-horizon interactions, and compliance. Finally, we discuss future research directions toward holistic, more realistic, and scalable evaluation of LLM agents.
You can read the pre-print here.
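To make the two dimensions concrete, the snippet below sketches the taxonomy as plain Python data. The keys simply mirror the categories listed above; it is illustrative only, not an artifact of the paper or the tutorial code.

```python
# Minimal sketch: the tutorial's two-dimensional taxonomy as plain Python data.
# Names mirror the categories listed above; illustrative only, not code from
# the pre-print or the tutorial materials.
TAXONOMY = {
    "objectives": [  # what to evaluate
        "agent behavior",
        "agent capabilities",
        "reliability",
        "safety",
    ],
    "process": [  # how to evaluate
        "interaction modes",
        "datasets and benchmarks",
        "metric computation methods",
        "tooling",
    ],
}
```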
This tutorial is designed for applied and industry data scientists, machine learning engineers, and enterprise AI practitioners who build or deploy LLM-based agents in production systems. It is also relevant for academic researchers studying evaluation methodologies, multi-agent systems, and trustworthy language models. Participants will gain a systematic evaluation framework, practical hands-on code examples, and insights into real-world deployment challenges.
- Introduction (5 min)
  - Motivation and tutorial goals
- Taxonomy Overview (5 min)
  - What to evaluate and how to evaluate
- Evaluation Process (25 min)
  - Interaction modes
  - Evaluation data
  - Metric computation methods (a minimal sketch follows this outline)
  - Evaluation tooling
  - Evaluation contexts
- Evaluation Objectives (90 min)
  - Agent Behavior
  - Agent Capabilities
  - Reliability
  - Safety & Alignment
- Enterprise-Specific Challenges (20 min)
  - Access control
  - Reliability guarantees
  - Dynamic & long-horizon interactions
  - Policy and compliance
- Future Directions (5 min)
  - Holistic frameworks
  - Scalable evaluation
  - Realistic enterprise settings
  - Time- & cost-bounded protocols
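As a taste of the hands-on portion, the short sketch below illustrates one common metric-computation pattern covered in the Evaluation Process segment: scoring recorded agent runs with simple rule-based checks (exact-match task success and tool-call counts). The `AgentRun` schema and function names are hypothetical placeholders for illustration, not code from the pre-print or the tutorial repository.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    """One recorded agent run (hypothetical schema, for illustration only)."""
    task_id: str
    final_answer: str      # what the agent produced
    expected_answer: str   # gold label from the benchmark
    num_tool_calls: int    # length of the tool-use trajectory

def task_success_rate(runs: list[AgentRun]) -> float:
    """Rule-based metric: fraction of runs whose final answer matches the gold label."""
    if not runs:
        return 0.0
    correct = sum(
        r.final_answer.strip().lower() == r.expected_answer.strip().lower()
        for r in runs
    )
    return correct / len(runs)

def mean_tool_calls(runs: list[AgentRun]) -> float:
    """Efficiency-style metric: average number of tool calls per run."""
    return sum(r.num_tool_calls for r in runs) / len(runs) if runs else 0.0

# Toy usage example
runs = [
    AgentRun("t1", "42", "42", num_tool_calls=3),
    AgentRun("t2", "Paris", "paris", num_tool_calls=1),
    AgentRun("t3", "blue", "red", num_tool_calls=5),
]
print(f"task success rate: {task_success_rate(runs):.2f}")  # 0.67
print(f"mean tool calls:   {mean_tool_calls(runs):.2f}")    # 3.00
```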
- Mahmoud Mohammadi — SAP Labs
  Mahmoud is a Senior AI Scientist at SAP, where his research focuses on business foundation models and agentic AI, including graph foundation models, LLM integration, and the evaluation of intelligent agents. He also has expertise in Generative Adversarial Networks (GANs) and multimodal AI systems. Previously, Mahmoud worked at Microsoft, where he contributed to developing client-side deep learning models for Windows. He holds a Ph.D. in Computer Science from the University of North Carolina at Charlotte.
  [email protected]
- Yipeng Li — SAP Labs, Bellevue, WA, USA
  Yipeng Li is a Data Scientist Expert at SAP, leading research and development in agentic AI. His work focuses on single- and multi-agent systems, quality assessment, and enabling internal AI research and development through common platforms.
  [email protected]
- Jane Lo — SAP Labs, Palo Alto, CA, USA
  Jane Lo is an AI Scientist at SAP, focusing on the research and development of agentic AI. She has worked on several projects in the field, centered on integrating enterprise tools, data, and private knowledge with agentic systems across a wide range of conversational and autonomous use cases.
  [email protected]
- Wendy Yip — SAP Labs, Palo Alto, CA, USA
  Wendy is a Senior Data Scientist at SAP. She has a background in astrophysics and previously conducted machine learning and data-intensive science research at Johns Hopkins University.
  [email protected]
Copyright (c) 2025 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the LICENSE file.