The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This tutorial provides a systematic survey of the field of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along:
- Evaluation objectives (what to evaluate): agent behavior, capabilities, reliability, safety
- Evaluation process (how to evaluate): interaction modes, datasets and benchmarks, metric computation methods, and tooling
In addition, we highlight enterprise-specific challenges, such as role-based access, the need for reliability guarantees, dynamic and long-horizon interactions, and compliance. Finally, we discuss future research directions toward holistic, more realistic, and scalable evaluation of LLM agents.
You can read the pre-print here.
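To make the two dimensions concrete, the snippet below sketches the taxonomy as plain Python data. The keys simply mirror the categories listed above; it is illustrative only, not an artifact of the paper or the tutorial code.

```python
# Minimal sketch: the tutorial's two-dimensional taxonomy as plain Python data.
# Names mirror the categories listed above; illustrative only, not code from
# the pre-print or the tutorial materials.
TAXONOMY = {
    "objectives": [  # what to evaluate
        "agent behavior",
        "agent capabilities",
        "reliability",
        "safety",
    ],
    "process": [  # how to evaluate
        "interaction modes",
        "datasets and benchmarks",
        "metric computation methods",
        "tooling",
    ],
}
```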
This tutorial is designed for applied and industry data scientists, machine learning engineers, and enterprise AI practitioners who build or deploy LLM-based agents in production systems. It is also relevant for academic researchers studying evaluation methodologies, multi-agent systems, and trustworthy language models. Participants will gain a systematic evaluation framework, practical hands-on code examples, and insights into real-world deployment challenges.
- Introduction (5 min)
  - Motivation and tutorial goals
- Taxonomy Overview (5 min)
  - What to evaluate and how to evaluate
- Evaluation Process (25 min)
  - Interaction modes
  - Evaluation data
  - Metric computation methods (a minimal sketch follows this outline)
  - Evaluation tooling
  - Evaluation contexts
- Evaluation Objectives (90 min)
  - Agent Behavior
  - Agent Capabilities
  - Reliability
  - Safety & Alignment
- Enterprise-Specific Challenges (20 min)
  - Access control
  - Reliability guarantees
  - Dynamic & long-horizon interactions
  - Policy and compliance
- Future Directions (5 min)
  - Holistic frameworks
  - Scalable evaluation
  - Realistic enterprise settings
  - Time- & cost-bounded protocols
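As a taste of the hands-on portion, the short sketch below illustrates one common metric-computation pattern covered in the Evaluation Process segment: scoring recorded agent runs with simple rule-based checks (exact-match task success and tool-call counts). The `AgentRun` schema and function names are hypothetical placeholders for illustration, not code from the pre-print or the tutorial repository.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    """One recorded agent run (hypothetical schema, for illustration only)."""
    task_id: str
    final_answer: str      # what the agent produced
    expected_answer: str   # gold label from the benchmark
    num_tool_calls: int    # length of the tool-use trajectory

def task_success_rate(runs: list[AgentRun]) -> float:
    """Rule-based metric: fraction of runs whose final answer matches the gold label."""
    if not runs:
        return 0.0
    correct = sum(
        r.final_answer.strip().lower() == r.expected_answer.strip().lower()
        for r in runs
    )
    return correct / len(runs)

def mean_tool_calls(runs: list[AgentRun]) -> float:
    """Efficiency-style metric: average number of tool calls per run."""
    return sum(r.num_tool_calls for r in runs) / len(runs) if runs else 0.0

# Toy usage example
runs = [
    AgentRun("t1", "42", "42", num_tool_calls=3),
    AgentRun("t2", "Paris", "paris", num_tool_calls=1),
    AgentRun("t3", "blue", "red", num_tool_calls=5),
]
print(f"task success rate: {task_success_rate(runs):.2f}")  # 0.67
print(f"mean tool calls:   {mean_tool_calls(runs):.2f}")    # 3.00
```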
- Mahmoud Mohammadi — SAP Labs
  Mahmoud is a Senior AI Scientist at SAP, where his research focuses on business foundation models and agentic AI, including graph foundation models, LLM integration, and the evaluation of intelligent agents. He also has expertise in Generative Adversarial Networks (GANs) and multimodal AI systems. Previously, Mahmoud worked at Microsoft, where he contributed to developing client-side deep learning models for Windows. He holds a Ph.D. in Computer Science from the University of North Carolina at Charlotte.
  [email protected]
- Yipeng Li — SAP Labs, Bellevue, WA, USA
  Yipeng Li is a Data Scientist Expert at SAP, leading research and development in agentic AI. His work focuses on single- and multi-agent systems, quality assessment, and enabling internal AI research and development through common platforms.
  [email protected]
- Jane Lo — SAP Labs, Palo Alto, CA, USA
  Jane Lo is an AI Scientist at SAP, focusing on the research and development of agentic AI. She has worked on several projects in the field, centered on integrating enterprise tools, data, and private knowledge with agentic systems across a wide range of conversational and autonomous use cases.
  [email protected]
- Wendy Yip — SAP Labs, Palo Alto, CA, USA
  Wendy is a Senior Data Scientist at SAP. She has a background in astrophysics and previously conducted machine learning and data-intensive science research at Johns Hopkins University.
  [email protected]
Copyright (c) 2025 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the LICENSE file.