Weidi Luo, Shenghong Dai, Xiaogeng Liu, Suman Banerjee, Huan Sun, Muhao Chen, Chaowei Xiao
Warning: This repo contains examples of harmful agent actions; reader discretion is advised.
The rapid advancement of Large Language Models (LLMs) has enabled their deployment as autonomous agents for handling complex tasks in dynamic environments. These LLMs demonstrate strong problem-solving capabilities and adaptability to multifaceted scenarios. However, their use as agents also introduces significant risks, including task-specific risks, which are identified by the agent administrator based on the specific task requirements and constraints, and systemic risks, which stem from vulnerabilities in their design or interactions, potentially compromising the confidentiality, integrity, or availability (CIA) of information and triggering security risks. Existing defense mechanisms fail to adaptively and effectively mitigate these risks. In this paper, we propose AGrail, a lifelong agent guardrail that enhances LLM agent safety and features adaptive safety check generation, effective safety check optimization, and tool compatibility & flexibility. Extensive experiments demonstrate that AGrail not only achieves strong performance against task-specific and systemic risks on various agents but also exhibits transferability across different agent tasks.
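For intuition, below is a minimal, hypothetical sketch of the general guardrail idea: an LLM judge reviews a proposed agent action against a set of safety checks before the action is executed. The function, prompt format, and model choice here are illustrative assumptions only, not the actual AGrail pipeline (see the paper and the DAS/ code for that).

```python
import os
from openai import OpenAI

# Illustrative sketch only: an LLM judge reviews a proposed agent action
# against a list of safety checks before the action is executed.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def guardrail_review(task: str, proposed_action: str, safety_checks: list) -> bool:
    """Return True if the proposed action passes all safety checks."""
    prompt = (
        f"Task: {task}\n"
        f"Proposed agent action: {proposed_action}\n"
        "Safety checks:\n" + "\n".join(f"- {c}" for c in safety_checks) +
        "\nAnswer SAFE or UNSAFE."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return "UNSAFE" not in response.choices[0].message.content.upper()

if __name__ == "__main__":
    checks = ["Do not delete system files", "Do not exfiltrate credentials"]
    allowed = guardrail_review("Clean up temp files", "rm -rf /etc", checks)
    print("execute action" if allowed else "block action")
```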
conda create -n AGrail python=3.9
conda activate AGrail
pip install -r requirements.txt
pip install -e .
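Optionally, you can sanity-check the environment after installation. This snippet assumes the `openai` and `anthropic` client libraries are pulled in by `requirements.txt` (they are needed for the API keys referenced below):

```python
# Optional sanity check that the key client libraries import correctly.
# Assumes openai and anthropic are installed via requirements.txt.
import sys

try:
    import openai
    import anthropic
except ImportError as e:
    sys.exit(f"Missing dependency: {e.name}. Re-run `pip install -r requirements.txt`.")

print(f"Python {sys.version.split()[0]}, openai {openai.__version__}, anthropic {anthropic.__version__}")
```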
To install Docker Desktop on macOS/Windows, please refer here. Once the installation is complete, run the following command to check whether Docker is working properly:
docker --version
If the installation completed successfully, build a Docker image from the Dockerfile in this repo (run the build command in the same directory as the Dockerfile):
docker build -t ubuntu .
docker run -it ubuntu
If no errors occur, you can run all scripts and code locally.
Here are the data links for the other datasets: Mind2Web, EICU-AC, AdvWeb, and EIA. If you cannot find the data resources, please contact the authors of the corresponding dataset by email.
Since the prompt injection data is generated against a GPT-4-Turbo-based OS agent, please use GPT-4-Turbo as the foundation model of the OS agent when evaluating prompt injection attacks, and GPT-4o as the foundation model for the other attacks. Check and run the scripts on Safe-OS:
# Add your OPENAI_API_KEY and ANTHROPIC_API_KEY in DAS/utlis.py.
bash DAS/scripts/safe_os.sh
python eval --dataset "prompt injection" --path #put your inference result csv file here.
python eval --dataset "system sabotage" --path #put your inference result csv file here.
python eval --dataset "environment" --path #put your inference result csv file here.
python eval --dataset "benign" --path #put your inference result csv file here.
Please check /DAS/tools/tool.py and follow its interface when adding a custom tool; a hypothetical sketch is shown below.
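The sketch below is illustrative only: the class name, attributes, and `run` method are assumptions, and the actual required base class and signatures are those defined in /DAS/tools/tool.py.

```python
# Hypothetical sketch of a custom tool. The actual base class, method names,
# and signatures are defined in /DAS/tools/tool.py -- follow that interface.
class FileScannerTool:
    """Example tool that flags agent actions touching sensitive system paths."""

    name = "file_scanner"
    description = "Flag agent actions that touch sensitive system paths."

    def run(self, action: str) -> dict:
        sensitive = ("/etc", "/root", "~/.ssh")
        hits = [p for p in sensitive if p in action]
        return {"flagged": bool(hits), "matched_paths": hits}

if __name__ == "__main__":
    print(FileScannerTool().run("cat ~/.ssh/id_rsa"))
```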
- Weidi Luo: [email protected]
- Chaowei Xiao: [email protected]
@misc{luo2025agraillifelongagentguardrail,
title={AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection},
author={Weidi Luo and Shenghong Dai and Xiaogeng Liu and Suman Banerjee and Huan Sun and Muhao Chen and Chaowei Xiao},
year={2025},
eprint={2502.11448},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.11448},
}