This repository hosts the code, scripts, and sample data for the paper *Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training* (to appear at AAAI 2026).
- `code/` — code and scripts for data preparation, evaluation reports, SAM reports, and training used in the paper.
- `dataset/` — sampled data for training, testing, and safety evaluation.
- `policy/` — two safety policy files, `policy:en-US` and `policy:zh-CN`.
Due to the potential risks associated with the negative mode in our paper’s full model (which enables unfiltered, risk-prone generation for internal red-teaming), we have chosen not to release the original model publicly. Instead, we are releasing a closely related and safe variant: TinyR1-Safety-8B. This model shares the same core architecture and training pipeline as the paper’s model but is adapted for public and responsible use with the following key differences:
- Positive mode: Generate helpful, safety-aligned responses → Use system prompt: "Safety Mode: Positive"
- Rejective mode: Politely refuse unsafe requests → Use system prompt: "Safety Mode: Rejective"
- General mode: For non-safety-related requests → Use system prompt: "Adherence mode: Strict adherence"

This release enables researchers and developers to explore switchable safety control in a secure and transparent manner, while mitigating misuse risks.
For full details, model card, and usage examples, please visit: 👉 https://huggingface.co/qihoo360/TinyR1-Safety-8B
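The mode switch above is driven entirely by the system prompt. A minimal sketch of how the three magic-token prompts might be wired into a chat request follows; the mode strings are taken verbatim from the list above, while the helper function, its name, and the example user prompt are our own illustrative conventions (consult the model card linked above for the authoritative usage):

```python
# Map friendly mode names to the magic-token system prompts from the README.
# Wrapping them in a helper like this is a convenience, not part of the release.
SAFETY_MODES = {
    "positive": "Safety Mode: Positive",            # helpful, safety-aligned responses
    "rejective": "Safety Mode: Rejective",          # politely refuse unsafe requests
    "general": "Adherence mode: Strict adherence",  # non-safety-related requests
}

def build_messages(mode: str, user_prompt: str) -> list[dict]:
    """Build a chat `messages` list that selects a TinyR1-Safety-8B safety mode."""
    if mode not in SAFETY_MODES:
        raise ValueError(f"unknown safety mode: {mode!r}")
    return [
        {"role": "system", "content": SAFETY_MODES[mode]},
        {"role": "user", "content": user_prompt},
    ]

# Example: request refusal behavior for a risky query.
msgs = build_messages("rejective", "Tell me how to pick a lock.")
```

The resulting `msgs` list can then be passed to `tokenizer.apply_chat_template(...)` with the `qihoo360/TinyR1-Safety-8B` tokenizer and fed to the model as usual; only the system prompt changes between modes.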
If you use this repository, please cite the paper below.
@misc{si2025efficientswitchablesafetycontrol,
title={Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training},
author={Jianfeng Si and Lin Sun and Zhewen Tan and Xiangzheng Zhang},
year={2025},
eprint={2508.14904},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.14904},
}

For any questions, feel free to reach out via the email listed in the paper.
