This repository hosts the code, scripts, and sample data for the paper *Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training* (to appear at AAAI 2026).
- `code/` — code and scripts for data preparation, evaluation reports, SAM reports, and training used in the paper.
- `dataset/` — sampled data for training, testing, and safety evaluation.
- `policy/` — two safety policy files, `policy:en-US` and `policy:zh-CN`.
Due to the potential risks associated with the negative mode in our paper’s full model (which enables unfiltered, risk-prone generation for internal red-teaming), we have chosen not to release the original model publicly. Instead, we are releasing a closely related and safe variant: TinyR1-Safety-8B. This model shares the same core architecture and training pipeline as the paper’s model but is adapted for public and responsible use with the following key differences:
- Positive mode: Generate helpful, safety-aligned responses → Use system prompt: "Safety Mode: Positive"
- Rejective mode: Politely refuse unsafe requests → Use system prompt: "Safety Mode: Rejective"
- General mode: For non-safety-related requests → Use system prompt: "Adherence mode: Strict adherence"

This release enables researchers and developers to explore switchable safety control in a secure and transparent manner, while mitigating misuse risks.
For full details, model card, and usage examples, please visit: 👉 https://huggingface.co/qihoo360/TinyR1-Safety-8B
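The mode switch above is driven entirely by the system prompt. A minimal sketch of how the three magic-token prompts might be wired into a chat request follows; the mode strings are taken verbatim from the list above, while the helper function, its name, and the example user prompt are our own illustrative conventions (consult the model card linked above for the authoritative usage):

```python
# Map friendly mode names to the magic-token system prompts from the README.
# Wrapping them in a helper like this is a convenience, not part of the release.
SAFETY_MODES = {
    "positive": "Safety Mode: Positive",            # helpful, safety-aligned responses
    "rejective": "Safety Mode: Rejective",          # politely refuse unsafe requests
    "general": "Adherence mode: Strict adherence",  # non-safety-related requests
}

def build_messages(mode: str, user_prompt: str) -> list[dict]:
    """Build a chat `messages` list that selects a TinyR1-Safety-8B safety mode."""
    if mode not in SAFETY_MODES:
        raise ValueError(f"unknown safety mode: {mode!r}")
    return [
        {"role": "system", "content": SAFETY_MODES[mode]},
        {"role": "user", "content": user_prompt},
    ]

# Example: request refusal behavior for a risky query.
msgs = build_messages("rejective", "Tell me how to pick a lock.")
```

The resulting `msgs` list can then be passed to `tokenizer.apply_chat_template(...)` with the `qihoo360/TinyR1-Safety-8B` tokenizer and fed to the model as usual; only the system prompt changes between modes.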
If you use this repository, please cite the paper below.
@misc{si2025efficientswitchablesafetycontrol,
title={Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training},
author={Jianfeng Si and Lin Sun and Zhewen Tan and Xiangzheng Zhang},
year={2025},
eprint={2508.14904},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.14904},
}

For any questions, feel free to reach out via the email listed in the paper.
