Skip to content

Qihoo360/LLMs-Safety-Control

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Language: 中文

Repository Overview

This repository hosts the code, scripts and sample data for the paper Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training (to be appeared in AAAI 2026). Link

Multi-Directional Distillation and Magic-Token-Guided Co-Training Framework

Repository Layout

  • code/ — includes code & scripts for data preparation, evaluation report, SAM report, and training used in the paper.
  • dataset/ — includes sampled dataset for training, testing, and safety evaluation samples.
  • policy/ — includes two safety policy files for policy:en-US and policy:zh-CN.

Model release:

Due to the potential risks associated with the negative mode in our paper’s full model (which enables unfiltered, risk-prone generation for internal red-teaming), we have chosen not to release the original model publicly. Instead, we are releasing a closely related and safe variant: TinyR1-Safety-8B. This model shares the same core architecture and training pipeline as the paper’s model but is adapted for public and responsible use with the following key differences:

1. No secret "magic tokens" — control is performed via plain-text system prompts.

2. Only safe behaviors are exposed:

  • Positive mode: Generate helpful, safety-aligned responses → Use system prompt: "Safety Mode: Positive"
  • Rejective mode: Politely refuse unsafe requests → Use system prompt: "Safety Mode: Rejective"
  • General mode: For non-safety-related requests → Use system prompt: "Adherence mode: Strict adherence" This release enables researchers and developers to explore switchable safety control in a secure and transparent manner, while mitigating misuse risks.

For full details, model card, and usage examples, please visit: 👉 https://huggingface.co/qihoo360/TinyR1-Safety-8B

Citation

If you use this repository, please cite the paper below.

@misc{si2025efficientswitchablesafetycontrol,
      title={Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training}, 
      author={Jianfeng Si and Lin Sun and Zhewen Tan and Xiangzheng Zhang},
      year={2025},
      eprint={2508.14904},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14904}, 
}

Contact

For any question, feel free to reach out via the email listed in the paper.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages