This repository contains the code for our paper, "Extracting rule-based descriptions of attention features in transformers". Please see our paper for more details.
Install PyTorch, and then install the remaining requirements: pip install -r requirements.txt
This code was tested using Python 3.12 and PyTorch version 2.3.1.
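For example, assuming a pip-based environment (the exact PyTorch command depends on your platform and CUDA version; see pytorch.org for alternatives), the setup might look like:

pip install torch==2.3.1
pip install -r requirements.txt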
We train attention output SAEs for every attention head in GPT-2 small, using a fork of https://github.com/ckkissane/attention-output-saes. These SAEs can be downloaded from: https://huggingface.co/danf0/attention-head-saes/.
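For example, one way to fetch the SAE checkpoints is with the Hugging Face CLI; the --local-dir below is an arbitrary choice, not a path the code requires:

huggingface-cli download danf0/attention-head-saes --local-dir attention-head-saes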
Code for generating datasets of feature activations can be found in src/get_exemplars.py. See scripts/generate_data.sh for the command to generate the datasets used in our paper, which are based on OpenWebText. The feature datasets used in the paper can also be downloaded directly from HuggingFace.
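For instance, assuming you run from the repository root (any dataset paths or options the script expects are set inside the script itself), the provided script can be invoked directly:

bash scripts/generate_data.sh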
Code for extracting and evaluating skip-gram rules can be found in src/run_rules.py. For example, the following command will extract rules for 10 features from head 0 in layer 0.
python src/run_rules.py \
--layer 0 \
--head 0 \
--num_features 10 \
--rule_type "v1" \
--output_dir "output/skipgrams/L0H0";
Code for finding and generating rules containing "distractor" features is in src/find_distractors.py and src/generate_distractors.py. The scripts directory contains example commands for running these scripts.
If you have any questions about the code or paper, please email Dan ([email protected]) or open an issue.
If you use this code in your research, please cite our paper:

@article{friedman2025extracting,
  title={Extracting rule-based descriptions of attention features in transformers},
  author={Friedman, Dan and Wettig, Alexander and Bhaskar, Adithya and Chen, Danqi},
  journal={arXiv preprint},
  year={2025}
}