princeton-nlp/AttentionRules

Extracting rule-based descriptions of attention features in transformers

This repository contains the code for our paper, "Extracting rule-based descriptions of attention features in transformers". Please see our paper for more details.

Quick links

- Setup
- Attention output SAEs
- Data
- Rule extraction
- Questions?
- Citation

Setup

Install PyTorch and then install the remaining requirements: pip install -r requirements.txt. This code was tested using Python 3.12 and PyTorch version 2.3.1.
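
For example, a minimal setup might look like the following (the environment name is illustrative, and you should install the PyTorch build that matches your CUDA version):

# Create an environment (name is illustrative) and install dependencies.
conda create -n attention-rules python=3.12
conda activate attention-rules
pip install torch==2.3.1          # version the code was tested with
pip install -r requirements.txt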

Attention output SAEs

We train attention output SAEs for every attention head in GPT-2 small, using a fork of https://github.com/ckkissane/attention-output-saes. These SAEs can be downloaded from: https://huggingface.co/danf0/attention-head-saes/.
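
For example, the SAEs can be fetched with git (a sketch; this assumes git-lfs is installed, and the files can also be downloaded through the Hugging Face web interface or huggingface_hub):

# Requires git-lfs for the checkpoint files.
git lfs install
git clone https://huggingface.co/danf0/attention-head-saes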

Data

Code for generating datasets of feature activations can be found in src/get_exemplars.py. See scripts/generate_data.sh for the command to generate the datasets used in our paper, which are based on OpenWebText.
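
For example, to regenerate the datasets used in the paper:

# Generates the OpenWebText-based feature-activation datasets.
bash scripts/generate_data.sh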

Rule extraction

Code for extracting and evaluating skip-gram rules can be found in src/run_rules.py. For example, the following command will extract rules for 10 features from head 0 in layer 0.

python src/run_rules.py \
    --layer 0 \
    --head 0 \
    --num_features 10 \
    --rule_type "v1" \
    --output_dir "output/skipgrams/L0H0";

Code for finding and generating rules containing "distractor" features is in src/find_distractors.py and src/generate_distractors.py. The scripts directory contains example commands for running these scripts; a hypothetical invocation is sketched below.
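
A hypothetical invocation, mirroring the rule-extraction command above (the flags here are illustrative assumptions, not the actual interface; see the scripts directory for the exact commands):

# Hypothetical flags; see scripts/ for the real commands.
python src/find_distractors.py \
    --layer 0 \
    --head 0 \
    --output_dir "output/distractors/L0H0"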

Questions?

If you have any questions about the code or paper, please email Dan ([email protected]) or open an issue.

Citation

@article{friedman2025extracting,
    title={Extracting rule-based descriptions of attention features in transformers},
    author={Friedman, Dan and Wettig, Alexander and Bhaskar, Adithya and Chen, Danqi},
    journal={arXiv preprint},
    year={2025}
}
