A mechanistic interpretability project to understand how GPT-2 Small performs numerical comparisons
This project builds upon the foundational research and methodologies developed by Neel Nanda and the broader mechanistic interpretability community. We are deeply grateful for:
- Neel Nanda's pioneering work in transformer mechanistic interpretability and the development of TransformerLens
- The TransformerLens library that made this analysis possible
- Activation patching techniques refined through community research
- Circuit analysis methodologies from the interpretability research community
- The broader mechanistic interpretability research that provides the theoretical foundation for this work
Special acknowledgment to Neel Nanda for advancing the field of mechanistic interpretability and making tools accessible to researchers worldwide.
- Overview
- Research Themes
- Methodology Workflow
- Installation
- Quick Start
- Detailed Usage
- Project Structure
- Results and Findings
- Contributing
- References and Further Reading
- License
This project implements a comprehensive mechanistic interpretability analysis to reverse-engineer the "greater than" circuit in GPT-2 Small. Using activation patching techniques, we identify and analyze the specific neurons, attention heads, and pathways responsible for numerical comparison capabilities.
Identify and understand the circuit responsible for the "greater than" capability in GPT-2 Small, mapping out the complete computational pathway from input to output.
- Which specific components (attention heads, MLP layers) are crucial for greater-than comparisons?
- How do these components interact to perform the numerical reasoning?
- What is the information flow through the identified circuit?
- How robust is this circuit to different inputs and perturbations?
- Activation Patching: Systematically replace activations to isolate critical components (a minimal sketch of the underlying technique follows this list)
- Component Analysis: Identify attention heads, MLP layers, and residual connections
- Layer-wise Investigation: Understand how processing evolves through model depth
- Information Flow: Trace how numerical information propagates through the network
- Attention Patterns: Analyze what tokens different heads attend to during comparison
- Feature Detection: Understand what features each component extracts
- Necessity Testing: Verify that identified components are essential for the task
- Sufficiency Analysis: Determine if the identified circuit is complete
- Robustness Evaluation: Test circuit behavior under various conditions
- Cross-task Transfer: How does the circuit behave on related numerical tasks?
- Scale Invariance: Does the circuit work across different number ranges?
- Template Robustness: Performance across different prompt formats
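As referenced under Activation Patching above, the project's ActivationPatcher automates this process; the sketch below shows the underlying technique using TransformerLens directly. The prompt wording and the choice of layer are illustrative assumptions, not confirmed circuit components.

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

# Two prompts that differ only in the first number, flipping the answer.
clean_tokens = model.to_tokens("Is 87 greater than 23? Answer:")
corrupted_tokens = model.to_tokens("Is 12 greater than 23? Answer:")

# Cache every activation from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_attn_out(activation, hook, pos=-1):
    # Overwrite the corrupted run's attention output at one position with
    # the clean activation; a large shift in the output logits suggests
    # this (layer, position) matters for the comparison.
    activation[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return activation

layer = 5  # illustrative layer
patched_logits = model.run_with_hooks(
    corrupted_tokens,
    fwd_hooks=[(utils.get_act_name("attn_out", layer), patch_attn_out)],
)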
graph TB
A[Model Setup<br/>GPT-2 Small + TransformerLens] --> B[Prompt Design<br/>Number pairs + True/False labels]
B --> C[Activation Patching<br/>Systematic component isolation]
C --> D[Circuit Analysis<br/>Component identification & ranking]
D --> E[Information Flow Analysis<br/>Attention patterns & pathways]
E --> F[Circuit Validation<br/>Necessity & sufficiency testing]
F --> G[Visualization<br/>Interactive diagrams & plots]
G --> H[Results & Insights<br/>Complete circuit understanding]
style A fill:#e1f5fe
style C fill:#f3e5f5
style D fill:#e8f5e8
style F fill:#fff3e0
style H fill:#fce4ec
- Model Loading: Initialize GPT-2 Small with interpretability-friendly settings
- Prompt Generation: Create balanced datasets of numerical comparisons
- Baseline Testing: Establish model performance on the task (a small baseline sketch follows these steps)
- Comprehensive Patching: Test all major components across all positions
- Effect Ranking: Identify components with highest impact on task performance
- Component Clustering: Group related components by function and layer
- Attention Analysis: Study attention patterns in critical heads
- Activation Inspection: Examine internal representations
- Information Flow: Map connections between components
- Ablation Studies: Remove components to test necessity
- Robustness Testing: Evaluate performance under perturbations
- Generalization Testing: Test on related numerical tasks
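A minimal sketch of the Phase 1 baseline test referenced above. The prompt template and the single-token answers " True" / " False" are assumptions for illustration; the project's PromptGenerator may phrase things differently.

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
true_id = model.to_single_token(" True")
false_id = model.to_single_token(" False")

correct = 0
number_pairs = [(87, 23), (4, 91), (56, 55), (10, 70)]
for a, b in number_pairs:
    logits = model(model.to_tokens(f"Is {a} greater than {b}? Answer:"))
    # Does the model prefer " True" over " False" at the final position?
    pred_true = (logits[0, -1, true_id] > logits[0, -1, false_id]).item()
    correct += int(pred_true == (a > b))

print(f"Baseline accuracy on {len(number_pairs)} pairs: {correct / len(number_pairs):.2f}")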
- Python 3.8+
- CUDA-capable GPU (recommended)
- 8GB+ RAM
# Clone the repository
git clone https://github.com/ashioyajotham/greater-than-circuit.git
cd greater-than-circuit
# Install dependencies
pip install -r requirements.txt
# Or install with development dependencies
pip install -e ".[dev]"
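After installing, a quick environment check confirms that the core libraries import and whether CUDA is visible (package names taken from the key libraries acknowledged further below):

import torch
import transformer_lens
import circuitsvis
import plotly

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())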
from src.model_setup import ModelSetup
# Test model loading
setup = ModelSetup()
model = setup.load_model()
setup.print_model_info()
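If you want to verify the environment without the project wrapper, ModelSetup is assumed to perform roughly the following load under the hood (a sketch using TransformerLens directly):

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small
print(model.cfg.n_layers, model.cfg.n_heads, model.cfg.d_model)  # 12 layers, 12 heads, d_model 768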
from src import ModelSetup, PromptGenerator, ActivationPatcher, CircuitAnalyzer
# Initialize components
setup = ModelSetup()
model = setup.load_model()
generator = PromptGenerator(seed=42)
patcher = ActivationPatcher(model)
analyzer = CircuitAnalyzer(model)
# Generate test data
pairs = generator.create_prompt_pairs(n_pairs=10)
clean_tokens = model.to_tokens(pairs[0][0].prompt_text + " ")
corrupted_tokens = model.to_tokens(pairs[0][1].prompt_text + " ")
# Run activation patching
results = patcher.comprehensive_patching(
corrupted_tokens=corrupted_tokens,
clean_tokens=clean_tokens,
component_types=["attn", "mlp"]
)
# Analyze results
components = analyzer.identify_circuit_components(results)
summary = analyzer.create_circuit_summary(results)
print(f"Identified {len(components)} critical components")
print(f"Circuit spans {summary['circuit_depth']} layers")
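How the patching effects above are scored is not shown here; a common metric in circuit analysis is the logit difference between the candidate answer tokens. The sketch below assumes the prompts end with the single-token answers " True" / " False" (an assumption about the prompt format) and reuses model, clean_tokens, and corrupted_tokens from above.

def logit_diff(logits):
    # Difference between the assumed answer-token logits at the final
    # position; positive values favor " True".
    true_id = model.to_single_token(" True")
    false_id = model.to_single_token(" False")
    return (logits[0, -1, true_id] - logits[0, -1, false_id]).item()

clean_score = logit_diff(model(clean_tokens))
corrupted_score = logit_diff(model(corrupted_tokens))
print(f"Logit diff (clean vs corrupted): {clean_score:.2f} vs {corrupted_score:.2f}")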
from src.visualization import CircuitVisualizer
visualizer = CircuitVisualizer()
# Plot patching results
fig1 = visualizer.plot_patching_results(results, save_path="results/patching.png")
# Create circuit diagram
fig2 = visualizer.plot_circuit_diagram(components, save_path="results/circuit.html")
# Generate interactive dashboard
fig3 = visualizer.create_summary_dashboard(summary, save_path="results/dashboard.html")
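Attention heatmaps can also be rendered directly with Circuitsvis; a sketch reusing model and clean_tokens from the previous example (layer 5 is an arbitrary choice, best viewed in a notebook):

import circuitsvis as cv

_, cache = model.run_with_cache(clean_tokens)
cv.attention.attention_patterns(
    tokens=model.to_str_tokens(clean_tokens),
    attention=cache["pattern", 5][0],  # [head, dest_pos, src_pos] for layer 5
)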
from src.circuit_validation import CircuitValidator
validator = CircuitValidator(model, generator, patcher, analyzer)
# Run full validation suite
validation_results = validator.run_comprehensive_validation(n_test_examples=200)
# Generate detailed report
report = validator.generate_validation_report(
validation_results,
save_path="results/validation_report.txt"
)
print(report)
# Generate specific test cases
generator = PromptGenerator(seed=42)
# Balanced dataset
examples = generator.generate_balanced_dataset(n_examples=1000, num_range=(1, 100))
# Edge cases
edge_cases = generator.generate_edge_cases()
# Custom template
custom_examples = generator.generate_basic_examples(
n_examples=100,
template_idx=1 # "Is X greater than Y?"
)
# Print statistics
generator.print_statistics(examples)
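For orientation, a hand-rolled version of a balanced dataset might look like the sketch below; the prompt template is an assumption, and PromptGenerator's actual templates may differ.

import random

random.seed(42)

def make_example(num_range=(1, 100)):
    # Draw two distinct numbers; the pair is unordered, so labels are
    # balanced in expectation.
    a, b = random.sample(range(*num_range), 2)
    return {"prompt": f"Is {a} greater than {b}? Answer:", "label": a > b}

examples = [make_example() for _ in range(1000)]
print(sum(e["label"] for e in examples), "True labels out of", len(examples))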
# Analyze specific attention heads
attention_patterns = analyzer.analyze_attention_patterns(
tokens=clean_tokens,
target_heads=[(5, 7), (3, 2), (7, 11)] # Specific (layer, head) pairs
)
# Compute activation attributions
attributions = analyzer.compute_activation_attribution(
clean_tokens=clean_tokens,
corrupted_tokens=corrupted_tokens
)
# Information flow analysis
flow_matrix = analyzer.find_information_flow(
tokens=clean_tokens,
source_components=["L3_mlp", "L5H7"],
target_components=["L7H2", "L9_mlp"]
)
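# Underneath these analyzer calls, every activation is available from a
# single cached forward pass with TransformerLens; a minimal sketch
# (the layer/head indices are illustrative, not confirmed circuit members):
_, cache = model.run_with_cache(clean_tokens)
pattern_l5 = cache["pattern", 5]  # attention patterns, [batch, head, dest_pos, src_pos]
head_out_l5 = cache["z", 5]       # per-head outputs, [batch, pos, head, d_head]
mlp_out_l3 = cache["mlp_out", 3]  # MLP output added to the residual stream, [batch, pos, d_model]
print(pattern_l5[0, 7].shape)     # e.g. layer 5, head 7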
# Test specific robustness
robustness_results = validator.validate_robustness(
base_examples=examples[:100],
perturbation_types=["number_range", "prompt_template"]
)
# Circuit necessity with custom threshold
necessity_result = validator.validate_circuit_necessity(
test_examples=examples[:50],
circuit_components=components,
ablation_type="mean_ablation"
)
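For reference, "mean_ablation" above corresponds to replacing a component's activation with its mean over a reference batch instead of zeroing it. A minimal TransformerLens sketch of the idea, reusing model, pairs, and clean_tokens from the earlier examples (the layer and head are illustrative; CircuitValidator is assumed to implement something along these lines):

from transformer_lens import utils

layer, head = 5, 7  # illustrative head
ref_tokens = model.to_tokens([p[0].prompt_text + " " for p in pairs])
_, ref_cache = model.run_with_cache(ref_tokens)
mean_z = ref_cache["z", layer].mean(dim=(0, 1))  # mean over batch and position

def mean_ablate_head(z, hook):
    # Replace one head's output with its reference mean at every position.
    z[:, :, head, :] = mean_z[head]
    return z

ablated_logits = model.run_with_hooks(
    clean_tokens,
    fwd_hooks=[(utils.get_act_name("z", layer), mean_ablate_head)],
)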
greater-than-circuit/
├── src/ # Core implementation
│ ├── __init__.py # Package initialization
│ ├── model_setup.py # Model loading & configuration
│ ├── prompt_design.py # Test case generation
│ ├── activation_patching.py # Patching experiments
│ ├── circuit_analysis.py # Circuit identification & analysis
│ ├── visualization.py # Plotting & visualization tools
│ └── circuit_validation.py # Validation & testing framework
├── notebooks/ # Jupyter notebooks for exploration
├── tests/ # Unit tests
├── results/ # Output directory for results
├── requirements.txt # Python dependencies
├── pyproject.toml # Project configuration
└── README.md # This file
- model_setup.py: Handles GPT-2 Small loading with TransformerLens
- prompt_design.py: Generates numerical comparison test cases
- activation_patching.py: Implements activation patching experiments
- circuit_analysis.py: Analyzes patching results to identify circuits
- visualization.py: Creates plots and interactive visualizations
- circuit_validation.py: Comprehensive testing and validation
Based on mechanistic interpretability research, we expect to find:
- Early Layer Processing: Token recognition and basic numerical encoding
- Middle Layer Comparison: Attention heads that compare numerical values
- Late Layer Integration: MLP layers that compute the final comparison result
- Output Projection: Final layers that map to True/False tokens
- Baseline Accuracy: 85-95% on balanced greater-than tasks
- Circuit Coverage: 15-25 critical components across 6-8 layers
- Attention Patterns: Specific heads attending to numerical tokens
- Robustness: 10-20% accuracy drop under perturbations
The project generates several types of visualizations:
- Patching Results Plot: Bar charts showing component importance
- Circuit Diagram: Interactive network showing component connections
- Attention Heatmaps: Visualization of attention patterns
- Layer Analysis: Charts showing layer-wise contributions
- Summary Dashboard: Comprehensive interactive overview
We welcome contributions to improve the analysis and extend the methodology!
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-analysis)
- Commit your changes (git commit -m 'Add amazing analysis')
- Push to the branch (git push origin feature/amazing-analysis)
- Open a Pull Request
- New Analysis Techniques: Novel interpretability methods
- Additional Validation: More comprehensive testing frameworks
- Visualization Improvements: Better plots and interactive tools
- Documentation: Examples, tutorials, and guides
- Performance Optimization: Faster patching and analysis methods
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Format code
black src/
flake8 src/
- "A Mathematical Framework for Transformer Circuits" - Anthropic
- "In-context Learning and Induction Heads" - Anthropic
- "Locating and Editing Factual Associations in GPT" - ROME paper
- "Interpretability in the Wild" - Various mechanistic interpretability papers
- TransformerLens Documentation: https://transformerlens.readthedocs.io/
- Neel Nanda's Blog: Mechanistic interpretability tutorials and insights
- Anthropic Interpretability Research: Circuit analysis methodologies
- TransformerLens: Core interpretability library
- Circuitsvis: Visualization tools for transformer circuits
- PyTorch: Deep learning framework
- Plotly: Interactive visualization library
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work in your research, please cite:
@software{greater_than_circuit_2025,
title={Reverse-Engineering the Greater Than Circuit in GPT-2 Small},
author={Ashioya Jotham Victor},
year={2025},
url={https://github.com/ashioyajotham/greater-than-circuit},
note={Built with TransformerLens by Neel Nanda}
}
- Start Small: Begin with a few prompt pairs and basic patching
- Validate Early: Check that your prompts work as expected
- Visualize Often: Use plots to understand your results
- Document Everything: Keep notes on interesting findings
- Token Alignment: Ensure clean and corrupted prompts have the same token structure (a quick check follows this list)
- Batch Effects: Test with different random seeds for robustness
- Memory Usage: Large patching experiments can be memory-intensive
- Interpretation: Always validate circuit hypotheses with multiple tests
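A quick way to catch the token-alignment pitfall mentioned above is to compare tokenizations directly (prompt wording illustrative):

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
clean_tokens = model.to_tokens("Is 87 greater than 23? Answer:")
corrupted_tokens = model.to_tokens("Is 12 greater than 23? Answer:")
assert clean_tokens.shape == corrupted_tokens.shape, "Prompts tokenize to different lengths"
print(model.to_str_tokens(clean_tokens))
print(model.to_str_tokens(corrupted_tokens))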
- Multi-token Analysis: Extend beyond single token positions
- Cross-model Comparison: Test circuits across different model sizes
- Causal Interventions: Use more sophisticated intervention techniques
- Mechanistic Hypotheses: Develop and test specific theories about computation
This research was conducted using the TransformerLens library and methodology developed by Neel Nanda and the mechanistic interpretability community.