A mechanistic interpretability project to understand how GPT-2 Small performs numerical comparisons
This project builds upon the foundational research and methodologies developed by Neel Nanda and the broader mechanistic interpretability community. We are deeply grateful for:
- Neel Nanda's pioneering work in transformer mechanistic interpretability and the development of TransformerLens
- The TransformerLens library that made this analysis possible
- Activation patching techniques refined through community research
- Circuit analysis methodologies from the interpretability research community
- The broader mechanistic interpretability research that provides the theoretical foundation for this work
Special acknowledgment to Neel Nanda for advancing the field of mechanistic interpretability and making tools accessible to researchers worldwide.
- Overview
- Research Themes
- Methodology Workflow
- Installation
- Quick Start
- Detailed Usage
- Project Structure
- Results and Findings
- Contributing
- References and Further Reading
- License
This project implements a comprehensive mechanistic interpretability analysis to reverse-engineer the "greater than" circuit in GPT-2 Small. Using activation patching techniques, we identify and analyze the specific neurons, attention heads, and pathways responsible for numerical comparison capabilities.
Identify and understand the circuit responsible for the "greater than" capability in GPT-2 Small, mapping out the complete computational pathway from input to output.
- Which specific components (attention heads, MLP layers) are crucial for greater-than comparisons?
- How do these components interact to perform the numerical reasoning?
- What is the information flow through the identified circuit?
- How robust is this circuit to different inputs and perturbations?
- Activation Patching: Systematically replace activations to isolate critical components (a minimal sketch of the underlying technique follows this list)
- Component Analysis: Identify attention heads, MLP layers, and residual connections
- Layer-wise Investigation: Understand how processing evolves through model depth
- Information Flow: Trace how numerical information propagates through the network
- Attention Patterns: Analyze what tokens different heads attend to during comparison
- Feature Detection: Understand what features each component extracts
- Necessity Testing: Verify that identified components are essential for the task
- Sufficiency Analysis: Determine if the identified circuit is complete
- Robustness Evaluation: Test circuit behavior under various conditions
- Cross-task Transfer: How does the circuit behave on related numerical tasks?
- Scale Invariance: Does the circuit work across different number ranges?
- Template Robustness: Performance across different prompt formats
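As referenced under Activation Patching above, the project's ActivationPatcher automates this process; the sketch below shows the underlying technique using TransformerLens directly. The prompt wording and the choice of layer are illustrative assumptions, not confirmed circuit components.

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

# Two prompts that differ only in the first number, flipping the answer.
clean_tokens = model.to_tokens("Is 87 greater than 23? Answer:")
corrupted_tokens = model.to_tokens("Is 12 greater than 23? Answer:")

# Cache every activation from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_attn_out(activation, hook, pos=-1):
    # Overwrite the corrupted run's attention output at one position with
    # the clean activation; a large shift in the output logits suggests
    # this (layer, position) matters for the comparison.
    activation[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return activation

layer = 5  # illustrative layer
patched_logits = model.run_with_hooks(
    corrupted_tokens,
    fwd_hooks=[(utils.get_act_name("attn_out", layer), patch_attn_out)],
)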
graph TB
A[Model Setup<br/>GPT-2 Small + TransformerLens] --> B[Prompt Design<br/>Number pairs + True/False labels]
B --> C[Activation Patching<br/>Systematic component isolation]
C --> D[Circuit Analysis<br/>Component identification & ranking]
D --> E[Information Flow Analysis<br/>Attention patterns & pathways]
E --> F[Circuit Validation<br/>Necessity & sufficiency testing]
F --> G[Visualization<br/>Interactive diagrams & plots]
G --> H[Results & Insights<br/>Complete circuit understanding]
style A fill:#e1f5fe
style C fill:#f3e5f5
style D fill:#e8f5e8
style F fill:#fff3e0
style H fill:#fce4ec
- Model Loading: Initialize GPT-2 Small with interpretability-friendly settings
- Prompt Generation: Create balanced datasets of numerical comparisons
- Baseline Testing: Establish model performance on the task (a small baseline sketch follows these steps)
- Comprehensive Patching: Test all major components across all positions
- Effect Ranking: Identify components with highest impact on task performance
- Component Clustering: Group related components by function and layer
- Attention Analysis: Study attention patterns in critical heads
- Activation Inspection: Examine internal representations
- Information Flow: Map connections between components
- Ablation Studies: Remove components to test necessity
- Robustness Testing: Evaluate performance under perturbations
- Generalization Testing: Test on related numerical tasks
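A minimal sketch of the Phase 1 baseline test referenced above. The prompt template and the single-token answers " True" / " False" are assumptions for illustration; the project's PromptGenerator may phrase things differently.

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
true_id = model.to_single_token(" True")
false_id = model.to_single_token(" False")

correct = 0
number_pairs = [(87, 23), (4, 91), (56, 55), (10, 70)]
for a, b in number_pairs:
    logits = model(model.to_tokens(f"Is {a} greater than {b}? Answer:"))
    # Does the model prefer " True" over " False" at the final position?
    pred_true = (logits[0, -1, true_id] > logits[0, -1, false_id]).item()
    correct += int(pred_true == (a > b))

print(f"Baseline accuracy on {len(number_pairs)} pairs: {correct / len(number_pairs):.2f}")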
- Python 3.8+
- CUDA-capable GPU (recommended)
- 8GB+ RAM
# Clone the repository
git clone https://github.com/ashioyajotham/greater-than-circuit.git
cd greater-than-circuit
# Install dependencies
pip install -r requirements.txt
# Or install with development dependencies
pip install -e ".[dev]"
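After installing, a quick environment check confirms that the core libraries import and whether CUDA is visible (package names taken from the key libraries acknowledged further below):

import torch
import transformer_lens
import circuitsvis
import plotly

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())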
from src.model_setup import ModelSetup
# Test model loading
setup = ModelSetup()
model = setup.load_model()
setup.print_model_info()
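If you want to verify the environment without the project wrapper, ModelSetup is assumed to perform roughly the following load under the hood (a sketch using TransformerLens directly):

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small
print(model.cfg.n_layers, model.cfg.n_heads, model.cfg.d_model)  # 12 layers, 12 heads, d_model 768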
from src import ModelSetup, PromptGenerator, ActivationPatcher, CircuitAnalyzer
# Initialize components
setup = ModelSetup()
model = setup.load_model()
generator = PromptGenerator(seed=42)
patcher = ActivationPatcher(model)
analyzer = CircuitAnalyzer(model)
# Generate test data
pairs = generator.create_prompt_pairs(n_pairs=10)
clean_tokens = model.to_tokens(pairs[0][0].prompt_text + " ")
corrupted_tokens = model.to_tokens(pairs[0][1].prompt_text + " ")
# Run activation patching
results = patcher.comprehensive_patching(
corrupted_tokens=corrupted_tokens,
clean_tokens=clean_tokens,
component_types=["attn", "mlp"]
)
# Analyze results
components = analyzer.identify_circuit_components(results)
summary = analyzer.create_circuit_summary(results)
print(f"Identified {len(components)} critical components")
print(f"Circuit spans {summary['circuit_depth']} layers")
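How the patching effects above are scored is not shown here; a common metric in circuit analysis is the logit difference between the candidate answer tokens. The sketch below assumes the prompts end with the single-token answers " True" / " False" (an assumption about the prompt format) and reuses model, clean_tokens, and corrupted_tokens from above.

def logit_diff(logits):
    # Difference between the assumed answer-token logits at the final
    # position; positive values favor " True".
    true_id = model.to_single_token(" True")
    false_id = model.to_single_token(" False")
    return (logits[0, -1, true_id] - logits[0, -1, false_id]).item()

clean_score = logit_diff(model(clean_tokens))
corrupted_score = logit_diff(model(corrupted_tokens))
print(f"Logit diff (clean vs corrupted): {clean_score:.2f} vs {corrupted_score:.2f}")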
from src.visualization import CircuitVisualizer
visualizer = CircuitVisualizer()
# Plot patching results
fig1 = visualizer.plot_patching_results(results, save_path="results/patching.png")
# Create circuit diagram
fig2 = visualizer.plot_circuit_diagram(components, save_path="results/circuit.html")
# Generate interactive dashboard
fig3 = visualizer.create_summary_dashboard(summary, save_path="results/dashboard.html")
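Attention heatmaps can also be rendered directly with Circuitsvis; a sketch reusing model and clean_tokens from the previous example (layer 5 is an arbitrary choice, best viewed in a notebook):

import circuitsvis as cv

_, cache = model.run_with_cache(clean_tokens)
cv.attention.attention_patterns(
    tokens=model.to_str_tokens(clean_tokens),
    attention=cache["pattern", 5][0],  # [head, dest_pos, src_pos] for layer 5
)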
from src.circuit_validation import CircuitValidator
validator = CircuitValidator(model, generator, patcher, analyzer)
# Run full validation suite
validation_results = validator.run_comprehensive_validation(n_test_examples=200)
# Generate detailed report
report = validator.generate_validation_report(
validation_results,
save_path="results/validation_report.txt"
)
print(report)
# Generate specific test cases
generator = PromptGenerator(seed=42)
# Balanced dataset
examples = generator.generate_balanced_dataset(n_examples=1000, num_range=(1, 100))
# Edge cases
edge_cases = generator.generate_edge_cases()
# Custom template
custom_examples = generator.generate_basic_examples(
n_examples=100,
template_idx=1 # "Is X greater than Y?"
)
# Print statistics
generator.print_statistics(examples)
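For orientation, a hand-rolled version of a balanced dataset might look like the sketch below; the prompt template is an assumption, and PromptGenerator's actual templates may differ.

import random

random.seed(42)

def make_example(num_range=(1, 100)):
    # Draw two distinct numbers; the pair is unordered, so labels are
    # balanced in expectation.
    a, b = random.sample(range(*num_range), 2)
    return {"prompt": f"Is {a} greater than {b}? Answer:", "label": a > b}

examples = [make_example() for _ in range(1000)]
print(sum(e["label"] for e in examples), "True labels out of", len(examples))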
# Analyze specific attention heads
attention_patterns = analyzer.analyze_attention_patterns(
tokens=clean_tokens,
target_heads=[(5, 7), (3, 2), (7, 11)] # Specific (layer, head) pairs
)
# Compute activation attributions
attributions = analyzer.compute_activation_attribution(
clean_tokens=clean_tokens,
corrupted_tokens=corrupted_tokens
)
# Information flow analysis
flow_matrix = analyzer.find_information_flow(
tokens=clean_tokens,
source_components=["L3_mlp", "L5H7"],
target_components=["L7H2", "L9_mlp"]
)
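# Underneath these analyzer calls, every activation is available from a
# single cached forward pass with TransformerLens; a minimal sketch
# (the layer/head indices are illustrative, not confirmed circuit members):
_, cache = model.run_with_cache(clean_tokens)
pattern_l5 = cache["pattern", 5]  # attention patterns, [batch, head, dest_pos, src_pos]
head_out_l5 = cache["z", 5]       # per-head outputs, [batch, pos, head, d_head]
mlp_out_l3 = cache["mlp_out", 3]  # MLP output added to the residual stream, [batch, pos, d_model]
print(pattern_l5[0, 7].shape)     # e.g. layer 5, head 7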
# Test specific robustness
robustness_results = validator.validate_robustness(
base_examples=examples[:100],
perturbation_types=["number_range", "prompt_template"]
)
# Circuit necessity with custom threshold
necessity_result = validator.validate_circuit_necessity(
test_examples=examples[:50],
circuit_components=components,
ablation_type="mean_ablation"
)
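For reference, "mean_ablation" above corresponds to replacing a component's activation with its mean over a reference batch instead of zeroing it. A minimal TransformerLens sketch of the idea, reusing model, pairs, and clean_tokens from the earlier examples (the layer and head are illustrative; CircuitValidator is assumed to implement something along these lines):

from transformer_lens import utils

layer, head = 5, 7  # illustrative head
ref_tokens = model.to_tokens([p[0].prompt_text + " " for p in pairs])
_, ref_cache = model.run_with_cache(ref_tokens)
mean_z = ref_cache["z", layer].mean(dim=(0, 1))  # mean over batch and position

def mean_ablate_head(z, hook):
    # Replace one head's output with its reference mean at every position.
    z[:, :, head, :] = mean_z[head]
    return z

ablated_logits = model.run_with_hooks(
    clean_tokens,
    fwd_hooks=[(utils.get_act_name("z", layer), mean_ablate_head)],
)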
greater-than-circuit/
├── src/ # Core implementation
│ ├── __init__.py # Package initialization
│ ├── model_setup.py # Model loading & configuration
│ ├── prompt_design.py # Test case generation
│ ├── activation_patching.py # Patching experiments
│ ├── circuit_analysis.py # Circuit identification & analysis
│ ├── visualization.py # Plotting & visualization tools
│ └── circuit_validation.py # Validation & testing framework
├── notebooks/ # Jupyter notebooks for exploration
├── tests/ # Unit tests
├── results/ # Output directory for results
├── requirements.txt # Python dependencies
├── pyproject.toml # Project configuration
└── README.md # This file
- model_setup.py: Handles GPT-2 Small loading with TransformerLens
- prompt_design.py: Generates numerical comparison test cases
- activation_patching.py: Implements activation patching experiments
- circuit_analysis.py: Analyzes patching results to identify circuits
- visualization.py: Creates plots and interactive visualizations
- circuit_validation.py: Comprehensive testing and validation
Based on mechanistic interpretability research, we expect to find:
- Early Layer Processing: Token recognition and basic numerical encoding
- Middle Layer Comparison: Attention heads that compare numerical values
- Late Layer Integration: MLP layers that compute the final comparison result
- Output Projection: Final layers that map to True/False tokens
- Baseline Accuracy: 85-95% on balanced greater-than tasks
- Circuit Coverage: 15-25 critical components across 6-8 layers
- Attention Patterns: Specific heads attending to numerical tokens
- Robustness: 10-20% accuracy drop under perturbations
The project generates several types of visualizations:
- Patching Results Plot: Bar charts showing component importance
- Circuit Diagram: Interactive network showing component connections
- Attention Heatmaps: Visualization of attention patterns
- Layer Analysis: Charts showing layer-wise contributions
- Summary Dashboard: Comprehensive interactive overview
We welcome contributions to improve the analysis and extend the methodology!
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-analysis)
- Commit your changes (git commit -m 'Add amazing analysis')
- Push to the branch (git push origin feature/amazing-analysis)
- Open a Pull Request
- New Analysis Techniques: Novel interpretability methods
- Additional Validation: More comprehensive testing frameworks
- Visualization Improvements: Better plots and interactive tools
- Documentation: Examples, tutorials, and guides
- Performance Optimization: Faster patching and analysis methods
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Format code
black src/
flake8 src/
- "A Mathematical Framework for Transformer Circuits" - Anthropic
- "In-context Learning and Induction Heads" - Anthropic
- "Locating and Editing Factual Associations in GPT" - ROME paper
- "Interpretability in the Wild" - Various mechanistic interpretability papers
- TransformerLens Documentation: https://transformerlens.readthedocs.io/
- Neel Nanda's Blog: Mechanistic interpretability tutorials and insights
- Anthropic Interpretability Research: Circuit analysis methodologies
- TransformerLens: Core interpretability library
- Circuitsvis: Visualization tools for transformer circuits
- PyTorch: Deep learning framework
- Plotly: Interactive visualization library
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work in your research, please cite:
@software{greater_than_circuit_2025,
title={Reverse-Engineering the Greater Than Circuit in GPT-2 Small},
author={Ashioya Jotham Victor},
year={2025},
url={https://github.com/ashioyajotham/greater-than-circuit},
note={Built with TransformerLens by Neel Nanda}
}
- Start Small: Begin with a few prompt pairs and basic patching
- Validate Early: Check that your prompts work as expected
- Visualize Often: Use plots to understand your results
- Document Everything: Keep notes on interesting findings
- Token Alignment: Ensure clean and corrupted prompts have the same token structure (a quick check follows this list)
- Batch Effects: Test with different random seeds for robustness
- Memory Usage: Large patching experiments can be memory-intensive
- Interpretation: Always validate circuit hypotheses with multiple tests
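A quick way to catch the token-alignment pitfall mentioned above is to compare tokenizations directly (prompt wording illustrative):

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
clean_tokens = model.to_tokens("Is 87 greater than 23? Answer:")
corrupted_tokens = model.to_tokens("Is 12 greater than 23? Answer:")
assert clean_tokens.shape == corrupted_tokens.shape, "Prompts tokenize to different lengths"
print(model.to_str_tokens(clean_tokens))
print(model.to_str_tokens(corrupted_tokens))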
- Multi-token Analysis: Extend beyond single token positions
- Cross-model Comparison: Test circuits across different model sizes
- Causal Interventions: Use more sophisticated intervention techniques
- Mechanistic Hypotheses: Develop and test specific theories about computation
This research was conducted using the TransformerLens library and methodology developed by Neel Nanda and the mechanistic interpretability community.