
Reverse-Engineering the Greater Than Circuit

A mechanistic interpretability project to understand how GPT-2 Small performs numerical comparisons

Acknowledgments

This project builds upon the foundational research and methodologies developed by Neel Nanda and the broader mechanistic interpretability community. We are deeply grateful for:

  • Neel Nanda's pioneering work in transformer mechanistic interpretability and the development of TransformerLens
  • The TransformerLens library that made this analysis possible
  • Activation patching techniques refined through community research
  • Circuit analysis methodologies from the interpretability research community
  • The broader mechanistic interpretability research that provides the theoretical foundation for this work

Special acknowledgment to Neel Nanda for advancing the field of mechanistic interpretability and making tools accessible to researchers worldwide.

Overview

This project implements a comprehensive mechanistic interpretability analysis to reverse-engineer the "greater than" circuit in GPT-2 Small. Using activation patching techniques, we identify and analyze the specific neurons, attention heads, and pathways responsible for numerical comparison capabilities.

Objective

Identify and understand the circuit responsible for the "greater than" capability in GPT-2 Small, mapping out the complete computational pathway from input to output.

Key Research Questions

  1. Which specific components (attention heads, MLP layers) are crucial for greater than comparisons?
  2. How do these components interact to perform numerical reasoning?
  3. What is the information flow through the identified circuit?
  4. How robust is this circuit to different inputs and perturbations?

Research Themes

1. Circuit Identification

  • Activation Patching: Systematically replace activations to isolate critical components
  • Component Analysis: Identify attention heads, MLP layers, and residual connections
  • Layer-wise Investigation: Understand how processing evolves through model depth
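The activation-patching idea above can be illustrated with a toy two-layer "model" (a pure-Python sketch, not the project's ActivationPatcher API): cache an activation from a clean run, splice it into a corrupted run, and measure how much of the clean output is restored.

```python
# Toy illustration of activation patching. layer1/layer2 stand in for
# transformer components; the real project patches cached activations
# inside GPT-2 Small via TransformerLens hooks.

def layer1(x):
    return [v * 2 for v in x]

def layer2(h):
    return sum(h)

def run(x, patch=None):
    h = layer1(x)
    if patch is not None:  # splice in cached clean activations
        h = patch
    return layer2(h)

clean, corrupted = [3, 1], [0, 1]
clean_h1 = layer1(clean)                      # cache the clean activation
clean_out = run(clean)                        # 8
corrupted_out = run(corrupted)                # 2
patched_out = run(corrupted, patch=clean_h1)  # 8: clean behavior restored

# Normalized patching effect: 1.0 = patching this component fully
# recovers clean behavior, 0.0 = patching it changes nothing.
effect = (patched_out - corrupted_out) / (clean_out - corrupted_out)
print(effect)  # 1.0
```

Components whose patch produces a large effect are candidates for the circuit; near-zero effects suggest the component is not causally involved.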

2. Mechanistic Understanding

  • Information Flow: Trace how numerical information propagates through the network
  • Attention Patterns: Analyze what tokens different heads attend to during comparison
  • Feature Detection: Understand what features each component extracts

3. Circuit Validation

  • Necessity Testing: Verify that identified components are essential for the task
  • Sufficiency Analysis: Determine if the identified circuit is complete
  • Robustness Evaluation: Test circuit behavior under various conditions

4. Generalization Analysis

  • Cross-task Transfer: How does the circuit behave on related numerical tasks?
  • Scale Invariance: Does the circuit work across different number ranges?
  • Template Robustness: Performance across different prompt formats

Methodology Workflow

graph TB
    A[Model Setup<br/>GPT-2 Small + TransformerLens] --> B[Prompt Design<br/>Number pairs + True/False labels]
    B --> C[Activation Patching<br/>Systematic component isolation]
    C --> D[Circuit Analysis<br/>Component identification & ranking]
    D --> E[Information Flow Analysis<br/>Attention patterns & pathways]
    E --> F[Circuit Validation<br/>Necessity & sufficiency testing]
    F --> G[Visualization<br/>Interactive diagrams & plots]
    G --> H[Results & Insights<br/>Complete circuit understanding]
    
    style A fill:#e1f5fe
    style C fill:#f3e5f5
    style D fill:#e8f5e8
    style F fill:#fff3e0
    style H fill:#fce4ec

Detailed Process Flow

Phase 1: Setup & Preparation

  1. Model Loading: Initialize GPT-2 Small with interpretability-friendly settings
  2. Prompt Generation: Create balanced datasets of numerical comparisons
  3. Baseline Testing: Establish model performance on the task
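The prompt-generation step can be sketched as follows (make_examples is a hypothetical stand-in for the project's PromptGenerator, shown only to illustrate the balanced True/False construction):

```python
import random

# Hypothetical sketch of balanced "greater than" prompt generation:
# sample number pairs, skip ties, and label each prompt True/False.
def make_examples(n, num_range=(1, 100), seed=42):
    rng = random.Random(seed)  # fixed seed for reproducible datasets
    examples = []
    while len(examples) < n:
        a, b = rng.randint(*num_range), rng.randint(*num_range)
        if a == b:
            continue  # ties have no True/False answer
        examples.append((f"Is {a} greater than {b}?", a > b))
    return examples

examples = make_examples(4)
for prompt, label in examples:
    print(prompt, label)
```

Keeping the dataset balanced between True and False labels (and spanning the full number range) matters later, since a skewed baseline makes patching effects harder to interpret.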

Phase 2: Circuit Discovery

  1. Comprehensive Patching: Test all major components across all positions
  2. Effect Ranking: Identify components with highest impact on task performance
  3. Component Clustering: Group related components by function and layer
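The effect-ranking step can be sketched like this (the effect values and threshold are made up for illustration; component names follow the L{layer}H{head} / L{layer}_mlp convention used later in this README):

```python
# Hypothetical effect ranking: given per-component patching effects,
# keep components above a threshold, sorted by effect magnitude.
effects = {"L5H7": 0.42, "L3_mlp": 0.08, "L7H2": 0.31, "L9_mlp": 0.02}
threshold = 0.1  # illustrative cutoff, tuned per experiment in practice

ranked = sorted(
    ((c, e) for c, e in effects.items() if abs(e) >= threshold),
    key=lambda item: -abs(item[1]),  # largest effects first
)
print(ranked)  # [('L5H7', 0.42), ('L7H2', 0.31)]
```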

Phase 3: Deep Analysis

  1. Attention Analysis: Study attention patterns in critical heads
  2. Activation Inspection: Examine internal representations
  3. Information Flow: Map connections between components

Phase 4: Validation & Testing

  1. Ablation Studies: Remove components to test necessity
  2. Robustness Testing: Evaluate performance under perturbations
  3. Generalization Testing: Test on related numerical tasks
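The mean-ablation variant used in necessity testing (see ablation_type="mean_ablation" in the validation example below) can be sketched in miniature: instead of zeroing a component's output, replace it with its average over a batch, which removes example-specific information while keeping activations in-distribution.

```python
# Mean-ablation sketch on a toy batch of activation vectors.
acts = [[1.0, 3.0], [2.0, 5.0], [3.0, 1.0]]  # batch of 3, dim 2

# Per-dimension mean across the batch.
mean = [sum(col) / len(acts) for col in zip(*acts)]

# Every example now receives the same mean activation.
ablated = [list(mean) for _ in acts]
print(mean)  # [2.0, 3.0]
```

If accuracy collapses when a component is mean-ablated, that component carries information the task needs, evidence of necessity.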

Installation

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (recommended)
  • 8GB+ RAM

Install Dependencies

# Clone the repository
git clone https://github.com/ashioyajotham/greater-than-circuit.git
cd greater-than-circuit

# Install dependencies
pip install -r requirements.txt

# Or install with development dependencies
pip install -e ".[dev]"

Verify Installation

from src.model_setup import ModelSetup

# Test model loading
setup = ModelSetup()
model = setup.load_model()
setup.print_model_info()

Quick Start

1. Basic Circuit Analysis

from src import ModelSetup, PromptGenerator, ActivationPatcher, CircuitAnalyzer

# Initialize components
setup = ModelSetup()
model = setup.load_model()
generator = PromptGenerator(seed=42)
patcher = ActivationPatcher(model)
analyzer = CircuitAnalyzer(model)

# Generate test data
pairs = generator.create_prompt_pairs(n_pairs=10)
clean_tokens = model.to_tokens(pairs[0][0].prompt_text + " ")
corrupted_tokens = model.to_tokens(pairs[0][1].prompt_text + " ")

# Run activation patching
results = patcher.comprehensive_patching(
    corrupted_tokens=corrupted_tokens,
    clean_tokens=clean_tokens,
    component_types=["attn", "mlp"]
)

# Analyze results
components = analyzer.identify_circuit_components(results)
summary = analyzer.create_circuit_summary(results)

print(f"Identified {len(components)} critical components")
print(f"Circuit spans {summary['circuit_depth']} layers")

2. Generate Visualizations

from src.visualization import CircuitVisualizer

visualizer = CircuitVisualizer()

# Plot patching results
fig1 = visualizer.plot_patching_results(results, save_path="results/patching.png")

# Create circuit diagram
fig2 = visualizer.plot_circuit_diagram(components, save_path="results/circuit.html")

# Generate interactive dashboard
fig3 = visualizer.create_summary_dashboard(summary, save_path="results/dashboard.html")

3. Comprehensive Validation

from src.circuit_validation import CircuitValidator

validator = CircuitValidator(model, generator, patcher, analyzer)

# Run full validation suite
validation_results = validator.run_comprehensive_validation(n_test_examples=200)

# Generate detailed report
report = validator.generate_validation_report(
    validation_results, 
    save_path="results/validation_report.txt"
)
print(report)

Detailed Usage

Custom Prompt Generation

# Generate specific test cases
generator = PromptGenerator(seed=42)

# Balanced dataset
examples = generator.generate_balanced_dataset(n_examples=1000, num_range=(1, 100))

# Edge cases
edge_cases = generator.generate_edge_cases()

# Custom template
custom_examples = generator.generate_basic_examples(
    n_examples=100, 
    template_idx=1  # "Is X greater than Y?"
)

# Print statistics
generator.print_statistics(examples)

Advanced Circuit Analysis

# Analyze specific attention heads
attention_patterns = analyzer.analyze_attention_patterns(
    tokens=clean_tokens,
    target_heads=[(5, 7), (3, 2), (7, 11)]  # Specific (layer, head) pairs
)

# Compute activation attributions
attributions = analyzer.compute_activation_attribution(
    clean_tokens=clean_tokens,
    corrupted_tokens=corrupted_tokens
)

# Information flow analysis
flow_matrix = analyzer.find_information_flow(
    tokens=clean_tokens,
    source_components=["L3_mlp", "L5H7"],
    target_components=["L7H2", "L9_mlp"]
)

Custom Validation Tests

# Test specific robustness
robustness_results = validator.validate_robustness(
    base_examples=examples[:100],
    perturbation_types=["number_range", "prompt_template"]
)

# Circuit necessity with custom threshold
necessity_result = validator.validate_circuit_necessity(
    test_examples=examples[:50],
    circuit_components=components,
    ablation_type="mean_ablation"
)

Project Structure

greater-than-circuit/
├── src/                          # Core implementation
│   ├── __init__.py              # Package initialization
│   ├── model_setup.py           # Model loading & configuration
│   ├── prompt_design.py         # Test case generation
│   ├── activation_patching.py   # Patching experiments
│   ├── circuit_analysis.py      # Circuit identification & analysis
│   ├── visualization.py         # Plotting & visualization tools
│   └── circuit_validation.py    # Validation & testing framework
├── notebooks/                    # Jupyter notebooks for exploration
├── tests/                        # Unit tests
├── results/                      # Output directory for results
├── requirements.txt              # Python dependencies
├── pyproject.toml               # Project configuration
└── README.md                    # This file

Core Modules

  • model_setup.py: Handles GPT-2 Small loading with TransformerLens
  • prompt_design.py: Generates numerical comparison test cases
  • activation_patching.py: Implements activation patching experiments
  • circuit_analysis.py: Analyzes patching results to identify circuits
  • visualization.py: Creates plots and interactive visualizations
  • circuit_validation.py: Comprehensive testing and validation

Results and Findings

Expected Circuit Structure

Based on mechanistic interpretability research, we expect to find:

  1. Early Layer Processing: Token recognition and basic numerical encoding
  2. Middle Layer Comparison: Attention heads that compare numerical values
  3. Late Layer Integration: MLP layers that compute the final comparison result
  4. Output Projection: Final layers that map to True/False tokens

Typical Performance Metrics

  • Baseline Accuracy: 85-95% on balanced greater than tasks
  • Circuit Coverage: 15-25 critical components across 6-8 layers
  • Attention Patterns: Specific heads attending to numerical tokens
  • Robustness: 10-20% accuracy drop under perturbations

Visualization Outputs

The project generates several types of visualizations:

  1. Patching Results Plot: Bar charts showing component importance
  2. Circuit Diagram: Interactive network showing component connections
  3. Attention Heatmaps: Visualization of attention patterns
  4. Layer Analysis: Charts showing layer-wise contributions
  5. Summary Dashboard: Comprehensive interactive overview

Contributing

We welcome contributions to improve the analysis and extend the methodology!

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-analysis)
  3. Commit your changes (git commit -m 'Add amazing analysis')
  4. Push to the branch (git push origin feature/amazing-analysis)
  5. Open a Pull Request

Areas for Contribution

  • New Analysis Techniques: Novel interpretability methods
  • Additional Validation: More comprehensive testing frameworks
  • Visualization Improvements: Better plots and interactive tools
  • Documentation: Examples, tutorials, and guides
  • Performance Optimization: Faster patching and analysis methods

Development Setup

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black src/
flake8 src/

References and Further Reading

Key Papers

  • "A Mathematical Framework for Transformer Circuits" - Anthropic
  • "In-context Learning and Induction Heads" - Anthropic
  • "Locating and Editing Factual Associations in GPT" - ROME paper
  • "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small" - Wang et al.

Related Work

  • TransformerLens Documentation: https://transformerlens.readthedocs.io/
  • Neel Nanda's Blog: Mechanistic interpretability tutorials and insights
  • Anthropic Interpretability Research: Circuit analysis methodologies

Tools and Libraries

  • TransformerLens: Core interpretability library
  • CircuitsVis: Visualization tools for transformer circuits
  • PyTorch: Deep learning framework
  • Plotly: Interactive visualization library

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this work in your research, please cite:

@software{greater_than_circuit_2025,
  title={Reverse-Engineering the Greater Than Circuit in GPT-2 Small},
  author={Ashioya Jotham Victor},
  year={2025},
  url={https://github.com/ashioyajotham/greater-than-circuit},
  note={Built with TransformerLens by Neel Nanda}
}

Tips for Success

Getting Started

  1. Start Small: Begin with a few prompt pairs and basic patching
  2. Validate Early: Check that your prompts work as expected
  3. Visualize Often: Use plots to understand your results
  4. Document Everything: Keep notes on interesting findings

Common Pitfalls

  • Token Alignment: Ensure clean and corrupted prompts have the same token structure
  • Batch Effects: Test with different random seeds for robustness
  • Memory Usage: Large patching experiments can be memory-intensive
  • Interpretation: Always validate circuit hypotheses with multiple tests
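The token-alignment pitfall above can be checked mechanically. A minimal sketch (using whitespace splitting as a stand-in for a real tokenizer, which may split numbers differently): verify the two prompts have the same length and differ only at the number positions being swapped.

```python
# Stand-in tokenizer for illustration; a real check would use the
# model's tokenizer (e.g. model.to_str_tokens in TransformerLens).
def to_tokens(text):
    return text.split()

def aligned(clean, corrupted):
    ct, xt = to_tokens(clean), to_tokens(corrupted)
    if len(ct) != len(xt):
        return False  # different lengths break position-wise patching
    # Positions allowed to differ: only the numbers being compared.
    diffs = [i for i, (a, b) in enumerate(zip(ct, xt)) if a != b]
    return all(ct[i].isdigit() and xt[i].isdigit() for i in diffs)

print(aligned("Is 57 greater than 23 ?", "Is 23 greater than 57 ?"))    # True
print(aligned("Is 57 greater than 23 ?", "Is five greater than 23 ?"))  # False
```

Running this check before every patching experiment catches misaligned prompt pairs that would otherwise silently corrupt position-wise results.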

Advanced Techniques

  • Multi-token Analysis: Extend beyond single token positions
  • Cross-model Comparison: Test circuits across different model sizes
  • Causal Interventions: Use more sophisticated intervention techniques
  • Mechanistic Hypotheses: Develop and test specific theories about computation

This research was conducted using the TransformerLens library and methodology developed by Neel Nanda and the mechanistic interpretability community.
