Open Problems in Mechanistic Interpretability
This is my summary of the report on Open Problems in Mechanistic Interpretability.
Difference Between Interpretability and Mechanistic Interpretability
- Interpretability refers to methods for understanding neural networks from the inside out.
- Mechanistic Interpretability is a specific approach to interpretability that focuses on understanding the internal mechanisms of a neural network.
- Interpretability by Design – creating models that are inherently more interpretable (e.g., decision trees, rule-based systems).
- Why Did the Model Make This Decision? (the question most interpretability methods address)
- How Does the Model Solve a General Class of Problems? (the question Mechanistic Interpretability addresses)
The primary aim is to decompose a neural network and study its components in isolation. This helps us explain how neural networks generalize and make decisions at a fundamental level. By understanding the internal computations of neural networks, we can deploy them safely in safety-critical and ethically sensitive domains. Additionally, it allows us to create and study artificial minds with a level of access and control that is not possible with human minds.
Two widely used methods of performing mechanistic interpretability are:
- Reverse Engineering: decompose a network into components and then identify the role of each component.
- Why reverse engineer? Humans and neural networks often use different representations. For example, while humans solve modular addition with ordinary addition followed by taking the remainder, a small transformer model was found to learn a Fourier-transform-based strategy instead [1].
- Steps of Performing Reverse Engineering:
- Decomposition of Network into Smaller Components
- Individual neurons and attention heads often exhibit polysemanticity, and some research suggests that representations in language models can span multiple layers. Therefore, decomposing neural network representations solely on the basis of neurons, attention heads, or layers may not be a natural or effective approach. A common strategy is to feed the model a range of unlabeled inputs, collect the resulting hidden activations, and then apply unsupervised dimensionality reduction to those activations (a minimal sketch of this workflow follows this subsection). However, if neurons are in superposition (more features are encoded than there are neurons), traditional dimensionality reduction techniques, such as Principal Component Analysis (PCA), may fail.
- To address the challenge of superposition, methods like Sparse Dictionary Learning (SDL) are used. SDL can represent more features than there are dimensions, as long as each feature activates sparsely. Various methods fall under this umbrella, such as Sparse Autoencoders (SAEs), Transcoders, and Crosscoders. The primary goal in SDL is to train the dictionary elements to align with the ‘feature directions’ in the model’s activations. The training objective is typically to reconstruct the input activations, the next layer's activations, or the activations of several layers simultaneously (a minimal SAE sketch follows the limitations list below).
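To make the decomposition workflow above concrete, here is a minimal sketch of collecting hidden activations with a forward hook and applying PCA to them. The toy model, the hooked layer, and the random "unlabeled inputs" are illustrative stand-ins rather than anything from the report, and, as noted above, a linear reduction like PCA can fall short when features are in superposition.

```python
# Minimal sketch: collect hidden activations from a toy model and apply PCA.
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

torch.manual_seed(0)

# Toy 2-layer MLP standing in for "some trained network".
model = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

activations = []

def save_hidden(module, inputs, output):
    # Store the post-ReLU hidden activations for later analysis.
    activations.append(output.detach())

hook = model[1].register_forward_hook(save_hidden)  # hook the ReLU output

# "Unlabeled inputs": random data here; in practice, a corpus of real inputs.
with torch.no_grad():
    for _ in range(20):
        model(torch.randn(64, 32))
hook.remove()

acts = torch.cat(activations).numpy()      # (n_samples, hidden_dim)
pca = PCA(n_components=10).fit(acts)       # unsupervised dimensionality reduction
print("variance explained:", pca.explained_variance_ratio_.round(3))
```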
Limitations of Sparse Dictionary Learning (SDL):
- High reconstruction error.
- Does not provide clear information about the geometry of features.
- Primarily works with activations rather than providing information on the actual mechanisms behind the neural network's decision-making process.
- Scaling SDL methods to large models can be computationally expensive. The exact scaling behavior (whether it’s sub- or supra-linear) remains unclear.
- Sparsity in the model may not be a good proxy for interpretability.
- Assumes that features are represented in a linear way, which may not align with the non-linear nature of most neural network models.
- Might not capture the exact features we (humans) want to interpret, but this could reflect a discrepancy between our understanding of concepts and how the model processes them. In this case, the SDLs may still be functioning as intended.
- Cannot be straightforwardly applied to all types of architectures, limiting its universality.
- Given the practical and conceptual challenges with SDL, an important question arises: how can we decompose a network into atomic units? After the considerable effort invested in SDL approaches, it is clear that conceptual progress beyond the notion of superposition is crucial to advancing neural network decomposition.
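To make the SDL recipe concrete despite these limitations, here is a minimal sparse-autoencoder sketch: an overcomplete dictionary is trained to reconstruct activations under an L1 sparsity penalty. The dictionary size, the sparsity coefficient, and the random stand-in activations are illustrative choices, not values from the report.

```python
# Minimal sparse autoencoder (SAE) sketch: learn an overcomplete dictionary of
# activation directions with an L1 sparsity penalty on the feature coefficients.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)  # columns act as dictionary elements

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstruction of the original activations
        return x_hat, f

d_model, d_dict, l1_coeff = 128, 512, 1e-3   # overcomplete: d_dict > d_model
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

acts = torch.randn(4096, d_model)  # stand-in for collected model activations

for step in range(100):
    batch = acts[torch.randint(0, len(acts), (256,))]
    x_hat, f = sae(batch)
    loss = ((x_hat - batch) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

A transcoder or crosscoder follows the same recipe but is trained to reconstruct the next layer's activations, or the activations of several layers at once, rather than its own input.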
- Interpretation: Formulating a Hypothesis about the Function of Each Component
- After decomposing a network into components, the next step is to hypothesize the functional role of each component.
- Two Approaches for Formulating Hypotheses:
- What Causes Their Activation?
- Highly activating dataset examples
- Potential issues: most of the problems with highly activating dataset examples stem from the fact that they provide only correlational explanations for a component's activation, not causal ones.
- Human Bias: This method relies on human prior beliefs, which may lead interpreters to project human understanding onto models that could be using completely unfamiliar concepts.
- Interpretability Illusions: There is a risk of constructing interpretability illusions, where plausible explanations based on dataset examples are mistaken for fundamental truths.
- Highly activating dataset examples cannot be solely relied upon to identify the basic units of computation in neural networks.
- Attribution Methods: many gradient-based methods identify only a first-order approximation of the ideal attribution, which is sometimes inaccurate. Developing efficient and accurate attribution methods remains an open problem in mechanistic interpretability (a sketch combining dataset examples and gradient attribution follows this list).
- Feature Synthesis: a strategy that integrates highly activating dataset examples and gradient-based attribution methods to form more comprehensive hypotheses.
- What happens after that component has been activated? (to be written)
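The sketch below illustrates the two hypothesis-forming tools mentioned in this list: ranking dataset examples by how strongly they activate a chosen component, and a simple gradient-times-activation attribution (a first-order estimate of the component's contribution to an output). The toy model, the chosen neuron index, and the random inputs are hypothetical stand-ins.

```python
# Sketch: (1) highly activating dataset examples and (2) a first-order,
# gradient-times-activation attribution for a single hidden neuron.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
neuron = 7                      # hypothetical component under study
inputs = torch.randn(200, 32)   # stand-in for a corpus of real inputs

# (1) Highly activating dataset examples: correlational evidence only.
with torch.no_grad():
    hidden = torch.relu(model[0](inputs))        # (200, 64) hidden activations
top = hidden[:, neuron].topk(k=5)
print("top-activating example indices:", top.indices.tolist())

# (2) Gradient-based attribution: d(logit)/d(hidden) * hidden, a first-order
# estimate of how much the neuron contributed to a chosen output logit.
hidden = torch.relu(model[0](inputs)).detach().requires_grad_(True)
logits = model[2](hidden)
target_logit = logits[:, 0].sum()                # attribute an arbitrary output logit
grads = torch.autograd.grad(target_logit, hidden)[0]
attribution = (grads * hidden)[:, neuron]        # per-example contribution estimate
print("mean attribution for neuron", neuron, ":", attribution.mean().item())
```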
- Validation of description: test the hypothesis, for example by intervening on the component (ablating or patching its activation) and checking whether the model's behavior changes as predicted.
- Concept-based interpretability: find a set of concepts and then see which components contribute to the activation of those concepts. A concept can be thought of as a feature of the data (a minimal linear-probe sketch follows below).
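As a sketch of the concept-based approach, the snippet below fits a linear probe on hidden activations for a toy "concept" label and inspects which neurons the probe weights most heavily. The model, the synthetic concept, and the probe setup are illustrative stand-ins; high probe accuracy is still only correlational evidence that the components carry the concept.

```python
# Minimal concept-probe sketch: fit a linear classifier on hidden activations
# for a labeled concept, then look at which neurons it relies on most.
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

inputs = torch.randn(500, 32)
concept_labels = (inputs[:, 0] > 0).long().numpy()   # toy concept: sign of input feature 0

with torch.no_grad():
    hidden = torch.relu(model[0](inputs)).numpy()    # (500, 64) hidden activations

probe = LogisticRegression(max_iter=1000).fit(hidden, concept_labels)
print("probe accuracy:", probe.score(hidden, concept_labels))

# Neurons with the largest |weight| contribute most to the probe's concept direction.
top_neurons = np.argsort(-np.abs(probe.coef_[0]))[:5]
print("neurons most aligned with the concept:", top_neurons.tolist())
```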
Open Questions:
- How valid is the superposition hypothesis? Is it fundamentally valid, or merely pragmatically useful?
- Theories of Generalization: given the field’s objective of understanding the learned structures behind neural networks' generalization behavior, exploring theories of why networks generalize the way they do seems promising. Stronger theoretical foundations may also be essential for developing models that are intrinsically decomposable by design.
References
- [1] Nanda et al., "Progress measures for grokking via mechanistic interpretability" (2023).