Polysemantic Neurons

Published on: October 6, 2024 | Updated: April 3, 2025

While conducting a literature review, I came across a term that caught my attention, and I thought it would be worthwhile to introduce the relevant terminology here.

When neurons in a neural network correspond to more than one feature, they are called polysemantic neurons. Conversely, if they correspond to only a single feature, they are referred to as monosemantic neurons.
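
To make the distinction concrete, here is a tiny numpy sketch with a hand-built, entirely hypothetical weight matrix: neuron 0 fires for two unrelated features (polysemantic), while neuron 1 fires for exactly one (monosemantic).

```python
import numpy as np

# Hypothetical 3-feature -> 2-neuron linear layer.
# Neuron 0 responds to "curve" and "car wheel" (polysemantic);
# neuron 1 responds only to "dog ear" (monosemantic).
W = np.array([
    [1.0, 0.0],   # feature 0: curve     -> neuron 0
    [0.9, 0.0],   # feature 1: car wheel -> neuron 0 (again!)
    [0.0, 1.0],   # feature 2: dog ear   -> neuron 1
])

for i, name in enumerate(["curve", "car wheel", "dog ear"]):
    x = np.zeros(3)
    x[i] = 1.0                      # one-hot "feature present" input
    acts = np.maximum(x @ W, 0.0)   # ReLU activations of the 2 neurons
    print(f"{name:>10}: neuron activations = {acts}")
```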

The word 'feature' wasn't easy for me to decode at first. While features can be described as properties interpretable by humans, there is no guarantee that every feature is understandable: what appears to be a random signal to a human could represent an important piece of information that human knowledge has simply failed to decode. Humans and neural networks often use different representations; for example, while humans solve modular addition using simple carries, a small transformer model trained on the task learned a Fourier transform strategy instead. One definition of a feature in the context of neural networks that I find fulfilling is:

Properties of the input which a sufficiently large neural network will reliably dedicate a neuron to representing. For example, curve detectors appear to reliably occur across sufficiently sophisticated vision models, and so are a feature. [1]

One interesting finding is that polysemantic neurons tend to occur when the input data exhibits high kurtosis or sparsity, and that their prevalence varies with the type of architecture used [2]. Features of similar importance are often stored in superposition, whereas highly important features are more commonly assigned a dedicated neuron.
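
This sparsity effect can be reproduced in a toy model in the spirit of [2]: compress more features than dimensions through a small linear bottleneck and train it to reconstruct sparse inputs. Below is a minimal PyTorch sketch; the feature count, sparsity level, and training schedule are my own assumptions, not the paper's exact setup.

```python
import torch

torch.manual_seed(0)
n_features, n_hidden = 5, 2   # more features than hidden dimensions
sparsity = 0.9                # assumed: probability that a feature is absent

# Toy model from the superposition literature: x_hat = ReLU(W^T W x + b)
W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse synthetic data: each feature is active with probability 1 - sparsity.
    x = torch.rand(256, n_features) * (torch.rand(256, n_features) > sparsity)
    x_hat = torch.relu(x @ W.T @ W + b)   # squeeze through 2 dimensions
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Columns of W are the directions assigned to each feature in the hidden space.
dirs = torch.nn.functional.normalize(W.detach(), dim=0)
print(dirs.T @ dirs)  # large off-diagonal entries = features sharing directions
```

With dense inputs (sparsity near 0), the model keeps only the most important features and drops the rest; as sparsity grows, the off-diagonal entries grow, i.e., features get packed into superposition.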

Polysemanticity and Superposition


Fig 1: Visualization of polysemanticity and superposition state in neural networks

Polysemanticity is the observed property that learned features within a neural network are not basis-aligned: individual neurons often encode multiple distinct concepts. Superposition, on the other hand, is the phenomenon in which a layer attempts to represent more features than its dimensionality allows. Packing features into such a limited space inevitably leads to a high degree of polysemanticity, so superposition can be understood as the mechanism that drives polysemanticity to an extreme.
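
One way to see why such packing is possible at all is that random unit vectors in a high-dimensional space are nearly orthogonal, so a layer can host many more feature directions than it has neurons, at the cost of small interference between them. A quick numpy illustration (the dimension and feature counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512  # n feature directions packed into d dimensions (n >> d)

V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit feature directions

G = V @ V.T                                    # pairwise cosine similarities
off = np.abs(G[~np.eye(n, dtype=bool)])        # off-diagonal entries only
print(f"max interference: {off.max():.3f}, mean: {off.mean():.3f}")
# Random directions in high dimensions are nearly orthogonal, so many more
# features than neurons can coexist with only small interference:
# this is the geometry behind superposition.
```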

Circuits

Circuits in neural networks are subgraphs responsible for performing a single task: interconnected neurons that process input signals to produce an output. To extract such circuits, we can use techniques like Layer-wise Relevance Propagation (LRP), which propagates the model's prediction backward through the network and helps identify which neurons and input features are most relevant to that prediction.
Fig 2: Layer-wise Relevance Propagation (LRP) in action. (I forgot the website name but will try to remember and cite it.)
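
As a rough sketch of how LRP works (real implementations use different propagation rules per layer type and also handle biases), here is the basic epsilon rule applied to a tiny two-layer ReLU network with made-up weights:

```python
import numpy as np

def lrp_epsilon(activations, weights, relevance, eps=1e-6):
    """One LRP-epsilon backward step through a linear layer (bias ignored).

    activations: (d_in,)         inputs a_j to the layer
    weights:     (d_in, d_out)   weight matrix w_jk
    relevance:   (d_out,)        relevance R_k arriving from the layer above
    """
    z = activations @ weights           # pre-activations z_k
    z = z + eps * np.sign(z)            # epsilon stabilizer for small z_k
    s = relevance / z                   # relevance per unit of z_k
    return activations * (weights @ s)  # R_j = a_j * sum_k w_jk * s_k

# Tiny 2-layer ReLU network; weights are random, purely for illustration.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((3, 2))

x = rng.standard_normal(4)
h = np.maximum(x @ W1, 0.0)   # hidden activations
y = h @ W2                    # output logits

R_out = np.zeros(2)
R_out[y.argmax()] = y.max()   # start from the predicted class
R_hidden = lrp_epsilon(h, W2, R_out)
R_input = lrp_epsilon(x, W1, R_hidden)
print("input relevances:", R_input)  # which inputs mattered for the prediction
```

The rule distributes each unit's relevance to its inputs in proportion to their contribution a_j * w_jk to the pre-activation, with eps stabilizing near-zero denominators, so total relevance is approximately conserved from layer to layer.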

Explainable AI demos: Here

References