Developmental Interpretability

Published on: June 10, 2024

Phase transition: When the information from a new data sample is incorporated into the network weights, the posterior can shift drastically. These phase transitions can be thought of as a form of internal model selection.

Fig 1: Phase transition visualization in neural networks [1]
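
To make the "internal model selection" picture above more concrete, here is a toy sketch in Python. It is not taken from [1]. It assumes the standard asymptotic form of the Bayesian free energy from singular learning theory, roughly n * loss + complexity * log(n), and compares two hypothetical phases whose loss and complexity values are invented purely for illustration. The phase with the lower free energy holds most of the posterior mass, so the preferred phase can switch as the number of samples n grows.

```python
import math


def free_energy(n, loss, complexity):
    """Asymptotic free energy: n * loss + complexity * log(n).

    Lower free energy means more posterior mass on that phase.
    """
    return n * loss + complexity * math.log(n)


def posterior_weight_of_b(n, phase_a, phase_b):
    """Posterior probability of phase B when only phases A and B compete."""
    gap = free_energy(n, *phase_b) - free_energy(n, *phase_a)
    # Two-way softmax over negative free energies; the clamp avoids overflow.
    return 1.0 / (1.0 + math.exp(min(gap, 700.0)))


def first_n_preferring_b(phase_a, phase_b, max_n=100_000):
    """Smallest sample count at which phase B holds more than half the mass."""
    for n in range(2, max_n + 1):
        if posterior_weight_of_b(n, phase_a, phase_b) > 0.5:
            return n
    return None


# Hypothetical phases, given as (average loss per sample, effective complexity).
PHASE_A = (0.50, 2.0)   # simpler, but fits the data less well
PHASE_B = (0.45, 10.0)  # fits better, but pays a larger complexity cost

if __name__ == "__main__":
    n_star = first_n_preferring_b(PHASE_A, PHASE_B)
    print("posterior first prefers phase B at n =", n_star)
    for n in (10, 100, n_star // 2, n_star, 2 * n_star):
        print(n, posterior_weight_of_b(n, PHASE_A, PHASE_B))
```

With few samples the complexity term dominates and the simpler phase A wins. Once enough data has accumulated, the accuracy term takes over and the posterior weight moves to phase B over a comparatively narrow range of n, which is the sense in which the transition acts like a model selection step.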

Developmental Interpretability

Developmental interpretability studies how structure forms and evolves inside neural networks during training. It helps explain how networks develop their internal representations and organize information as training progresses.

"We term this developmental interpretability because of the parallel with developmental biology, which aims to understand the final state of a different class of complex self-assembling systems (living organisms) by analyzing the key steps in development from an embryonic state." [1]

The goal of developmental interpretability in the context of alignment is to:

advance the science of detecting when structural changes happen during training,
localize these changes to a subset of the weights, and
give the changes their proper context within the broader set of computational structures in the current state of the network.
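
As a rough illustration of the "localize" goal above, the sketch below ranks parameter groups by how far they moved between two training checkpoints. This is a naive baseline written for this post, not the method proposed in [1]. The checkpoint contents, the group names (`embed`, `attn.0`, `mlp.0`), and the choice of relative L2 change as the measure are all assumptions made for illustration.

```python
import math


def relative_change(before, after):
    """Relative L2 change between two flat lists of weights."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    norm = math.sqrt(sum(b ** 2 for b in before))
    return diff / (norm + 1e-12)  # small constant guards against division by zero


def localize_change(ckpt_before, ckpt_after):
    """Rank parameter groups by how much they moved, largest change first."""
    changes = {
        name: relative_change(ckpt_before[name], ckpt_after[name])
        for name in ckpt_before
    }
    return sorted(changes.items(), key=lambda item: item[1], reverse=True)


# Hypothetical checkpoints: parameter group name -> flattened weights.
ckpt_step_100 = {
    "embed": [1.0, -1.0, 0.5],
    "attn.0": [0.2, 0.1, -0.3],
    "mlp.0": [0.4, 0.4, 0.4],
}
ckpt_step_200 = {
    "embed": [1.0, -1.0, 0.5],
    "attn.0": [0.9, -0.6, 0.3],
    "mlp.0": [0.41, 0.39, 0.4],
}

if __name__ == "__main__":
    for name, change in localize_change(ckpt_step_100, ckpt_step_200):
        print(f"{name}: {change:.3f}")
```

In this toy data almost all of the movement sits in `attn.0`, so that group would be the first place to look. A large weight change is only a crude proxy for a structural change, so a ranking like this is a starting point for investigation and still needs the contextual analysis described in the third goal.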

References

[1] Towards Developmental Interpretability