Developmental Interpretability

Published on: June 10, 2024
Phase transition: When the information from a new data sample is incorporated into the network weights, the posterior can shift drastically. These phase transitions can be thought of as a form of internal model selection.
Fig 1: Phase transition visualization in neural networks [1]
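
One way to picture this internal model selection is to compare the free energy of two candidate regions of weight space ("phases") as data accumulates. The sketch below is only an illustration under stated assumptions: it uses the asymptotic free-energy form from singular learning theory, F_n ≈ n·L + λ·log n, and the phase names, losses, and complexity values are hypothetical numbers chosen to make the switch visible, not results from [1].

```python
import math


def free_energy(n, loss, complexity):
    """Assumed asymptotic free energy of one phase: n * L + lambda * log(n)."""
    return n * loss + complexity * math.log(n)


def posterior_over_phases(n, phases):
    """Posterior mass of each phase, proportional to exp(-free energy)."""
    energies = {name: free_energy(n, *params) for name, params in phases.items()}
    lowest = min(energies.values())  # subtract the minimum for numerical stability
    weights = {name: math.exp(-(f - lowest)) for name, f in energies.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical phases: (average loss L, complexity lambda).
# "simple" fits slightly worse but is cheaper; "complex" fits better but costs more.
PHASES = {
    "simple": (0.50, 1.0),
    "complex": (0.45, 4.0),
}


def dominant_phase(n, phases=PHASES):
    """Name of the phase holding the most posterior mass after n samples."""
    posterior = posterior_over_phases(n, phases)
    return max(posterior, key=posterior.get)


if __name__ == "__main__":
    for n in (10, 100, 200, 400, 1000):
        posterior = posterior_over_phases(n, PHASES)
        print(n, {name: round(mass, 3) for name, mass in posterior.items()})
```

With these made-up values, the simpler phase holds most of the posterior mass for small n, and the mass moves to the more accurate but more complex phase once the accumulated loss difference outweighs the complexity penalty. The switch happens over a narrow range of n, which is the sense in which the posterior "shifts drastically".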

Developmental interpretability studies how structure forms and evolves in neural networks during training. It helps explain how networks develop their internal representations and organize information as training progresses.

"We term this developmental interpretability because of the parallel with developmental biology, which aims to understand the final state of a different class of complex self-assembling systems (living organisms) by analyzing the key steps in development from an embryonic state." [1]

The goal of developmental interpretability in the context of alignment is to:

References