AI Alignment

Published on: July 19, 2024

Some notes from an AI alignment class.

AI alignment: a subfield of AI safety that I like to view through the lens of robustness, scalable oversight, and mechanistic interpretability. I avoid the phrase "alignment with human values" because different humans hold different, and sometimes conflicting, values.

Types of AI alignment:
Fig 1: Types of AI alignment [1]
Goodhart's law: when a measure becomes a target, it ceases to be a good measure. Optimizing a proxy for a goal tends to break the correlation between the proxy and the goal.

Qualia: subjective, individual experiences of perception and sensation. For example, when you see a red apple, the redness you perceive is a quale. Even if we understand the neural processes and the wavelengths of light involved in seeing red, that does not fully explain the subjective experience of redness itself.
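Goodhart's law (an optimizer that targets a proxy measure tends to drift away from the true goal once the two decouple) can be illustrated with a toy sketch. This is a hypothetical example, not from the class notes: the function names and numbers are made up for illustration.

```python
# Toy illustration of Goodhart's law: a greedy optimizer that maximizes a
# proxy metric ends up far from the true objective, because the proxy keeps
# rewarding "more" even after the true objective starts to fall.

def true_objective(x: float) -> float:
    # The true goal: peaks at x = 5, falls off beyond it.
    return -(x - 5.0) ** 2

def proxy_metric(x: float) -> float:
    # The proxy correlates with the true goal for small x, but keeps
    # rewarding larger x indefinitely (e.g., "more output is better").
    return x

def hill_climb(metric, x=0.0, step=1.0, iters=20):
    # Greedy optimizer: move in whichever direction improves `metric`.
    for _ in range(iters):
        if metric(x + step) > metric(x):
            x += step
        elif metric(x - step) > metric(x):
            x -= step
    return x

x_star = hill_climb(proxy_metric)  # the optimizer chases the proxy
print(x_star)                      # 20.0 -- well past the true optimum at 5
print(true_objective(x_star))      # -225.0 -- true objective has collapsed
```

Hill-climbing on the true objective would stop at x = 5; hill-climbing on the proxy marches past it without limit, which is the failure mode Goodhart's law warns about.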

Four background claims (Machine Intelligence Research Institute (MIRI)) [2]. To sum up:
These four claims form the core of the argument that artificial intelligence is important: there is such a thing as general reasoning ability; if we build general reasoners, they could be far smarter than humans; if they are far smarter than humans, they could have an immense impact; and that impact will not be beneficial by default.

References