AI Risks • Apr 17, 2024 • 5 min read
Representation Engineering: a New Way of Understanding Models
Representation engineering is an exciting new field that explores how we can better understand traits such as honesty, power-seeking, and morality in LLMs. We show that these traits can be identified by looking at model activations, and that the same traits can also be controlled. This differs from mechanistic approaches, which build bottom-up interpretations from node-to-node connections. Representation engineering instead studies larger chunks of representations and higher-level mechanisms, understanding models in a 'top-down' fashion.
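To make "identify a trait from activations, then control it" concrete, here is a minimal toy sketch. It is not the method used in representation engineering research. Everything is an assumption for illustration: the activations are synthetic random vectors rather than a real LLM layer, the "honesty" signal is planted along a hypothetical hidden axis, and the trait direction is estimated with a simple difference of class means. Reading a trait means projecting an activation onto that direction. Controlling it means adding a scaled copy of the direction to the activation.

```python
import random

# Toy setup (hypothetical): 8-dimensional "activations" where honest examples are
# shifted along a hidden axis. A real setup would read activations from an LLM layer.
DIM = 8
random.seed(0)
HIDDEN_AXIS = [1.0 if i == 0 else 0.0 for i in range(DIM)]


def sample_activation(shift):
    """Draw a synthetic activation: small Gaussian noise plus shift * hidden axis."""
    return [random.gauss(0.0, 0.1) + shift * h for h in HIDDEN_AXIS]


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return dot(a, a) ** 0.5


# Contrasting stimuli: activations for "honest" vs "dishonest" examples.
honest = [sample_activation(+1.0) for _ in range(50)]
dishonest = [sample_activation(-1.0) for _ in range(50)]

# Reading: the difference of class means is a candidate trait direction,
# normalized to unit length so projections are easy to interpret.
direction = sub(mean(honest), mean(dishonest))
unit_direction = [x / norm(direction) for x in direction]


def trait_score(activation):
    """Project an activation onto the trait direction (positive = more 'honest')."""
    return dot(activation, unit_direction)


def steer(activation, alpha):
    """Control: add alpha times the unit trait direction to the activation."""
    return [a + alpha * u for a, u in zip(activation, unit_direction)]


if __name__ == "__main__":
    example = sample_activation(-1.0)
    print("score before steering:", round(trait_score(example), 3))
    print("score after steering:", round(trait_score(steer(example, 3.0)), 3))
```

Because the direction has unit length, steering with strength `alpha` raises the projection by exactly `alpha`. In practice, the layer to read from, the contrast prompts, and the steering strength all have to be chosen empirically.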
Written by:
Izzy Barrass, Long Phan