Sometime near the end of last year, I came across a blog post by Scott Alexander giving an overview of Anthropic’s recent work on language model interpretability.
This is fantastic work. Thank you for sharing!
re: 13. The interpretation of features can also be done via the autointerpretability procedure of Bills et al. (2023):
The autointerpretability procedure takes samples of text where the dictionary feature activates, asks a language model to write a human-readable interpretation of the dictionary feature, and then prompts the language model to use this description to predict the dictionary feature's activation on other samples of text. The correlation between the model's predicted activations and the actual activations is that feature's interpretability score.
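To make the procedure concrete, here is a minimal sketch of the scoring step. The two LM calls are abstracted as `explain_fn` and `predict_fn`, which are hypothetical stand-ins (not from Bills et al.'s actual codebase) for however you prompt the model; the score itself is just the Pearson correlation between predicted and actual activations.

```python
import numpy as np

def interpretability_score(
    explain_fn,        # hypothetical LM call: top-activating samples -> explanation text
    predict_fn,        # hypothetical LM call: (explanation, sample) -> predicted activation
    activating_samples,  # texts where the dictionary feature activates
    test_samples,        # held-out texts to predict on
    true_activations,    # the feature's actual activations on test_samples
):
    """Score a dictionary feature by how well an LM-written explanation
    predicts its activations, following the Bills et al. (2023) recipe."""
    # Step 1: ask the LM for a human-readable interpretation of the feature.
    explanation = explain_fn(activating_samples)

    # Step 2: ask the LM to predict the feature's activation on held-out text.
    predicted = np.array([predict_fn(explanation, s) for s in test_samples])
    actual = np.array(true_activations, dtype=float)

    # Step 3: the interpretability score is the correlation between
    # predicted and actual activations.
    return np.corrcoef(predicted, actual)[0, 1]
```

A high score means the natural-language explanation captures enough of the feature's behavior that the model can anticipate where it fires; a near-zero score suggests the explanation is vacuous or the feature is not human-interpretable.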