Discussion about this post

London

This is fantastic work. Thank you for sharing!

Jon Clement

re: 13. The interpretation of feature explanations can also be done per Bills et al. (2023):

The autointerpretability procedure takes samples of text where the dictionary feature activates, asks a language model to write a human-readable interpretation of the dictionary feature, and then prompts the language model to use this description to predict the dictionary feature’s activation on other samples of text. The correlation between the model’s predicted activations and the actual activations is that feature’s interpretability score.
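
For anyone who wants to see the final scoring step concretely, here is a minimal sketch in Python. It assumes the correlation in question is Pearson correlation and that the predicted and actual activations are already available as equal-length lists of numbers; the function name `interpretability_score` and the choice to return 0.0 for constant inputs are illustrative assumptions, not taken from Bills et al. (2023).

```python
import math


def interpretability_score(predicted, actual):
    """Pearson correlation between predicted and actual feature activations.

    Hypothetical helper for illustration; the name and the handling of
    constant inputs are assumptions, not part of the cited procedure.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two activations to correlate")

    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n

    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)

    # Assumption: correlation is undefined for a constant series, so
    # report 0.0 (no predictive signal) instead of dividing by zero.
    if var_p == 0 or var_a == 0:
        return 0.0

    return cov / math.sqrt(var_p * var_a)
```

A description that lets the language model predict activations well gives a score close to 1, while a description with no predictive value gives a score near 0.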
