Sometime near the end of last year, I came across a blog post by Scott Alexander giving an overview of Anthropic’s recent work on language model interpretability.
This is fantastic work. Thank you for sharing!
re: 13. The interpretation of features can also be done via the autointerpretability procedure of Bills et al. (2023):
The autointerpretability procedure takes samples of text where the dictionary feature activates, asks a language model to write a human-readable interpretation of the dictionary feature, and then prompts the language model to use this description to predict the dictionary feature's activation on other samples of text. The correlation between the model's predicted activations and the actual activations is that feature's interpretability score.
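To make the procedure concrete, here is a minimal sketch of the scoring step. The two LM calls are abstracted as `explain_fn` and `predict_fn`, which are hypothetical stand-ins (not from Bills et al.'s actual codebase) for however you prompt the model; the score itself is just the Pearson correlation between predicted and actual activations.

```python
import numpy as np

def interpretability_score(
    explain_fn,        # hypothetical LM call: top-activating samples -> explanation text
    predict_fn,        # hypothetical LM call: (explanation, sample) -> predicted activation
    activating_samples,  # texts where the dictionary feature activates
    test_samples,        # held-out texts to predict on
    true_activations,    # the feature's actual activations on test_samples
):
    """Score a dictionary feature by how well an LM-written explanation
    predicts its activations, following the Bills et al. (2023) recipe."""
    # Step 1: ask the LM for a human-readable interpretation of the feature.
    explanation = explain_fn(activating_samples)

    # Step 2: ask the LM to predict the feature's activation on held-out text.
    predicted = np.array([predict_fn(explanation, s) for s in test_samples])
    actual = np.array(true_activations, dtype=float)

    # Step 3: the interpretability score is the correlation between
    # predicted and actual activations.
    return np.corrcoef(predicted, actual)[0, 1]
```

A high score means the natural-language explanation captures enough of the feature's behavior that the model can anticipate where it fires; a near-zero score suggests the explanation is vacuous or the feature is not human-interpretable.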