Hi, I was reviewing this post by OpenAI about extracting concepts from GPT-4:
https://openai.com/index/extracting-concepts-from-gpt-4/
Which references this paper: [2406.04093] Scaling and evaluating sparse autoencoders
And closely related to this one: [2309.08600] Sparse Autoencoders Find Highly Interpretable Features in Language Models
I'm wondering whether the brute-force approach is gradually integrating features of the reverse-engineering approach, so that the two are converging?
Hey there, @Aleksandar_Kamburov . Welcome!
Interesting article from OpenAI. I’ll need to spend some more time reading it later. At a glance, yeah, I can see where some convergence is happening: namely in things like content identification and bias detection.
That said, I wouldn’t expect an autoencoding approach to be completely foolproof (then again, no single approach is perfect). I expect autoencoders wouldn’t be great at things like contextual reasoning or understanding causal relationships. They probably wouldn’t be the greatest at handling novelty either.
Hi @Aleksandar_Kamburov! This is Hojae, one of TBP’s Researchers. Welcome to the community!
Great question about the convergence between these approaches. We’re familiar with some of the work in the field of mechanistic interpretability and Sparse Autoencoder (SAE) techniques for understanding LLMs (e.g., there is also work by Anthropic’s Interpretability team, such as Scaling Monosemanticity).
The SAE approach is fascinating because it attempts to identify the “concepts” learned by LLMs after they’ve been trained. While OpenAI’s work identified an impressive 16 million concepts from GPT-4, there are still important limitations. These concepts often require human labeling to be meaningful, and the discovered features can be difficult to interpret consistently. Sometimes what appears to be a single concept shows unexpected activations alongside seemingly unrelated concepts.
As for convergence, there’s certainly movement toward reverse-engineering approaches, but we’re not quite there yet. It may be possible that in the near future the DL community uses techniques from earlier “steerability” papers on generative networks (e.g., semantic factorization of StyleGANs, and GANSpace) to causally influence model behavior by activating specific features. For example, in the Anthropic paper linked above, they “steered” Claude into thinking it is the Golden Gate Bridge by maximizing that feature’s activation. But we don’t yet have a clear path to fully reverse-engineer an LLM based on SAE-discovered concepts.
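To make the SAE and steering ideas a bit more concrete, here is a toy sketch in plain Python. It is not the code from either paper: the class `ToySparseAutoencoder`, the helper `steer`, the random untrained weights, and the tiny dimensions are all assumptions made for illustration. It shows a TopK-style sparse encoding of an activation vector, and steering by clamping one feature and decoding back.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def top_k_mask(values, k):
    """Keep the k largest values (clipped at zero) and zero out the rest."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(values)]


class ToySparseAutoencoder:
    """Hypothetical, untrained SAE: random encoder, decoder tied as its transpose."""

    def __init__(self, d_model, n_features, k, seed=0):
        rng = random.Random(seed)
        # encoder has shape n_features x d_model
        self.encoder = [[rng.gauss(0, 1) for _ in range(d_model)]
                        for _ in range(n_features)]
        # decoder has shape d_model x n_features (transpose of the encoder)
        self.decoder = [[self.encoder[f][d] for f in range(n_features)]
                        for d in range(d_model)]
        self.k = k

    def encode(self, activation):
        """Map a model activation to at most k active features."""
        return top_k_mask(matvec(self.encoder, activation), self.k)

    def decode(self, features):
        """Map sparse features back into activation space."""
        return matvec(self.decoder, features)


def steer(sae, activation, feature_index, value):
    """Clamp one feature to `value`, keeping the SAE's reconstruction error."""
    features = sae.encode(activation)
    error = [a - r for a, r in zip(activation, sae.decode(features))]
    features[feature_index] = value
    return [s + e for s, e in zip(sae.decode(features), error)]


sae = ToySparseAutoencoder(d_model=4, n_features=16, k=3)
x = [0.5, -1.0, 0.25, 2.0]
features = sae.encode(x)
steered = steer(sae, x, feature_index=5, value=10.0)
```

In this sketch, steering moves the activation along feature 5's decoder direction and leaves everything else untouched. A real SAE would be trained to reconstruct activations from an actual model, and the steered vector would be written back into that model's forward pass.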
One interesting philosophical difference is our perspective on “polysemanticity” (the idea of neurons representing multiple concepts) and “monosemanticity” (the idea of neurons having a one-to-one correspondence with features or concepts) from the field of mechanistic interpretability. In the human brain, there isn’t a specific neuron dedicated to a particular concept. Rather than attempting to extract concepts post-hoc from a trained system, Monty learns structured representations through sensorimotor interaction with its environment. The concepts or objects that Monty learns are directly acquired through experience rather than emerging from subsequent analysis. Our approach is based on reference frames and embodied learning, which we believe more closely mirrors how biological intelligence develops.
That said, it’s great to see efforts to better understand the inner workings of neural networks.
Hi @hlee, thank you for your insights. It was interesting to understand the differences you’re seeing. Also, I think this is the first time I’ve heard about polysemanticity.
Much appreciated!
Alex
Of course! I highly recommend the Scaling Monosemanticity paper from Anthropic, and you may be interested in their latest follow-up paper, On the Biology of a Large Language Model.