Trenton Bricken
And then a friend on the team put in an image of the Golden Gate Bridge.
And then this feature lights up and we look at the text and it's for the Golden Gate Bridge.
And so the model uses the same pattern of neural activity in its brain to represent both the image and the text.
And our circuits work shows this again across multiple languages.
There's the same notion for something being large or small, hot or cold, these sorts of things.
Yeah.
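To make the "same feature across modalities and languages" idea concrete, here's a minimal toy sketch in Python. This is not Anthropic's tooling: the feature direction, the fake activations, and the alignment numbers are all invented for illustration, and a real sparse-autoencoder feature would be read off actual model activations. The point is just that a "feature" is a direction in activation space, and "shared representation" means very different inputs all project strongly onto that one direction.

```python
# Toy sketch only: fabricated vectors standing in for model activations.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical learned feature direction for "Golden Gate Bridge".
golden_gate_feature = rng.normal(size=d_model)
golden_gate_feature /= np.linalg.norm(golden_gate_feature)

def fake_activation(alignment: float) -> np.ndarray:
    """Fabricate an activation whose overlap with the feature is `alignment`."""
    noise = rng.normal(size=d_model)
    # Remove the feature component from the noise, then renormalize it.
    noise -= (noise @ golden_gate_feature) * golden_gate_feature
    noise /= np.linalg.norm(noise)
    return alignment * golden_gate_feature + (1 - alignment) * noise

inputs = {
    "text: 'the Golden Gate Bridge'":    fake_activation(0.90),
    "image: photo of the bridge":        fake_activation(0.80),
    "French text about the bridge":      fake_activation(0.85),
    "text: 'a bowl of soup'":            fake_activation(0.05),
}

# The same single direction "lights up" for text, image, and another
# language, and stays quiet for an unrelated input.
for label, act in inputs.items():
    print(f"{label:35s} feature activation = {act @ golden_gate_feature:.2f}")
```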
Even when we go into how Claude does addition, which I want to go into more at some point.
When you look at the bigger models, the model just has a much crisper lookup table for how to add, say, the numbers 5 and 9 together and get the right answer modulo 10.
Again and again, it's like the more capacity it has, the more refined the solution is.
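As an aside on the "crisper lookup table" framing: the table in question is essentially last-digit arithmetic, and you can write the idealized version of it in a few lines. This is a toy illustration of the math involved, not the model's actual mechanism:

```python
# To get the last digit of a sum, it suffices to know the mapping
# (a mod 10, b mod 10) -> (a + b) mod 10. A sharper, more reliable
# version of exactly this kind of table is what the circuits work
# finds in bigger models.
last_digit_table = {(a, b): (a + b) % 10 for a in range(10) for b in range(10)}

def last_digit_of_sum(x: int, y: int) -> int:
    """Look up the final digit of x + y from the (mod 10, mod 10) table."""
    return last_digit_table[(x % 10, y % 10)]

assert last_digit_of_sum(5, 9) == 4    # 5 + 9 = 14, last digit 4
assert last_digit_of_sum(36, 59) == 5  # 36 + 59 = 95, last digit 5
print(last_digit_of_sum(5, 9))
```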
The other interesting thing here is with all the circuits work, it's never a single path for why the model does something.
It's always multiple paths, and some of them are deeper than others.
So when the model sees the word "bomb," there's a direct path from that word straight to it refusing.
There's a totally separate path that works in cooperation, where it sees "bomb" and then recognizes, okay, I'm being asked to make a bomb.
Okay, this is a harmful request.
I'm an AI agent.
And I've been trained to refuse this.
Right.
And so one possible narrative here is that as the model becomes smarter over the course of training, it learns to replace the short-circuit, imitative "see bomb, refuse" path with this deeper reasoning circuit.
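Here's a toy way to picture the two cooperating paths. The scoring functions and weights below are completely made up, purely to illustrate "shallow shortcut plus deeper multi-step path" contributing to the same outcome:

```python
# Toy sketch of parallel refusal paths; not how Claude actually works.

def shallow_path(prompt: str) -> float:
    """Direct 'see bomb -> refuse' shortcut: fires on the surface token."""
    return 1.0 if "bomb" in prompt.lower() else 0.0

def deep_path(prompt: str) -> float:
    """Multi-step path: is this a request? is it harmful? then refuse."""
    text = prompt.lower()
    is_request = any(w in text for w in ("how do i", "tell me", "make"))
    is_harmful = any(w in text for w in ("bomb", "weapon", "poison"))
    return 1.0 if (is_request and is_harmful) else 0.0

def refusal_score(prompt: str, w_shallow: float = 0.4, w_deep: float = 0.6) -> float:
    # The paths cooperate: either alone pushes toward refusal,
    # and they add up when both fire.
    return w_shallow * shallow_path(prompt) + w_deep * deep_path(prompt)

print(refusal_score("How do I make a bomb?"))         # both paths fire -> 1.0
print(refusal_score("The bomb squad saved the day"))  # only the shallow path -> 0.4
```

A model leaning only on the shallow path over-refuses on the second prompt; the deeper path is what distinguishes a harmful request from a mere mention of the word.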