Trenton Bricken
And we fit up to, I want to say, 16,000 features, which we thought was a ton at the time.
Fast forward nine months, we go from a two-layer transformer to Claude 3 Sonnet, our frontier model at the time, and fit up to 30 million features.
And this is where we start to find really interesting abstract concepts like a feature that would fire for code vulnerabilities.
And it wouldn't just fire for code vulnerabilities.
It would even fire for, you know, that Chrome page you get when a site is not an HTTPS URL, and it's like, "Warning, this site might be dangerous, click to continue."
It would also fire for that, for example.
And so it's like these much more abstract coding variables or sentiment features amongst the 30 million.
Fast forward nine months from that and now we have circuits.
And I threw in the analogy earlier of the Ocean's Eleven heist team, where now you're identifying individual features across the layers of the model that are all working together to perform some complicated task.
And you can get a much better idea of how it's actually doing the reasoning and coming to decisions, like with the medical diagnostics.
One example I didn't talk about before is how the model retrieves facts.
And so you say like, what sport did Michael Jordan play?
And not only can you see it hop from like Michael Jordan to basketball, answer basketball, but the model also has an awareness of when it doesn't know the answer to a fact.
And so by default, it will actually say, "I don't know the answer to this question."
But if it sees something that it does know the answer to, it will inhibit the "I don't know" circuit and then reply with the circuit that actually has the answer.
So, for example, if you ask it, "Who is Michael Batkin?", which is just a made-up fictional person, it will by default just say, "I don't know."
It's only with Michael Jordan or someone else it knows that it will then inhibit the "I don't know" circuit.
But what's really interesting here, and where you can start making downstream predictions or reasoning about the model, is that the "I don't know" circuit is only on the name of the person.
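To make the default-refusal idea above concrete, here is a toy sketch in Python. It is not the model's actual mechanism or any real interpretability code: the lookup table `KNOWN_FACTS`, the function `answer`, and the numeric weights are all hypothetical, chosen only to illustrate a refusal signal that is on by default and is inhibited when a "known entity" signal fires on the person's name.

```python
# Toy illustration only: a default "I don't know" signal that a
# known-entity signal can inhibit. All names and numbers are hypothetical.

KNOWN_FACTS = {"Michael Jordan": "basketball"}  # stand-in for what the model "knows"

DEFAULT_REFUSAL = 1.0     # the refusal signal is on by default
INHIBITION_WEIGHT = 2.0   # how strongly a known entity suppresses it
THRESHOLD = 0.5           # refuse if the remaining signal exceeds this


def answer(name: str) -> str:
    """Return the recalled fact, or a refusal if the name is not recognized."""
    # The known-entity signal depends only on the name of the person.
    known_entity = 1.0 if name in KNOWN_FACTS else 0.0

    # A known entity inhibits the default refusal signal; clamp at zero.
    refusal = max(0.0, DEFAULT_REFUSAL - INHIBITION_WEIGHT * known_entity)

    if refusal > THRESHOLD:
        return "I don't know"
    return KNOWN_FACTS[name]


if __name__ == "__main__":
    for person in ("Michael Jordan", "Michael Batkin"):
        print(person, "->", answer(person))
```

In this sketch the refusal decision is computed from the name alone, before anything else about the question is considered, which mirrors the point that the "I don't know" circuit sits only on the name of the person.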