Trenton Bricken
And it kind of has kept the other stuff around to the extent that it's like not harmful.
Here at the Dwarkesh Podcast.
Yeah, but my question is always, like, are you giving the model enough context?
And with agents now, like, are you giving it the tools such that it can go and get the context that it needs?
Because I would be optimistic that if you did, then you would start to see it be more performant for you.
All right, back to Trenton and Sholto.
I mean, even inside Anthropic and on the interpretability team, there is active debate over what the models can and can't do.
And so a few months ago, a separate team in the company, the model organisms team,
created this, I'll call it an evil model for now, didn't tell anyone else what was wrong with it, and then gave it to different teams who had to investigate and discover what the evil behavior was.
And so there were two interpretability teams that did this.
And we were ultimately successful.
One of the teams actually won in 90 minutes.
We were given three days to do it.
But more recently, I've developed what we're calling the interpretability agent, which is a version of Claude that has the same interpretability tools that we'll often use.
And it is also able to win the auditing game and discover the bad behavior.
End-to-end?
End-to-end, yeah.
You give it the same prompt that the humans had.
You fire it off, and it's able to converse with the evil model,
and call the get-top-active-features tool, which gives it the 100 most active features for whatever prompt it wants to use.
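The tool described here can be sketched roughly as follows. This is a minimal, hypothetical illustration, not Anthropic's actual implementation: the feature ids, activation values, and function name are all invented, and real feature activations would come from running a prompt through an interpretability pipeline rather than a lookup table.

```python
def get_top_active_features(activations, k=100):
    """Return the k feature ids with the highest activation, descending.

    `activations` maps feature id -> activation strength for one prompt.
    (Toy stand-in for a real feature-extraction backend.)
    """
    ranked = sorted(activations.items(), key=lambda kv: kv[1], reverse=True)
    return [feature_id for feature_id, _ in ranked[:k]]


# Toy activations for a single prompt (values made up for the example).
acts = {
    "f_sycophancy": 0.91,
    "f_python_code": 0.12,
    "f_flattery": 0.87,
    "f_dates": 0.05,
}
top = get_top_active_features(acts, k=2)
print(top)  # ['f_sycophancy', 'f_flattery']
```

In an agent loop, a tool like this would be exposed alongside a chat tool, letting the agent alternate between conversing with the target model and inspecting which features fire on the prompts it tries.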