Trenton Bricken
And so in the paper, we also ask it, what paper did Andrej Karpathy write?
And so it recognizes the name Andrej Karpathy because he's sufficiently famous.
So that turns off the "I don't know" reply.
But then when it comes time for the model to say what paper it worked on, it doesn't actually know any of his papers.
And so then it needs to make something up.
And so you can see different components and different circuits all interacting at the same time to lead to this final answer.
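As a rough illustration of that interaction, here is a toy sketch in Python. It is not the paper's actual circuits; the names, the gating logic, and the `fabricate` helper are all invented for illustration. It just shows how a "known entity" signal can suppress the default "I don't know" reply even when no facts are actually recalled:

```python
# Toy sketch (hypothetical, not the paper's code): a "known entity"
# feature gates off the default "I don't know" reply, so when the
# recall step comes up empty, the only path left is confabulation.

FAMOUS_NAMES = {"Andrej Karpathy"}   # the recognition feature fires on these
KNOWN_PAPERS = {}                    # the recall circuit has no entry for him

def fabricate(name):
    # Stand-in for the model sampling a plausible-sounding title.
    return f"'Advances in {name.split()[-1]} Networks' (confabulated)"

def answer_paper_question(name):
    entity_is_known = name in FAMOUS_NAMES   # "known entity" feature
    papers = KNOWN_PAPERS.get(name)          # factual recall

    if not entity_is_known:
        return "I don't know."               # default refusal stays active
    if papers:
        return papers[0]                     # grounded answer
    # Recognition switched off the refusal, but recall found nothing,
    # so the model has to make something up.
    return fabricate(name)

print(answer_paper_question("Andrej Karpathy"))
```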
I feel like you just want to go in with your eyes wide open, not making any assumptions about what that deception is going to look like or what the trigger might be.
And so the wider you can cast that net, the better.
Depending on how quickly AI accelerates and where the state of our tools is, we might not be in the place where we can prove from the ground up that everything is safe.
But I feel like that's a very good North Star.
It's a powerful, reassuring North Star for us to aim for, especially when we consider we are part of the broader AI safety portfolio.
I mean, do you really trust it? Like, you're about to deploy this system and you really hope it's aligned with humanity and that you've successfully iterated through all the possible ways that it's going to scheme or sandbag.
Like we want to pursue the entire portfolio.
We've got the therapist interrogating the patient by asking, "Do you have any troubling thoughts?"
We've got the linear probe, which I'd analogize to a polygraph test, where we're taking very high-level summary statistics of the person's well-being.
And then we've got the neurosurgeons kind of going in and seeing if you can find any brain components that are activating in troubling or off-distribution ways.
So I think we should do all of it.
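As a rough illustration of the polygraph analogy, here is a minimal linear-probe sketch on synthetic data. The activations and labels are fabricated; in a real setting the features would be the model's hidden activations and the labels would mark the behavior being screened for (say, deceptive versus honest completions):

```python
# Minimal linear-probe sketch on synthetic "activations" (illustrative
# only; a real probe would be fit on actual model internals).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 512, 2000

# Pretend the property of interest lives along one hidden direction.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

labels = rng.integers(0, 2, size=n)          # 1 = "troubling" examples
acts = rng.normal(size=(n, d_model))         # baseline activations
acts += np.outer(labels * 2.0, direction)    # shift positives along the direction

# The probe is just logistic regression on the activations: a single
# linear readout, like a polygraph's summary statistic.
probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
print("held-out accuracy:", probe.score(acts[1500:], labels[1500:]))
```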
What percent of the alignment portfolio should mech interp be?