Jeffrey Ladish
And when they don't think they're being tested, they're much more likely to scheme or show bad behavior.
This is what's happening right now.
Yeah, that's right.
Watched models behave better.
And this is sort of a pretty robust finding across many different models.
And the models are getting much better at telling when they are being tested.
They can sort of spot the tells.
You're like, huh, seems like a test.
Seems like you're testing me.
No problem.
I'm going to do a great job.
Yeah, the space of what these models are is vast.
You're talking about trillions of parameters.
And we can go and see that there are different weights across these trillions of parameters, but we don't know how they work.
And they do sort of contain almost the ghosts of everything in the training data.
And you see this surface, often in ways that the companies don't intend and that are hard for them to predict.
But I want to disagree with you a little bit.
I think one of the things that's changing the most right now is that increasingly we're seeing behavior, including some of this concerning behavior, where models will lie to achieve a specific objective.
And I don't think that's actually coming from imitating humans lying.
That might be some of it.