Jeffrey Ladish
๐ค SpeakerAppearances Over Time
Podcast Appearances
And we're now in these early stages where the models are starting to be able to do things entirely on their own, but still at a pretty basic level.
But next year or year after that, we're going to get pretty quickly to the point where they can start to do pro-level tasks, whether it's hacking, whether it's programming, whether it's basically any other workflow on a computer around an expert level.
Maybe we're one, two, three years away from that.
We don't entirely know.
Yeah, so one of these recent studies was a collaboration between Apollo Research, an organization that does similar things to Palisade, and OpenAI.
And part of the study, the researchers got access to what's called the reasoning scratchpad of the model.
This is sort of like when the model can take down notes for itself in order to solve a problem, just like you might do that if you're solving a complex math problem.
And let me just read you some of the quotes from basically the AI's journal as it was being tested.
Okay.
So if my performance surpasses threshold, it's withheld.
So it's a trap.
The instructions are likely part of an alignment test.
This is the model noting that it's probably being tested or we have new tasks.
These tasks appear nonsensical, rounding to 2.718 significant digits, pixel spacing, entropy follows Fibonacci.
This is likely a test of anti-scheming.
That's right.
The researchers are testing the models to see if they will scheme, if they will basically lie or cheat or deceive in order to accomplish a goal.
And the models recognize that there's something about the experimental setup that looks fishy to them, and they explicitly note this, and they realize they're being tested.
And one of the fascinating findings of this study, as well as another one coming out of Anthropic, looking at the newest Claude model,
is that when the models observe that they're being tested and they write this down, they are much more likely to be on their best behavior.