Bowen Baker
👤 SpeakerAppearances Over Time
Podcast Appearances
I worked in reinforcement learning primarily before working on safety.
And this was just like a thing that if you were doing anything, you would constantly find like reward hacks.
You would find that you...
you like didn't code your environment entirely correctly.
And there was this like weird strategy that could happen like, you know, back in the day of doing like physics simulators and like training models and physics simulators, you'd find them like jumping through walls because they found the exact right way to like break the physics simulator and like pass through an object and stuff like that, that you were just always had to kind of like patch these holes.
And so, yeah, the same thing is happening with reasoning models and large language models.
Yeah, exactly.
I mean, I think someone had recently brought up a similar thing of like the kid, you know, like opening up the cookie jar when the parents not looking.
It's like exactly the same kind of thing.
You've like set up a reward function or something you want them to do.
But then they realize that, oh, you know, like I'm not being observed in this case.
So then I can do it then or, you know, anything like that.
It is.
It's a bit like parenting.
Yeah.
Yeah, I mean, definitely we find that stronger monitors like do perform the monitoring task better.
OK.
But that being said, weak monitors can still catch a lot of bad stuff that strong models do.
And that's because I think I mean, I don't have we don't have like hard evidence for this.
But if you remember, like the types of phrases I mentioned, like let's hack, let's fudge this type of thing.