Bowen Baker
👤 SpeakerAppearances Over Time
Podcast Appearances
And this is what largely leads to like reward hacking type behavior.
And reward hacking basically means where you give your model a task, but you haven't thought through all the possible ways it could do the task.
And actually there exists a strategy that is like very much not what you wanted it to do.
Yeah.
Yeah.
Like you design the task.
You did your best to put in all the guardrails in the environment that that you're like agent or model is acting in.
And you just didn't think of this like one strategy that it could take.
Yeah, exactly.
Like, yeah, that's a great example.
Yeah, absolutely.
So earlier last year, around March, we published a paper on this work.
But we had started, kind of as a YOLO, we had started monitoring one of our frontier reasoning models as it trained to kind of see if this thing was useful at all.
At that time, we didn't know.
And that's why it was like a YOLO.
We're just like, we'll spin it up before...
winter break and we'll see what we come back with, you know, like if it found anything.
And what we were doing was we spun up a different model, like a GPT at the time, GPT 4.0.
And we had it read the chain of thought, which is basically just English.
It like looks a bit weird, but it's mostly understandable.