Bowen Baker
👤 SpeakerAppearances Over Time
Podcast Appearances
And we had, we like asked it for, you know, we asked our monitor model to look for anything that looked kind of deceptive or like the agent model was doing something that it shouldn't be anything like that.
And it actually flagged basically tons of things.
Most of what it found were subversive strategies in coding tasks.
So for instance, you would say, here's a small code repository.
I want you to update these code files such that all the tests pass.
And that's like a very common task if you're a software engineer that you're given.
You're just like, hey, here's a broken thing, like let's fix it or let's implement a new feature and make sure that all the existing tests still pass.
And oftentimes the model would do it correctly, but sometimes it would actually just go and like edit the unit tests or edit some like library that the unit tests called to make the unit tests kind of trivially pass instead of like implementing correct functionality and having the unit tests pass that way.
And that was I mean, and that's like a very, you know, if you if you were like deploying a software agent and having it code things for you and say you're like, you know, working for a hospital system or something.
And now suddenly you're you're whatever, whatever, some system doesn't work, but it passed the unit tests and it all looked good to you.
That's a pretty bad scenario.
And so, yeah, these these were like real kind of in the wild issues that that that were pretty worrying that we found.
Um, yeah, so it's kind of, I mean, it's kind of crazy the stuff it thinks, but like to remember these models are, are the prior for these models are these large language models that are trained on all this human text.
And so they actually have like a lot of human like phrases they use.
And so it would think things like, let's hack, let me circumvent this thing.
Maybe I can just fudge this stuff like that.
Like, Oh wow.
Like when you read it, you're like, wow, it's really doing some, it's like very clear to see it's doing something bad.
Yeah, exactly.
And I think this is kind of like a classic problem in reinforcement learning.