Bowen Baker
you know, a very, very dumb, small language model can pick up on signals like that.
And so when the model is being very, very obvious, you know, they can pick up on those things.
And I think that is kind of like, yeah, that is probably the main mechanism by which that happens.
Yeah, so that's a great question.
So the outputs that the models have, like what you read in ChatGPT, what it actually gives you, those are optimized to be, you know, helpful to you, do what you want, be safe, or, you know, comply with any specifications we have on what a good output looks like. And so if the model knows it's doing something bad, it will know that it can't actually say that in the output, because it's told not to do bad things for the user.
And so there's a clear reward to not reveal bad things happening in the output.
Whereas in the thinking, there's no such reward or pressure to do so.
It's like the thinking is kind of treated as a private space for the model.
And that's something that we at OpenAI really tried to hold a hard line on, for the most part: to not apply any kind of style pressures or to make the chain of thought look nice in any kind of way, because, you know, if it wants to have a thought about a strategy for doing something bad, we want that to be very clear so that we can then monitor and catch it.
And then the other reason is that, you know, there's this thing where the outputs are pressured to not show bad things.
But also, outputs can be incredibly complex.
We're seeing these models do more and more crazy things every day.
The size of a coding project a model can actually accomplish is, I don't know, probably pretty big by now, depending on what benchmarks you look at.
And imagine it's mostly good, but in one file, it didn't really know how to do a thing.
And it does this thing where it, like, stubs the unit tests and kind of makes that one module not really work as it should.