Bowen Baker
👤 SpeakerAppearances Over Time
Podcast Appearances
And this goes back to our also our conversation of, you know, talking about like what, you know, is the chain of thought or sorry, is the model doing something instinctively or is it like actually thinking through it actively in the chain of thought?
there are certain pressures or things that could happen while training that could kind of shift that balance probably.
And so one is, you know, this thing of like, if you train the model to never say something, think something bad,
We showed in a paper earlier this last year that it can in some cases actually still do a bad thing, but not think the bad thoughts anymore.
We think it probably can only do this in cases where the chain of thought isn't actually necessary to do the bad thing, going back to this question of necessity.
of like of the chain of thought um but it still shows that you know there's actually some propensity for models to reason out loud even when they don't need to and you can kind of destroy that propensity with style pressures like that um and so yeah the model and then we call that obfuscation like the model is like obfuscating it's actually it's it's thinking um
And then there's kind of a lot of other reasons we think that, you know, chains of thought could become like less information dense or have yet be less monitorable.
So, for instance, people are worried even about just large scale RL or reinforcement learning, which is the way we train these models kind of just in the normal way we're doing it.
You could imagine that.
as you increase the compute there, models could eventually learn their own languages or change the way that they're using the English language or whatever language they're thinking in.
At the current scale, and this is something we found in our recent paper,
It doesn't seem that worrying, but it is, you know, in the reinforcement learning literature, there are examples, like people were working on like emergent language and emergent communication back in like, I don't know, 2016 to 2018, 19 or so.
And, you know, they showed it in like very constrained settings that if you just put in models that have no language prior, they're just totally randomly initialized, they can learn some language to talk to each other with.
And...
Exactly.
Yeah, yeah, yeah.
And so, yeah, so, but this might happen where, like, the model would, like, learn its own internal language of how to think that's maybe, like, a bit more efficient or compressed than the English language.
And it's, like, then it's harder for us to understand ourselves.
Or a different model to understand that hasn't, like, learned that same exact language, specific language.
Yeah.