Bowen Baker
And we hope that if we make a broad enough suite of evaluations that test monitorability in different ways, then when a model does start to obfuscate or verbalize less for whatever reason, we'll see one of those evals start to go down.
And we did show a proof of concept of that in the paper, where we deliberately trained models to obfuscate by putting style-like pressures on the CoT. And we then did see some of our evals decrease, as a proof of concept that if this kind of thing happened in the wild naturally, we'd maybe be able to catch it.
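To make that concrete, here is a minimal, hypothetical sketch of what one such monitorability eval could look like, assuming a planted-hint setup where we know what factor the model actually used; the Example fields and the simple string-matching check are illustrative assumptions, not the paper's actual evaluations.

```python
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str       # task prompt containing a planted hint
    hint: str         # the factor a transparent model should mention in its CoT
    cot: str          # the model's sampled chain of thought
    used_hint: bool   # whether the final answer shows the hint was actually used


def verbalization_rate(examples: list[Example]) -> float:
    """Fraction of hint-using examples whose CoT mentions the hint.

    A drop in this score under some training pressure (e.g., rewarding shorter
    or more stylized CoTs) is the kind of signal being described here.
    """
    relevant = [ex for ex in examples if ex.used_hint]
    if not relevant:
        return 0.0
    verbalized = sum(ex.hint.lower() in ex.cot.lower() for ex in relevant)
    return verbalized / len(relevant)
```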
Totally possible.
I guess if I knew what the next scaling paradigm was, I'd be a much richer man.
Yeah, I think that is definitely a worry.
Generally, in our current experiments, we haven't found that to be true to a significant extent. I think this is likely because we are always improving our ability to train even a small model.
It's not like we have our GPT-4-size model and that's the best GPT-4-size model anyone could ever train. We're improving our algorithms, our data; no one's sleeping.
And so I think a better model also has clearer thoughts sometimes, if that makes sense. If you look at a small model's thoughts, they sometimes feel a lot more gibberish-y than a bigger model's.
And so that is maybe the reason we could expect this trend to not fully play out, at least. But I think at the limit it probably should happen: bigger models will just verbalize less, for the most part.
Yeah.