Bowen Baker
I doubt there's going to be one strategy or one method that makes models completely safe on its own.
Oh, yeah.
So I'll break it down into two things.
First: how would we actually use a monitor in practice?
And then: what about obfuscation?
Can models actually hide their thinking?
So I'll try to answer that first one.
And then we can talk about the second one.
But yeah, so how would you actually do it?
I think, obviously, every company probably wants to get their latency down as much as they can when giving users results.
But for complex knowledge work, people are already waiting quite a while to get the report back, or their code base updated, and so on.
And so if we had to wait slightly longer because we were monitoring the outputs of the model as it was going,
that plausibly seems like something products would do if it actually served a critical safety function.
And one nice thing there is that for every action the model takes, you can monitor the chain of thought
and the action before you actually execute it, if you're willing to pay the latency cost of waiting to execute the action and start the next part of the reasoning.
And if you do that, you can probably mitigate some harms before they happen because you know that the model was thinking about doing something bad.
And then you can just say, oh, it was thinking about doing something bad.
I'm just going to cut it off here.
And I'll retry or tell the user I can't do the thing for some reason.
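To make the gating pattern Bowen is describing concrete, here is a minimal Python sketch. Everything in it (the Agent and Monitor stand-ins, the is_suspicious check, the retry budget) is a hypothetical illustration of the idea rather than any real product's API: the model proposes its chain of thought and an action, a monitor reviews both, and only then does the action execute.

```python
# Hypothetical sketch of chain-of-thought monitoring as a gate on actions.
# Agent, Monitor, and all method names below are illustrative stand-ins,
# not a real product API; a real monitor would be another model.

class Agent:
    def propose(self, task: str) -> tuple[str, str]:
        # Produce (chain_of_thought, proposed_action) without executing anything.
        return (f"I should search for {task}", f"search({task!r})")

    def execute(self, action: str) -> str:
        # Only called once the monitor has cleared the step.
        return f"executed {action}"

class Monitor:
    def is_suspicious(self, thought: str, action: str) -> bool:
        # Stand-in check; in practice this would score the reasoning and action.
        return "delete all files" in thought or "delete all files" in action

MAX_RETRIES = 2  # assumed retry budget, not something from the conversation

def monitored_step(agent: Agent, monitor: Monitor, task: str) -> str:
    """Execute one agent step, but only after the monitor has reviewed both
    the chain of thought and the proposed action."""
    for _ in range(MAX_RETRIES + 1):
        thought, action = agent.propose(task)       # nothing has run yet
        if monitor.is_suspicious(thought, action):  # bad intent caught early
            continue                                # cut it off and retry
        return agent.execute(action)
    return "Sorry, I can't help with that request."  # fall back to a refusal

print(monitored_step(Agent(), Monitor(), "the quarterly report"))
```

The key design point is that the check happens between proposing and executing each step, which is exactly where the extra latency Bowen mentions gets paid.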