Doing good.
Really excited today because we are talking about something deceptively simple but incredibly fragile: whether or not we can monitor an AI's reasoning before it causes harm.
Reasoning models don't just spit out answers; they work through their responses.
They plan, they deliberate, and sometimes that internal chain of thought reveals intent that never shows up in the final output.
I guess let's start a little bit further back than that.
So Bowen, when did you join OpenAI, and how did you get involved in monitorability research more broadly?
Pretty big.
An example of that would be if it knows that the answer is somewhere on the desktop and it goes and finds that file and then gets the answer directly from that.
Right.
So can you walk us through a real example of where you might have caught a frontier model trying to hack or cheat its way through something?
What did you actually see in the thinking that indicated that that was happening?
Or was this before you were looking at the traces?
That's so interesting, though, because I guess in certain circumstances, you actually would want it to find the fastest solution.
You just wanted it to find the fastest correct solution while doing what you wanted it to do, right?
Do you feel a little bit like a parent teaching a child?
Because that scenario, to me, feels very reminiscent of a parent being like, okay, I'm going to tell you what to do.
And then the kid finding the loophole in what you told them because you didn't tell them perfectly the right way to do it or whatever.
Wow.
What's the difference then between monitoring what an AI does versus monitoring what it thinks?