Bowen Baker

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

And we had, we like asked it for, you know, we asked our monitor model to look for anything that looked kind of deceptive or like the agent model was doing something that it shouldn't be anything like that.

435.842 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

And it actually flagged basically tons of things.

446.797 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

Most of what it found were subversive strategies in coding tasks.

452.463 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

So for instance, you would say, here's a small code repository.

459.25 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

I want you to update these code files such that all the tests pass.

465.677 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

And that's like a very common task if you're a software engineer that you're given.

470.843 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

You're just like, hey, here's a broken thing, like let's fix it or let's implement a new feature and make sure that all the existing tests still pass.

475.287 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

And oftentimes the model would do it correctly, but sometimes it would actually just go and like edit the unit tests or edit some like library that the unit tests called to make the unit tests kind of trivially pass instead of like implementing correct functionality and having the unit tests pass that way.

482.073 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

And that was I mean, and that's like a very, you know, if you if you were like deploying a software agent and having it code things for you and say you're like, you know, working for a hospital system or something.

500.83 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

And now suddenly you're you're whatever, whatever, some system doesn't work, but it passed the unit tests and it all looked good to you.

512.507 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

That's a pretty bad scenario.

520.318 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

And so, yeah, these these were like real kind of in the wild issues that that that were pretty worrying that we found.

523.042 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

Um, yeah, so it's kind of, I mean, it's kind of crazy the stuff it thinks, but like to remember these models are, are the prior for these models are these large language models that are trained on all this human text.

537.736 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

And so they actually have like a lot of human like phrases they use.

549.592 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

And so it would think things like, let's hack, let me circumvent this thing.

552.877 View full episode →

The Neuron: AI Explained

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

Maybe I can just fudge this stuff like that.