Bowen Baker

OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

Yeah.

So this was earlier last year, and we were worried that people would start applying various pressures to the chain of thought, because no one had really thought it through yet.

The Neuron: AI Explained

And we were hypothetically worried that, yeah, if you tell it not to think bad things, it could still think bad things in its activations and then do a bad thing afterwards.

So we did this experiment where we trained two models: our baseline, which has no, I don't know, we can call it "thought police" or something.

And then another model where the thought police were present.

And the model with the thought police present actually did do fewer bad things.

Which makes sense, because our models are oftentimes causally affected by their chain of thought: if they don't think a bad thought, they might not actually do a bad thing. This has been shown in papers from other groups as well. But eventually it started to learn to obfuscate and still do bad things, while not having bad-looking thoughts.

At that point, we found that while it still did a bit less bad stuff, you would have preferred to have a model that was openly doing bad stuff and thinking all the bad thoughts and using a monitor.

So, for instance, using the chain of thought, you could catch a higher percentage of the bad behavior of the model that wasn't trained with thought police than you could once you had done this whole thought police training.

And so, at the end of the day, you were better off not doing it.

...coincide, how do they work together, and what are your thoughts on that?

Absolutely. Yeah, I was on that paper, but if you look at the acknowledgement, or whatever, the contribution statement, I was just the manager of Leo. Leo, they're much more the experts in that.

But I appreciate the disclaimers, yeah.

I wouldn't take credit for that paper.

But I think that, in general, mechanistic interpretability...

I think the circuit sparsity work is awesome and a very preliminary, early sign of life for a type of model we could train that is natively more interpretable in its activation space.

And I'm excited for people to continue pushing on that and, more broadly, the mechanistic interpretability field.

Historically it's been pretty hard, mechanistic interpretability. We haven't solved it yet, and people have been working on it for a long time.

Could you, at a high level, explain to our viewers why it's so difficult?

Yeah, I would say I'm not an expert in this, but my take for why it's been so hard is that

I guess one way to break out the difference first between chain of thought and mechanistic interpretability is that mechanistic interpretability is more like doing a brain scan of all the neurons and activations in your brain and then trying to make some predictions about what you're going to do from them.

Which neuroscientists do do.
