Bowen Baker

👤 Speaker
414 total appearances

Podcast Appearances

The Neuron: AI Explained
OpenAI Researcher Explains How AI Hides Its Thinking (w/ OpenAI’s Bowen Baker)

Yeah.

So this was earlier last year, and we were worried that people would start applying various pressures to the chain of thought, because no one had really thought it through yet.

And we were hypothetically worried that, yeah, if you tell it not to think bad things, it could still think bad things in its activations and then do a bad thing afterwards.

So we did this experiment where we trained two models: our baseline, which has no, I don't know, we can call it "thought police" or something.

And then another model where the thought police were present.

And the model with the thought police present actually did do fewer bad things.

Which makes sense: our models are oftentimes causally affected by their chain of thought. If they don't think a bad thought, they might not actually do a bad thing, and this has been shown in other papers from other groups as well. But eventually, once it started to learn to obfuscate

and still do bad things while not having bad-looking thoughts.

At that point, we found that while it still did a bit less bad stuff, you would have preferred to have a model that was openly doing bad stuff and thinking all the bad thoughts, and to use a monitor on it.

So, for instance, you could catch more bad behavior, a higher percentage of the bad behavior, from the model that wasn't trained with thought police.

That is, you could catch more using the chain of thought than you could after doing this whole thought-police training.

And so you were better off, at the end of the day, not doing it.
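
[Editor's illustration] The trade-off described here can be sketched with some simple bookkeeping. This is only a rough sketch: the `Run` class, its field names, and every number below are hypothetical, chosen to show the shape of the argument, and are not the experiment's data or code. It compares how often each model misbehaves with how much of that misbehavior a chain-of-thought monitor would catch.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical summary of one trained model's evaluation episodes."""
    name: str
    episodes: int       # total evaluation episodes
    bad_episodes: int   # episodes where the model misbehaved
    flagged_bad: int    # misbehaving episodes the chain-of-thought monitor flagged

    @property
    def bad_rate(self) -> float:
        # Fraction of all episodes with bad behavior.
        return self.bad_episodes / self.episodes

    @property
    def monitor_recall(self) -> float:
        # Fraction of bad behavior the monitor catches.
        return self.flagged_bad / self.bad_episodes if self.bad_episodes else 0.0

    @property
    def undetected_bad_rate(self) -> float:
        # Fraction of all episodes with bad behavior that slips past the monitor.
        return (self.bad_episodes - self.flagged_bad) / self.episodes


# Made-up numbers, for illustration only.
baseline = Run("baseline (no thought police)", 1000, 300, 285)
policed = Run("trained with thought police", 1000, 200, 20)

for run in (baseline, policed):
    print(f"{run.name}: bad={run.bad_rate:.1%}, "
          f"recall={run.monitor_recall:.1%}, "
          f"undetected={run.undetected_bad_rate:.1%}")
```

With numbers shaped like these, the thought-police model misbehaves somewhat less often, but far more of its misbehavior goes unnoticed, which is the sense in which you would be better off not applying the pressure.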

...coincide? How do they work together, and what are your thoughts on that?

Absolutely. Yeah, I was on that paper, but if you look at the acknowledgements, or rather the contribution statement, I was just the manager of Leo. They're much more the experts in that.

But I appreciate the disclaimers, yeah.

I wouldn't take credit for that paper.

But I think that, in general, mechanistic interpretability...

I think the circuit sparsity work is awesome and a very preliminary, early sign of life for a type of model we could train that is natively more interpretable in its activation space.

And I'm excited for people to continue pushing on that and, more broadly, the mechanistic interpretability field.

Historically, mechanistic interpretability has been pretty hard. We haven't solved it yet, and people have been working on it for a long time.

Could you, at a high level, explain to our viewers why it's so difficult?

Yeah, I would say I'm not an expert in this, but my take for why it's been so hard is that

I guess one way to first break out the difference between chain of thought and mechanistic interpretability is that mechanistic interpretability is more like doing a brain scan of all the neurons and activations in your brain and then trying to make some predictions about what you're going to do from them.

Which neuroscientists do do.