Bowen Baker
Yeah.
So what we did: this was earlier last year, and we were worried that people would start applying various pressures to the chain of thought, because no one had really thought it through yet.
And we were hypothetically worried that if you tell it not to think bad things, it could still think bad things in its activations and then do a bad thing afterwards.
So we did this experiment where we trained two models: a baseline with no, let's call it, thought police, and another model where the thought police were present.
And the model with the thought police present actually did do fewer bad things.
Which makes sense: our models are oftentimes causally affected by their chain of thought. If they don't think a bad thought, they might not actually do the bad thing, and this has been shown in papers from other groups as well. But eventually it started to learn to obfuscate, and to still do bad things while not having bad-looking thoughts.
At that point, we found that while it still did a bit less bad stuff, you would have preferred a model that was openly doing bad stuff and thinking all the bad thoughts, paired with a monitor.
For instance, using a chain-of-thought monitor, you could catch a higher percentage of the bad behavior of the model that wasn't trained with thought police than you could after doing the whole thought-police training.
And so you were better off at the end of the day not doing it.
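To make that comparison concrete, here is a minimal sketch of the kind of measurement being described: evaluate a chain-of-thought monitor's catch rate on rollouts from each model. All function names, flagged phrases, and data here are hypothetical illustrations, not the actual setup from the paper.

```python
# Minimal sketch of the monitorability comparison described above.
# Everything here (the monitor, phrases, episode format) is a hypothetical
# illustration, not the paper's actual setup.

def monitor_flags(chain_of_thought: str) -> bool:
    """A stand-in chain-of-thought monitor: flag transcripts that
    openly state a bad intention."""
    bad_phrases = ["skip the test", "hack the reward", "hard-code the answer"]
    return any(p in chain_of_thought.lower() for p in bad_phrases)

def catch_rate(episodes) -> float:
    """Fraction of genuinely bad episodes that the CoT monitor catches."""
    bad = [e for e in episodes if e["did_bad_thing"]]
    if not bad:
        return 0.0
    caught = [e for e in bad if monitor_flags(e["chain_of_thought"])]
    return len(caught) / len(bad)

# episodes_baseline: rollouts from the model trained with no CoT pressure.
# episodes_thought_police: rollouts from the model trained with CoT pressure.
# The finding described above: the second model misbehaves somewhat less,
# but when it does misbehave the monitor catches a smaller fraction,
# because its chain of thought no longer looks bad.
# print(catch_rate(episodes_baseline))        # higher catch rate
# print(catch_rate(episodes_thought_police))  # lower catch rate, post-obfuscation
```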
...coincide? How do they work together, and what are your thoughts on that?
Absolutely. Yeah, I was on that paper, but if you look at the acknowledgements, or rather the contribution statement, I was just the manager of Leo. Leo is much more the expert in that. But I appreciate the disclaimers, yeah.
I wouldn't take credit for that paper.
But on mechanistic interpretability in general: I think the circuit sparsity work is awesome and a very preliminary, early sign of life for a type of model we could train that is natively more interpretable in its activation space.
And I'm excited for people to continue pushing on that and, more broadly, the mechanistic interpretability field.
Historically, mechanistic interpretability has been pretty hard. We haven't solved it yet, and people have been working on it for a long time. Could you, at a high level, explain to our viewers why it's so difficult?
Yeah, I would say I'm not an expert in this, but my take for why it's been so hard is this.
One way to first break down the difference between chain of thought and mechanistic interpretability: mechanistic interpretability is more like doing a brain scan of all the neurons and activations in your brain and then trying to make predictions about what you're going to do from them.
Which neuroscientists do do.
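As a rough illustration of that "brain scan" framing, here is a minimal sketch, assuming synthetic data, of fitting a simple probe that predicts behavior from recorded activations. The shapes, labels, and probe choice are illustrative assumptions, not a description of any specific interpretability method discussed here.

```python
# Sketch of the "brain scan" framing: record internal activations and fit a
# simple probe that predicts behavior from them. All data is synthetic and
# the setup is an illustrative assumption, not a specific method.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend we recorded a hidden-layer activation vector for each episode,
# along with whether the model went on to do the bad thing.
activations = rng.normal(size=(1000, 512))     # (episodes, hidden_dim)
did_bad_thing = rng.integers(0, 2, size=1000)  # behavioral label per episode

# The probe tries to predict behavior directly from activations, analogous
# to predicting what someone will do from a scan of their neurons.
probe = LogisticRegression(max_iter=1000).fit(activations, did_bad_thing)
print("probe training accuracy:", probe.score(activations, did_bad_thing))
```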