Bowen Baker
Podcast Appearances
And it's not impossible, but it's hard.
You know, we still have a hard time doing it for mice and things like that.
Right.
Yeah.
And it's just really high dimensional.
There's a lot of things going on.
You know, you're trying to find, in these billions of neurons, which pattern amongst them is the reason why you do a bad thing, or the reason why you walk or twitch your muscle.
There's so many things they're doing all at once.
Yeah.
Whereas, you know, the chain of thought interpretability thing is a bit more like reading your inner monologue, which is much more high level and, I guess, lower information density.
It doesn't have everything there, but it does seem to have the high-level things most relevant to the model's actions, which is why it's been so successful in comparison.
But again, mechanistic interpretability does not have the same issues of fragility that chain of thought interpretability has because the activations aren't going to go anywhere.
The information should all still be there at the end of the day, but it might eventually not be in the chain of thought depending on how everything goes.
Yeah, so I would just give my own opinion, just as Bowen. But yeah, I'm pretty worried about open source.
You know, there's a reason why we try to control certain types of information from the public, or have controls over weapons and things like that.
And if you think a model could be utilized as any kind of harmful, weapon-like thing, whether it be making a bioweapon, which is a big thing people are worried about, or cyberattacks, which is another big thing people are worried about, then just having a downloadable weapon online sounds pretty bad to me.
Like, yeah.