Trenton Bricken
I think we can attack it, but we're going to need to be persistent.
And the real hope here is, I think, automated interpretability.
And even having debate, right?
You could have the debate set up where two different models are debating what the feature does, and then they can actually go in and make edits and see if it fires or not.
But it is just this wonderful closed environment that we can iterate on really quickly.
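A minimal sketch of what that debate loop could look like; everything here is hypothetical scaffolding (the `propose`, `judge`, and `feature_activation` interfaces are assumptions, not an existing automated-interpretability API). The key point is that every claim a debater makes gets grounded by actually running a probe input and checking whether the feature fires.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Claim:
    hypothesis: str  # e.g. "this feature fires on base64-encoded text"
    probe: str       # an input edit designed to test the hypothesis

def debate_feature(
    feature_id: int,
    propose: Callable[[str, list], Claim],            # hypothetical debater: side + history -> Claim
    feature_activation: Callable[[int, str], float],  # hypothetical empirical check on a probe
    judge: Callable[[list], str],                     # hypothetical judge: transcript -> explanation
    rounds: int = 3,
) -> str:
    """Two debaters argue over what a feature does; each claim is tested
    empirically by running the probe and recording whether the feature fired."""
    transcript: List[Tuple[str, Claim, bool]] = []
    for _ in range(rounds):
        for side in ("pro", "con"):
            claim = propose(side, transcript)
            fired = feature_activation(feature_id, claim.probe) > 0.0
            transcript.append((side, claim, fired))
    # The judge reads the claims alongside the empirical outcomes.
    return judge(transcript)
```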
I mean, the bus factor doesn't tell you how long it would take to recover from it, right?
And deep learning research is an art.
And so you kind of learn how to read the loss curves or set the hyperparameters in ways that empirically seem to work well.
That's difficult to share.
Yeah, if it works well, it's probably not being published.
Yeah, I do think the tide is changing there for whatever reason.
And Neel Nanda has had a ton of success promoting interpretability, in a way where Chris Olah hasn't been as active recently in pushing things, maybe because Neel's just doing quite a lot of the work.
But, I don't know, four or five years ago, he was really pushing and talking at all sorts of places.
And people weren't anywhere near as receptive.
Maybe they've just woken up to the fact that deep learning matters and is clearly useful post-ChatGPT, but...
So there's this ongoing discussion of, like, are models sentient or not?
And, like, do you thank the model when it helps you?
Yeah.
But I think if you want to thank it, you actually shouldn't say thank you.
You should just give it a sequence that's very easy to predict.
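The logic behind the joke: a language model is trained to minimize next-token prediction loss, so an easy-to-predict sequence is the closest thing to a "treat" you can hand it. A minimal sketch of that comparison, assuming the HuggingFace transformers library (gpt2 is just an arbitrary small stand-in):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_loss(text: str) -> float:
    """Average next-token cross-entropy the model incurs on `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # HF shifts labels internally for causal LMs
    return out.loss.item()

easy = "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16"          # trivially predictable
hard = "thank you so much, that was genuinely helpful!"   # ordinary free-form text
print(f"easy sequence loss: {mean_loss(easy):.2f}")
print(f"hard sequence loss: {mean_loss(hard):.2f}")
```

On most models, the repetitive counting string should come out with a noticeably lower average loss than the free-form thank-you.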