Trenton Bricken
But its chain of thought is totally misleading.
Like, it will make up random stuff that sounds plausible, or that tries to sound as plausible as possible. But it's not at all representative of the true answer.
Totally.
Yeah, yeah.
It's just that some people will hail chain-of-thought reasoning as a great way to solve AI safety.
Oh, I see.
And it's like, actually, we don't know whether we can trust it.
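One way to probe this kind of unfaithfulness empirically is a perturbation test: gut the chain of thought and re-ask, and if the final answer survives, the stated reasoning probably wasn't doing the work. A minimal sketch follows; the `ask` callable standing in for a model API, and the "Answer: ..." output format, are assumptions for illustration, not any specific lab's method.

```python
from typing import Callable

def final_answer(completion: str) -> str:
    """Pull out the answer, assuming the model reports it as 'Answer: ...'."""
    for line in completion.splitlines():
        if line.strip().startswith("Answer:"):
            return line.split("Answer:", 1)[1].strip()
    return completion.strip()

def cot_seems_load_bearing(ask: Callable[[str], str], question: str) -> bool:
    """Perturbation test: blank out the chain of thought and re-ask.
    If the answer never changes, the stated reasoning was likely decorative."""
    base = ask(f"{question}\nThink step by step, then finish with 'Answer: ...'")
    reasoning = [l for l in base.splitlines()
                 if not l.strip().startswith("Answer:")]
    blanked = "\n".join("[reasoning removed]" for _ in reasoning)
    perturbed = ask(f"{question}\n{blanked}\nAnswer:")
    return final_answer(base) != final_answer(perturbed)
```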
And that's like two very simple agents, right?
I mean, I think a nice halfway house here would be features that you learned from dictionary learning.
Yeah, that would be really cool.
Where it's like you get more internal access, but a lot of it is much more human interpretable.
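For context, a minimal sketch of the dictionary-learning idea in PyTorch; the dimensions and L1 coefficient here are illustrative assumptions, not any lab's actual setup. A sparse autoencoder is trained to reconstruct a model's activations through an overcomplete basis of features that are encouraged to fire sparsely.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy dictionary-learning setup: reconstruct model activations through
    an overcomplete set of features that are encouraged to fire sparsely."""
    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # non-negative, mostly-zero codes
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                       # illustrative sparsity penalty

acts = torch.randn(64, 512)           # stand-in for residual-stream activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Each learned feature direction can then be inspected on the inputs that activate it most, which is where the human-interpretable handle comes from.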
You can also have each of the agents be a smaller model that's cheaper to run, and you can fine-tune it so that it's actually good at the task.
Yeah, you can't train on the RL reward unless the model's generations actually get some reward.
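The sparse-reward point can be made concrete with a toy REINFORCE sketch (all numbers are illustrative): when every sampled trajectory earns reward 0, the policy-gradient estimate is exactly zero, so there is nothing to train on.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(4)                 # toy policy over 4 discrete actions

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def policy_gradient(logits, actions, rewards):
    """REINFORCE estimate: mean over samples of reward * d/dlogits log pi(a)."""
    probs = softmax(logits)
    grad = np.zeros_like(logits)
    for a, r in zip(actions, rewards):
        g = -probs.copy()
        g[a] += 1.0                  # gradient of log-softmax at the sampled action
        grad += r * g
    return grad / len(actions)

probs = softmax(logits)
actions = rng.choice(len(logits), size=32, p=probs)
rewards = np.zeros(32)               # the model never succeeds, so reward is always 0
print(policy_gradient(logits, actions, rewards))  # => [0. 0. 0. 0.]: no signal
```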
Lets you encode the important shit to not die.
Yeah, I'm the wrong person to ask, but there are interesting interpretability pieces where, if we fine-tune on math problems, the model just gets better at entity recognition.
So there's a paper from David Bau's lab recently where they investigate what actually changes in a model when you fine-tune it, with respect to the attention heads and these sorts of things.
And they have this synthetic problem of: Box A has this object in it, Box B has this other object in it.
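The actual paper's setup may differ, but a synthetic binding task of that flavor is easy to sketch; the object vocabulary and phrasing below are made up for illustration. Each example states several "Box X contains Y" facts and queries one box, so you can test which attention heads resolve the entity lookup before and after fine-tuning.

```python
import random

OBJECTS = ["ball", "key", "apple", "coin", "book"]  # made-up object vocabulary

def make_box_example(n_boxes=3, seed=None):
    """One synthetic binding problem: several 'Box X contains Y' facts,
    then a query about a single box. Returns (prompt, answer)."""
    rng = random.Random(seed)
    boxes = [chr(ord("A") + i) for i in range(n_boxes)]
    contents = rng.sample(OBJECTS, n_boxes)
    facts = ". ".join(f"Box {b} contains the {o}" for b, o in zip(boxes, contents))
    query = rng.choice(boxes)
    return f"{facts}. What does Box {query} contain?", contents[boxes.index(query)]

prompt, answer = make_box_example(seed=0)
print(prompt)
print(answer)  # the queried box's object; probe which heads resolve the lookup
```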