Trenton Bricken
that you really do care about getting the model to zero in on doing the reasonable things.
Yes.
Yeah.
I mean, to make the mapping from pre-training really explicit here:
During pre-training, the large language model is predicting the next token from its vocabulary of, let's say, I don't know, 50,000 tokens.
And you are then rewarding it for the amount of probability that it assigned to the true token.
And so you could think of it as a reward, but it's a very dense reward where you're getting signal at every single token, and you're always getting some signal.
Even if it only assigned 1% to that token or less, you're like, oh, I see you assigned 1%, good job, keep doing that.
Upweight it.
Yeah, exactly.
That's right, yeah, yeah, yeah.
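[A minimal sketch, not from the conversation, of the dense per-token signal described above: at every position the feedback is the log-probability the model assigned to the true next token, so even a 1% guess still produces a gradient. The vocabulary size, sequence length, and random logits are illustrative assumptions.]

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50_000          # the ~50k-token vocabulary mentioned above
seq_len = 8                  # a toy sequence length

# Hypothetical model outputs: unnormalized scores (logits) at each position.
logits = rng.normal(size=(seq_len, vocab_size))
true_tokens = rng.integers(0, vocab_size, size=seq_len)

# Softmax over the vocabulary at each position.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# "Dense reward": the log-prob of the true token at every single position.
per_token_signal = np.log(probs[np.arange(seq_len), true_tokens])
print(per_token_signal)      # one number of feedback per token, never zero

# Contrast with a sparse, outcome-only reward: a single scalar for the whole
# sequence, e.g. 1 only if the final answer happens to be correct.
sparse_reward = 0.0
```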
You think so?
I don't know.
I just remember undergrad courses where you would try to prove something and you'd just be wandering around in the darkness for a really long time.
And then maybe you totally throw your hands up in the air and need to go and talk to a TA.
And it's only when you talk to a TA that you can see where along the path of different solutions you were incorrect, and what the correct thing to have done would have been.
And that's in the case where you know what the final answer is, right?
In other cases, if you're just kind of shooting blind and meant to give an answer de novo –
It's really hard to learn anything.
But I think there's a lot of implicit dense reward signals here.