Trenton Bricken
And I think in both cases, they're just very good at scaffolding and prompting the model.
I mean, even with the viral ChatGPT GeoGuessr capabilities, where it's just insanely good at spotting what beach you were on from a photo.
Kelsey Piper, who I think made this viral...
Their prompt is so sophisticated.
It's really long, and it encourages the model to come up with five different hypotheses, assign probabilities to them, and reason through the different aspects of the image that matter.
And I haven't A/B tested it, but I think unless you really encourage the model to be this thoughtful, you wouldn't get the level of performance that you see with that ability.
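(To make the structure concrete, here is a minimal sketch of what a prompt in that spirit might look like. This is not Kelsey Piper's actual prompt; the wording and the GEOLOCATION_PROMPT name are invented for illustration.)

```python
# A hypothetical prompt in the spirit described above. This is NOT the
# actual viral prompt; it just sketches the same structure: enumerate
# clues, propose several hypotheses, and assign explicit probabilities.
GEOLOCATION_PROMPT = """\
You are an expert at inferring where a photo was taken.

1. List every clue you can see: vegetation, architecture, signage
   language, road markings, sun position, soil color, license plates.
2. Propose FIVE distinct location hypotheses.
3. For each hypothesis, note which clues support or contradict it.
4. Assign a probability to each hypothesis (summing to roughly 1.0).
5. State your single best guess and your overall confidence.
"""
```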
Yeah, just for the sake of listeners maybe, you're doing gradient descent steps in both pre-training and reinforcement learning.
It's just that the signal's different.
Typically in reinforcement learning, your reward is sparser.
So you take multiple turns, and the only signal you're getting is whether you won the chess game or not.
And often you can't compute gradients through discrete actions.
And so you end up losing a lot of gradient signal. You can presume, then, that pre-training is more efficient, but there's no reason why you couldn't learn new abilities in reinforcement learning.
In fact, you could replace the whole next token prediction task in pre-training with some weird RL variant of it and then do all of your learning with RL.
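(As a rough illustration of the dense-versus-sparse signal point, here is a toy PyTorch sketch; the linear "model" and random data are stand-ins, not anyone's actual training code. Next-token prediction gets a cross-entropy term at every position, while the REINFORCE-style loss pushes a single win/loss scalar through the log-probabilities of sampled, non-differentiable actions.)

```python
import torch
import torch.nn.functional as F

vocab, seq_len, batch = 100, 32, 4
model = torch.nn.Linear(vocab, vocab)  # toy stand-in for a language model

tokens = torch.randint(vocab, (batch, seq_len))
inputs = F.one_hot(tokens[:, :-1], vocab).float()
logits = model(inputs)  # (batch, seq_len - 1, vocab)

# Pre-training: dense signal. Every position contributes a cross-entropy
# term, so every single token supervises the weights.
pretrain_loss = F.cross_entropy(
    logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1)
)

# RL (REINFORCE): sparse signal. Sample discrete actions, observe one
# scalar reward for the whole episode ("did you win the chess game"),
# and weight the sampled actions' log-probs by it. Sampling itself is
# non-differentiable, so no gradient flows through the actions, only
# through their log-probabilities -- hence the noisier, weaker signal.
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()                            # (batch, seq_len - 1)
reward = torch.randint(0, 2, (batch, 1)).float()   # one win/loss bit each
rl_loss = -(reward * dist.log_prob(actions)).mean()

(pretrain_loss + rl_loss).backward()
```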
Yeah, at the end of the day, it's just a signal, and then you're correcting to it.
Totally.
And then going back to the paper you mentioned, aside from the caveats that Sholto brings up, which I think is the first order, most important, I think zeroing in on the probability space of meaningful actions comes back to the nines of reliability.
And classically, if you give monkeys typewriters, eventually they'll write Shakespeare, right?
And so the action space for any of these real world tasks that we care about is so large