Sholto Douglas
So it's an efficiency question there.
Obviously, if you could give a dense reward for every token, if you had a supervised example, then that's one of the best things you could have.
But in many cases, it's very expensive to produce that whole scaffolded curriculum of everything to do.
Having PhD math students grade students is something which you can only afford for the select category of students that you've chosen to focus in on developing.
And you couldn't do that for all the language models in the world.
So, as a first step, obviously that would be better, but you're going to be optimizing this Pareto frontier of how much am I willing to spend on the scaffolding versus how much am I willing to spend on pure compute.
Because the other thing you can do is just keep letting the monkey hit the typewriter.
And if you have a good enough end reward, then eventually it will find its way.
And so I can't really talk about where exactly people sit on that scaffold.
I think different people, different tasks are on different points there.
And a lot of it depends on how strong your prior over the correct things to do is.
But that's the equation you're optimizing.
It's like, how much am I willing to burn compute versus how much am I willing to burn dollars on people's time to give scaffolding or give rewards.
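The trade-off described above can be made concrete with a toy sketch. It is not anything from the conversation itself: the vocabulary, the target string, the function names, and the two prices are all hypothetical, chosen only to show how an end-only reward burns compute while a per-token reward burns feedback signals.

```python
import random

ALPHABET = "abcd"   # toy vocabulary (assumption, for illustration only)
TARGET = "abcda"    # toy "correct answer" (assumption, for illustration only)


def sparse_run(rng):
    """End reward only: sample whole strings until one matches TARGET.

    Returns (tokens_sampled, feedback_signals). One signal is spent per
    complete attempt, because only the final result is checked.
    """
    attempts = 0
    while True:
        attempts += 1
        guess = "".join(rng.choice(ALPHABET) for _ in TARGET)
        if guess == TARGET:
            return attempts * len(TARGET), attempts


def dense_run(rng):
    """Per-token reward: resample each position until it is graded correct.

    Returns (tokens_sampled, feedback_signals). One signal is spent per
    sampled token, because every token is graded.
    """
    tokens = 0
    for ch in TARGET:
        while True:
            tokens += 1
            if rng.choice(ALPHABET) == ch:
                break
    return tokens, tokens


def total_cost(tokens, signals, price_per_token, price_per_signal):
    """Hypothetical cost model: compute spend plus feedback spend."""
    return tokens * price_per_token + signals * price_per_signal


def average(run, trials=100, seed=0):
    """Average (tokens, signals) over several independent runs."""
    rng = random.Random(seed)
    results = [run(rng) for _ in range(trials)]
    mean_tokens = sum(t for t, _ in results) / trials
    mean_signals = sum(s for _, s in results) / trials
    return mean_tokens, mean_signals
```

In this toy setting the end-only search needs on the order of 4^5 full attempts, while the per-token search needs on the order of 4 samples per position. Which one is cheaper overall depends on the assumed prices passed to `total_cost`: an automated end check can be nearly free per signal, while per-token grading by a person is not.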
MARK MANDELMANN- Interesting.
We are willing to do this for LLMs to some degree.
But there's an equation you're maximizing here.