Sholto Douglas
It's a domain which just naturally lends itself to this.
Does it compile?
Does it pass the test?
You can go on LeetCode and you can run tests and you know whether or not you got the right answer.
But there isn't the same kind of thing for writing a great essay.
The question of taste in that regard is quite hard.
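To make that contrast concrete, here is a minimal sketch of what a "verifiable" reward for code can look like, assuming a pytest-style test suite is available; the function name and directory layout are illustrative, not something mentioned in the conversation. The grader just runs the tests and returns a binary signal, which is exactly the kind of signal an essay lacks.

```python
import subprocess

def verifiable_reward(solution_dir: str) -> float:
    """Illustrative binary reward: 1.0 if the candidate solution's tests pass, else 0.0.

    Assumes `solution_dir` contains a pytest-style test suite and that pytest
    is installed; both are assumptions for this sketch.
    """
    result = subprocess.run(
        ["python", "-m", "pytest", "--quiet", solution_dir],
        capture_output=True,
    )
    # Exit code 0 means every test passed; anything else counts as failure.
    return 1.0 if result.returncode == 0 else 0.0
```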
We discussed at dinner the other night which would come first: a Pulitzer Prize-winning novel or a Nobel Prize, or something like that.
And I actually think a Nobel Prize is more likely than a Pulitzer Prize-winning novel in some respects.
Because a lot of the tasks required in winning a Nobel Prize, or at least strongly assisting in winning a Nobel Prize, have more layers of verifiability built up.
So I expect them to accelerate the process of doing Nobel Prize-winning work more, initially, than that of writing Pulitzer Prize-worthy novels.
Copy paste, copy paste, copy paste.
Right, like carving away the marble on this.
I think it's worth noting that that paper was, I'm pretty sure, on the Llama and Qwen models.
And I'm not sure how much RL compute they used, but I don't think it was anywhere near comparable to the amount of compute that was used in the base models.
And so I think the amount of compute that you use in training is a decent proxy for the amount of actual raw new knowledge or capabilities you're adding to a model.
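As a rough illustration of that proxy, here is a back-of-the-envelope sketch using the common ~6 × parameters × tokens rule of thumb for dense-transformer training FLOPs; the parameter and token counts below are made-up illustrative numbers, not figures from the paper or models being discussed.

```python
def train_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute for a dense transformer: ~6 * N * D FLOPs."""
    return 6.0 * params * tokens

# Illustrative (made-up) numbers: a 70B-parameter base model pretrained on
# 15T tokens versus an RL stage that sees on the order of 1B tokens.
pretrain = train_flops(70e9, 15e12)   # ~6.3e24 FLOPs
rl_stage = train_flops(70e9, 1e9)     # ~4.2e20 FLOPs

print(f"RL compute is ~{rl_stage / pretrain:.2e} of pretraining compute")
```

Under those assumptions the RL stage is a tiny fraction of the total compute, which is the sense in which compute serves as a proxy for how much new knowledge the RL stage could plausibly be adding.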
So my prior, at least if you look at all of DeepMind's RL research from before, is that
RL was able to teach these Go- and chess-playing agents new knowledge that was in excess of human-level performance just from RL signal, provided the RL signal was sufficiently clean.
So there's nothing structurally limiting about the algorithm here that prevents it from imbuing the neural net with new knowledge.