John Schulman
like language, playing with language models. So, I mean, I'd say I've just dabbled with these things. And I'd say the people who do well at this kind of research have some view of the whole stack and have a lot of curiosity about the different parts of it, and also sort of think about...
Well, you want to be both empirical and let experiments update your views, but you also want to think from first principles somewhat.
Assuming that learning works, what would be the ideal type of data to collect and that sort of thing?
The training corpus, yeah. Okay, I'll try to respond to all that. So first, are we about to hit the data wall? I wouldn't draw too much from the time since GPT-4 was released, because it takes a while to train these models and to do all the prep to train a new generation of model.
So yeah, I wouldn't draw too much from that fact.
I would say there are definitely some challenges from the limited amount of data, but I wouldn't expect us to hit the data wall immediately. I would expect the nature of pre-training to change somewhat over time as we get closer to it.
In terms of generalization from different types of pre-training data,
I would say it's pretty hard to do science on this type of question, because you can't create that many pre-trained models.
So maybe you can't train a GPT-4-sized model; you can't do ablation studies at GPT-4 scale.
Maybe you can train a ton of GPT-2-sized models, or maybe even a GPT-3-sized model, with different data blends and see what you get.
So I'm not aware of any public results on ablations involving code data and reasoning performance and so forth. I'd be very interested to know about those results.
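A minimal sketch, just to illustrate the kind of small-scale data-blend ablation being described here. It is not something from the conversation: the blend names, token count, and the train_small_model and eval_reasoning helpers are hypothetical placeholders for whatever training and eval pipeline you actually have.

    from typing import Dict

    # Each blend is a mixture of pre-training data sources, as fractions summing to 1.
    BLENDS: Dict[str, Dict[str, float]] = {
        "web_only":      {"web": 1.0, "code": 0.0, "math": 0.0},
        "web_plus_code": {"web": 0.7, "code": 0.3, "math": 0.0},
        "web_code_math": {"web": 0.6, "code": 0.3, "math": 0.1},
    }

    def train_small_model(blend: Dict[str, float], tokens: int = 10_000_000_000):
        """Placeholder: train a GPT-2-scale model on `tokens` sampled per `blend`."""
        raise NotImplementedError("plug in your training stack here")

    def eval_reasoning(model) -> float:
        """Placeholder: score the model on a held-out reasoning benchmark."""
        raise NotImplementedError("plug in your eval harness here")

    def run_ablation() -> Dict[str, float]:
        # Train one small model per blend and compare downstream reasoning scores.
        scores: Dict[str, float] = {}
        for name, blend in BLENDS.items():
            model = train_small_model(blend)
            scores[name] = eval_reasoning(model)
        return scores

The point of doing this at small scale is only that it is affordable; as noted next, a null result at GPT-2 scale does not necessarily predict the outcome at larger scale.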
Right, you might not be able to conclude that if transfer fails at GPT-2 scale, then it's also going to fail at a higher scale.
So it might be that for the larger models you learn these better shared representations, or that the smaller models have to lean too much on memorization, whereas the larger models can learn how to do the right computation.
So I would expect this to be true to some extent.