Dwarkesh Patel
Two times four-ish milliseconds.
I don't know how many you said, but 10 milliseconds per token is actually a lot.
And because it's decode and sequential, it's also not like they stack up across the stages.
You can't do them at the same time.
That's right, yeah.
Okay, so I guess this brings us back to the question then.
Is the size of the scale-up at all relevant to why AI model sizes or whatever have been what they have been over the last few years, whether through training or through inference?
And the bandwidth problem helps you do longer context lengths, which is more and more relevant as these models get more agentic.
Okay.
A super tangential question.
There's Chinchilla scaling, which tells you how big a model should be relative to the amount of data you're going to train it on.
But now, obviously, you're not just trying to optimize for the highest quality model you could get with training compute.
You want the best results a user can get.
It's a mixture of training and inference compute.
So then there's a question of how much you should overtrain a model such that the total compute, amortized over training and inference, is minimized for a given level of performance.
But now with RL there's another consideration, which is that you're going to do some amount of pre-training.
That pre-training will be used both for RL generation and then for inference for the final user.
And by overtraining here, I mean this: purely from a training compute perspective, it would have been more efficient to have a bigger model that you train for less time, because it can learn faster.
But maybe you take a smaller model and spend more compute training it than you otherwise would have, and now it's cheaper to serve to users.
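To make the overtraining trade-off concrete, here is a rough sketch of the arithmetic. It is not something worked through in the conversation. It assumes the parametric loss fit published in the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, with that paper's fitted constants, plus the common approximations of about 6·N·D FLOPs for training and about 2·N FLOPs per generated token at inference. The function names and the `inference_tokens` parameter are hypothetical; `inference_tokens` is meant to lump together every token the model will ever generate, including RL rollouts as well as user traffic.

```python
import math

# Assumed constants: the parametric fit reported in the Chinchilla paper,
# L(N, D) = E + A / N**ALPHA + B / D**BETA. Treat them as illustrative.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def tokens_needed(n_params, target_loss):
    """Training tokens a model of n_params needs to hit target_loss.

    Returns math.inf when the model is too small to ever reach it.
    """
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        return math.inf
    return (B / gap) ** (1 / BETA)


def total_flops(n_params, target_loss, inference_tokens):
    """Lifetime compute: ~6*N*D for training plus ~2*N per generated token."""
    d = tokens_needed(n_params, target_loss)
    return 6 * n_params * d + 2 * n_params * inference_tokens


def best_size(target_loss, inference_tokens, lo=1e8, hi=1e13, steps=2000):
    """Grid-search model sizes (log-spaced) for the lowest lifetime compute."""
    best_n, best_c = None, math.inf
    for i in range(steps + 1):
        n = lo * (hi / lo) ** (i / steps)
        c = total_flops(n, target_loss, inference_tokens)
        if c < best_c:
            best_n, best_c = n, c
    return best_n, tokens_needed(best_n, target_loss), best_c


if __name__ == "__main__":
    for served in (0, 1e13):
        n, d, c = best_size(target_loss=2.0, inference_tokens=served)
        print(f"served={served:.0e}  params={n:.2e}  tokens={d:.2e}  "
              f"tokens/param={d / n:.0f}  flops={c:.2e}")
```

With `inference_tokens=0` the search recovers the training-only optimum for the target loss. As the number of served tokens grows, the cheapest way to reach the same loss shifts toward a smaller model trained on more tokens, which is the overtraining being described.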