Dwarkesh Patel

Speaker
15267 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Two times four-ish milliseconds.

I don't know how many you said, but 10 milliseconds per token is actually a lot.

And because it's decode and sequential, it's also not like they stack up across the stages.

You can't do them at the same time.

That's right, yeah.

Okay, so I guess this brings us back to the question then.

Is the size of the scale-up at all relevant to why AI model sizes or whatever have been what they have been over the last few years, whether through training or through inference?

And the bandwidth problem helps you do longer context lengths, which is more and more relevant as these models get more agentic.

Okay.

A super tangential question.

There's Chinchilla scaling, which tells you how big a model should be relative to the amount of data you're going to train it on.
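As a rough illustration of that sizing rule, here is a minimal sketch. It assumes two commonly quoted approximations that are not stated in the conversation: training costs about 6 FLOPs per parameter per token, and the compute-optimal ratio is roughly 20 training tokens per parameter. The function name and the example budget are hypothetical.

```python
import math

# Assumed rule-of-thumb constants, not taken from the conversation.
TOKENS_PER_PARAM = 20.0        # compute-optimal tokens per parameter (approximate)
FLOPS_PER_PARAM_TOKEN = 6.0    # training FLOPs per parameter per token (approximate)


def chinchilla_optimal(train_flops):
    """Return (params, tokens) that spend train_flops at the assumed ratio."""
    # C = 6 * N * D and D = 20 * N  =>  N = sqrt(C / (6 * 20))
    params = math.sqrt(train_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Hypothetical training budget of 1e24 FLOPs.
params, tokens = chinchilla_optimal(1e24)
print(f"{params:.3g} parameters, {tokens:.3g} tokens")
```

Under these assumptions, a larger budget grows the model and the dataset together, each scaling with the square root of compute.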

But now, obviously, you're not just trying to optimize for the highest quality model you could get with training compute.

You want the best results a user can get.

It's a mixture of training and inference compute.

So then there's a question of how much you should overtrain a model such that the compute, amortized over training and inference, is minimized for a certain level of performance.

But now with RL, there's another consideration, which is that you're going to do some amount of pre-training.

That pre-training will be used both for RL generation and then for inference for the final user.

And by overtraining here, I mean: while it would have been more efficient, purely from a training-compute perspective, to have a bigger model that you train for less time because it can learn faster, maybe you instead take a smaller model and spend more compute training it than you otherwise would have, but now it's cheaper to serve to users.
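To make that trade-off concrete, here is a minimal sketch of the break-even point. It assumes training costs about 6 FLOPs per parameter per token and generation about 2 FLOPs per parameter per token, which are common approximations rather than figures from the conversation. The two model configurations are hypothetical and are simply assumed to reach the same quality. The served-token count could be read as including RL generation as well as end-user inference.

```python
def lifetime_flops(params, train_tokens, served_tokens):
    """Training plus serving compute under the assumed 6ND and 2N-per-token costs."""
    return 6.0 * params * train_tokens + 2.0 * params * served_tokens


def break_even_served_tokens(big, small):
    """Served tokens beyond which the smaller, overtrained model is cheaper overall.

    big and small are (params, train_tokens) pairs assumed to reach the same quality.
    """
    (n_big, d_big), (n_small, d_small) = big, small
    # Extra training compute paid up front for the smaller model.
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    # Compute saved on every token the smaller model later generates.
    saving_per_token = 2.0 * (n_big - n_small)
    return extra_training / saving_per_token


# Hypothetical pair, assumed (not measured) to be equally good.
big = (70e9, 1.4e12)
small = (35e9, 4.0e12)
print(f"break-even at {break_even_served_tokens(big, small):.3g} served tokens")
```

Below the break-even count the bigger model is cheaper in total; above it, the extra training spent on the smaller model pays for itself.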