
Reiner Pope

Speaker
1157 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Um.

When you look at some of the announcements, sometimes the API providers will brag about how much traffic they have.

The numbers I remember from some of the Gemini announcements last year were in the hundreds of millions of tokens per second worldwide.

So it's about a factor of a thousand; this is one thousandth of that.
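
(As a back-of-the-envelope check on that ratio, here's a tiny sketch; the 300M tokens/s figure below is just a stand-in for "hundreds of millions", not a number from the conversation.)

```python
# Back-of-the-envelope only: 300M tokens/s is an assumed stand-in for
# "hundreds of millions of tokens per second worldwide".
gemini_tokens_per_s = 300e6
ratio = 1 / 1000                                        # "one thousandth of that"
print(f"{gemini_tokens_per_s * ratio:,.0f} tokens/s")   # ~300,000 tokens/s
```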

Yeah, so for quality of the model, rather than speed of the model.

Yeah.

So unfortunately, we're not able to answer that analytically.

That is an empirical question of model quality.

Best I can do is pull up a paper and answer that empirically.

So this paper is Unified Scaling Laws for Routed Language Models.

It's a somewhat old paper by this stage, but one of the things they looked at is: if I keep increasing sparsity, what is the impact on model quality?

This answer is very sensitive to the actual choice of Mixture of Experts.

Mixture of Experts has been around for a really long time.

I think it was even back in 2017.

But the techniques have changed a lot.

DeepSeek's Mixture of Experts was a big change in how it worked.

There have been older papers like GShard and Switch Transformer.

So the actual empirical results are going to depend on all of that.

But for one of the older techniques shown here, you can see that if I hold the number of active parameters constant at a certain size and then increase the sparsity, which they call expert count here, the quality keeps increasing.
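
(To make the active-parameters-versus-expert-count distinction concrete, here's a minimal parameter-accounting sketch for a standard top-k-routed FFN Mixture of Experts; the layer sizes and expert counts are hypothetical, not taken from the paper.)

```python
def moe_ffn_params(d_model=4096, d_ff=16384, n_layers=32, n_experts=64, top_k=2):
    """Count FFN parameters in a top-k-routed MoE (attention ignored); sizes are illustrative."""
    per_expert = 2 * d_model * d_ff            # up-projection + down-projection weights
    total = n_layers * n_experts * per_expert  # grows linearly with expert count ("sparsity")
    active = n_layers * top_k * per_expert     # what a single token actually uses
    return total, active

for n_experts in (8, 64, 256):
    total, active = moe_ffn_params(n_experts=n_experts)
    print(f"{n_experts:>3} experts: total FFN {total / 1e9:6.1f}B params, "
          f"active per token {active / 1e9:4.1f}B")
```

Raising `n_experts` with `top_k` fixed is the sweep described above: total parameters grow while the active parameters per token, and hence the per-token compute, stay constant.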

And then if you imagine drawing a horizontal line from