Reiner Pope
When you look at some of the announcements, API providers will sometimes brag about how much traffic they have.
The numbers I remember from some of the Gemini announcements last year were in the hundreds of millions of tokens per second worldwide.
So this is about one thousandth of that.
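A quick back-of-envelope sketch of that comparison; the worldwide figure is a rough recollection from those announcements, so every number here is purely illustrative:

```python
# Back-of-envelope throughput comparison. All figures are illustrative
# recollections from the discussion above, not official numbers.
worldwide_tokens_per_sec = 300e6             # "hundreds of millions" worldwide
one_thousandth = worldwide_tokens_per_sec / 1000  # "one thousandth of that"
print(f"{one_thousandth:,.0f} tokens/sec")   # -> 300,000 tokens/sec
```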
Yeah, so the quality of the model, rather than the speed of the model.
Yeah.
So unfortunately, we're not able to answer that analytically.
That is an empirical question of model quality.
The best I can do is pull up a paper and answer it empirically.
So this paper is Unified Scaling Laws for Routed Language Models.
It's a somewhat old paper by this stage, but one of the things they did was look at: if I keep increasing sparsity, what is the impact on model quality?
The answer is very sensitive to the actual choice of Mixture of Experts technique.
Mixture of Experts has been around for a really long time, going back to at least 2017 I think, but the techniques have changed a lot.
DeepSeek's Mixture of Experts was a big change in how it worked.
Before that, there were older papers like GShard and Switch Transformer.
So the actual empirical results are going to depend on all of that.
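For readers who want the mechanism pinned down, here is a minimal sketch of the top-k routing that these Mixture of Experts variants share; the plain-numpy implementation, shapes, and sizes are illustrative assumptions, not the method of any one paper mentioned above:

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    x:       (num_tokens, d_model) token activations
    gate_w:  (d_model, num_experts) router weights
    experts: list of (d_model, d_model) weight matrices, one per expert
    """
    logits = x @ gate_w                            # (tokens, experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of top_k experts
    # Softmax over only the selected experts' logits.
    sel = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(sel - sel.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)

    out = np.zeros_like(x)
    for i, token in enumerate(x):
        for j, e in enumerate(top[i]):
            # Only top_k experts run per token; the rest are skipped entirely.
            out[i] += weights[i, j] * (token @ experts[e])
    return out

d_model, num_experts = 16, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_model))
gate_w = rng.normal(size=(d_model, num_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
y = moe_layer(x, gate_w, experts)  # 2 of 8 experts active per token
```

The key property is that per-token compute depends only on top_k, not on the total expert count, which is what lets sparsity grow while active parameters stay fixed.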
But for one of the older techniques shown here, you can see that if I hold the number of active parameters constant at a certain size and then increase the sparsity, which they call expert count here, the quality keeps increasing.
And then if you imagine drawing a horizontal line from
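To make the setup behind that figure concrete, here is a hedged sketch of the parameter accounting; the expert size and top_k are hypothetical, chosen only to show how total parameters grow with expert count while active parameters stay fixed:

```python
# Parameter accounting for the experiment described above. Sizes are
# illustrative assumptions, not numbers from the paper.
params_per_expert = 100_000_000  # hypothetical expert size
top_k = 2                        # experts active per token

for num_experts in (8, 32, 128):
    total = num_experts * params_per_expert
    active = top_k * params_per_expert  # constant as expert count grows
    print(f"{num_experts:4d} experts: total={total/1e9:5.1f}B, "
          f"active={active/1e9:.1f}B")
```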