Dwarkesh Patel

Speaker
15267 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Okay, so... The more sparsity you have, the less compute you need,

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

And it does seem that as batch sizes get bigger, compute ends up being the bottleneck, according to this analysis.
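
The batch-size claim above can be illustrated with a rough roofline-style estimate. This is only a sketch under stated assumptions, not the analysis from the episode: it assumes a dense model whose weights are read from memory once per decode step and roughly 2 FLOPs per parameter per token. The function names and all hardware numbers are hypothetical.

```python
def step_times(batch, params, flops_per_s, bytes_per_s, bytes_per_param=2):
    """Estimate compute time and weight-loading time for one decode step.

    Assumptions (not from the source): ~2 FLOPs per parameter per token,
    and every weight is read from memory exactly once per step.
    """
    compute_s = 2 * params * batch / flops_per_s      # grows with batch size
    memory_s = params * bytes_per_param / bytes_per_s  # independent of batch size
    return compute_s, memory_s


def bottleneck(batch, params, flops_per_s, bytes_per_s, bytes_per_param=2):
    """Return which of the two estimated times dominates the step."""
    compute_s, memory_s = step_times(
        batch, params, flops_per_s, bytes_per_s, bytes_per_param
    )
    return "compute" if compute_s > memory_s else "memory"


# Hypothetical hardware: 1e15 FLOP/s and 1e12 bytes/s; hypothetical 1e10-parameter model.
HW = dict(params=10**10, flops_per_s=10**15, bytes_per_s=10**12)

for b in (1, 100, 2000):
    print(b, bottleneck(b, **HW))
```

Under these assumptions the weight-loading time is fixed while the compute time scales linearly with batch size, so past some batch size compute becomes the larger term.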

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So then the question is, how far can you take sparsity?

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

That is to say, as the sparsity ratio increases, as you have fewer and fewer active parameters relative to total parameters, how much is performance of the model degrading?

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

And is it degrading faster than you're saving compute by increasing the sparsity factor?

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Should we pull up the paper now?

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

10x as many active parameters.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Yeah, so while it is true, I guess, that you get this benefit of being able to economize on your compute time if you increase sparsity,

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Naively, it would seem like, oh, that's a trade-off worth making.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

But if you're decreasing this by 2x and then having this go up by 8x, every time you double...

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So let me just make sure I understood.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

You're saying we want bigger... We want... Does it mean less time computing?

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Therefore, we do more sparsity.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

To make that work, we need bigger batch sizes, which means we need more memory capacity.
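
The chain of reasoning here (more sparsity, so bigger batches, so more memory capacity) can be sketched numerically. This is a hedged illustration, not the speakers' calculation: it assumes a mixture-of-experts model where all weights are read once per step, about 2 FLOPs per active parameter per token, and a fixed amount of per-sequence memory. The names `sparsity`, `kv_bytes_per_sequence`, and every number are hypothetical.

```python
def breakeven_batch(sparsity, flops_per_s, bytes_per_s, bytes_per_param=2):
    """Batch size at which estimated compute time equals weight-loading time.

    sparsity = total parameters / active parameters (1 means dense).
    Assumes all weights are read once per step and ~2 FLOPs per active
    parameter per token; both are simplifying assumptions.
    """
    return sparsity * bytes_per_param * flops_per_s / (2 * bytes_per_s)


def per_sequence_memory_bytes(batch, kv_bytes_per_sequence):
    """Memory needed for per-sequence state, assumed to grow linearly with batch."""
    return batch * kv_bytes_per_sequence


# Hypothetical hardware: 1e15 FLOP/s, 1e12 bytes/s; hypothetical 1e8 bytes of state per sequence.
for s in (1, 2, 4, 8):
    b = breakeven_batch(s, 10**15, 10**12)
    mem = per_sequence_memory_bytes(b, 10**8)
    print(f"sparsity={s}: breakeven batch ~{b:.0f}, per-sequence memory ~{mem / 1e9:.0f} GB")
```

In this simplified model the break-even batch size scales linearly with the sparsity factor, and the memory held for in-flight sequences scales with it.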

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Yeah, so... To have more sparsity.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So when you say any GPU in the pretense, the router is more than one GPU?

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Yeah.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Before we... It may be worth you explaining...

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

What exactly a rack is, the differences in bandwidth between racks and within a rack, and the all-to-all versus not-all-to-all nature of communication within versus outside a rack.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Sorry, is that question you just asked, basically, why isn't it a bigger scale-up?