Dwarkesh Patel

Speaker
15267 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So, as...

the length of the prefix goes up or pass your memory bandwidth time declines, and that means that, to the extent that you were bottlenecked on memory bandwidth before, you can avoid being bottlenecked on memory bandwidth. The fact that they are charging 5x less for

pre-fill than decode does suggest that they are bottlenecked on memory bandwidth to quite a degree, such that for them at least, because T is equivalent to cost, right?

It's the cost of renting the compute.

This is actually like, this would be at one and this would be at five.

That's right.

Yeah.

So it is in fact tremendously memory bandwidth bottlenecked.

The real graph looks something like, the real graph looks something like, like that.

So, yeah, let me do it this way.

Yeah, that's right.

And then this is the gap on decode between the memory and the compute time.
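
The gap between memory time and compute time can be illustrated with a rough roofline-style estimate. This is a minimal sketch, not the calculation from the episode: the model size, FLOP rate, and bandwidth below are hypothetical, the function names are invented for illustration, and it assumes about 2 FLOPs per parameter per token with all weights read once per forward pass, ignoring KV-cache traffic.

```python
def pass_times(n_params, tokens_per_pass, flops_per_s, bytes_per_s, bytes_per_param=2):
    """Return (compute_seconds, memory_seconds) for one forward pass.

    Assumes ~2 FLOPs per parameter per token and that every weight is read
    from memory once per pass; KV-cache traffic is ignored.
    """
    compute_s = 2 * n_params * tokens_per_pass / flops_per_s
    memory_s = n_params * bytes_per_param / bytes_per_s
    return compute_s, memory_s


def time_per_token(n_params, tokens_per_pass, flops_per_s, bytes_per_s):
    """Pass time is set by the slower resource, then amortized over its tokens."""
    compute_s, memory_s = pass_times(n_params, tokens_per_pass, flops_per_s, bytes_per_s)
    return max(compute_s, memory_s) / tokens_per_pass


# Hypothetical model and accelerator, chosen only for illustration.
N_PARAMS = 70e9      # parameters
FLOPS = 1e15         # FLOP/s
BANDWIDTH = 3e12     # bytes/s

decode = pass_times(N_PARAMS, 1, FLOPS, BANDWIDTH)      # one token per pass
prefill = pass_times(N_PARAMS, 4096, FLOPS, BANDWIDTH)  # many tokens per pass
print("decode  (compute_s, memory_s):", decode)
print("prefill (compute_s, memory_s):", prefill)
```

Under these assumptions the decode pass spends far longer reading weights than computing, while the prefill pass amortizes the same weight read over many tokens and becomes compute-bound. Batching many sequences per decode pass would narrow that gap.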

Yeah, yeah.

Interesting.

Another interesting one would be why cache hits are so much cheaper.

So I think, if I remember correctly, cache hits are like 10x.

It's more expensive to write to cache according to the pricing on all these models.

But if you do hit a cache, it's 10x cheaper.

So what is going on with... Presumably, this is the cost of keeping something in HBM rather than just evacuating it.

But if you do keep it in HBM, then it's cheaper to load again?
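
The pricing asymmetry mentioned above (a cache write costs more than a normal input token, a cache hit costs about 10x less) can be turned into a small break-even calculation. This is only a sketch: the function name and the 1.25x write multiplier are assumptions for illustration, since the transcript says only that writes are "more expensive", and the 0.1x hit multiplier is taken from the "10x cheaper" remark.

```python
def prefix_cost(prefix_tokens, n_requests, base_price,
                write_mult=1.25, hit_mult=0.1, use_cache=True):
    """Total cost of sending the same prefix in n_requests requests.

    write_mult is an assumed premium for the first (cache-writing) request;
    hit_mult reflects the "10x cheaper" figure for later cache hits.
    """
    if not use_cache:
        return n_requests * prefix_tokens * base_price
    first = prefix_tokens * base_price * write_mult
    later = (n_requests - 1) * prefix_tokens * base_price * hit_mult
    return first + later


# Hypothetical prompt: a 1000-token shared prefix at a unit price per token.
for n in (1, 2, 10):
    cached = prefix_cost(1000, n, 1.0)
    uncached = prefix_cost(1000, n, 1.0, use_cache=False)
    print(n, "requests: cached =", cached, "uncached =", uncached)
```

With these assumed multipliers, caching costs more for a single request but is already cheaper by the second request that reuses the prefix.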