
Reiner Pope

👤 Speaker
1157 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Yeah, whenever we have balance points, it kind of says that you're getting it exactly right.

And so for the particular context length where the slopes match, that says I am equally memory bound and compute bound, which is a really desirable place to be.
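
One way to picture that balance point is a back-of-envelope decode-step model in which compute time is roughly flat in context length while memory-fetch time grows with it; the context length where the two are equal is the balance. Everything in the sketch below (throughput, bandwidth, parameter count, KV bytes per token, batch size) is an illustrative assumption, not a figure from the episode:

# Back-of-envelope decode-step model (all constants are illustrative assumptions).
# Compute time here ignores attention FLOPs, so it is flat in context length;
# memory-fetch time grows linearly via the KV cache, so the two lines cross.

FLOPS = 1e15                 # assumed accelerator throughput, FLOP/s
BANDWIDTH = 3e12             # assumed HBM bandwidth, bytes/s
PARAMS = 70e9                # assumed parameter count
BYTES_PER_PARAM = 2          # bf16 weights
KV_BYTES_PER_TOKEN = 328e3   # assumed KV-cache bytes per token of context
BATCH = 512                  # assumed decode batch size

def decode_step_times(context_len):
    # Compute: ~2 FLOPs per parameter per generated token, for every sequence in the batch.
    compute_s = 2 * PARAMS * BATCH / FLOPS
    # Memory: fetch the weights once per step, plus every sequence's KV cache.
    memory_s = (PARAMS * BYTES_PER_PARAM
                + KV_BYTES_PER_TOKEN * context_len * BATCH) / BANDWIDTH
    return compute_s, memory_s

for L in (128, 512, 2048, 8192):
    c, m = decode_step_times(L)
    bound = "compute" if c >= m else "memory"
    print(f"context {L:>5,}: compute {c*1e3:6.1f} ms, memory {m*1e3:6.1f} ms -> {bound}-bound")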

That's right.

So that is true as modeled here.

There's a key point here: I'm modeling the memory fetch as linear in context length.

That actually depends on model architecture.

It is true for all of the model architectures with dense attention.
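
To make that linearity concrete, here is a small sketch of the KV-cache fetch under dense attention; the layer count, head counts, head dimension, and precision are assumed (roughly a 70B-class model with grouped-query attention), not values from the episode:

# Under dense attention, every decode step re-reads the entire KV cache, which
# holds a key and a value vector for every prior token, so the bytes fetched
# per step grow linearly with context length.
# All architecture numbers below are illustrative assumptions.

N_LAYERS = 80
N_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128
BYTES = 2           # bf16

KV_BYTES_PER_TOKEN = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES  # keys + values

for context_len in (8_192, 32_768, 131_072):
    gb = KV_BYTES_PER_TOKEN * context_len / 1e9
    print(f"context {context_len:>7,}: ~{gb:5.1f} GB of KV cache fetched per decode step")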

Yeah.

Sparse attention actually scales much better than that.

I'm pretty excited about sparse attention.

It's hard to know what the labs are using.

DeepSeek has published a sparse attention mechanism.

I'll just put in a plug for sparse attention: some of the DeepSeek papers that have published sparse attention end up putting a square root in this term.
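
To see what a square root in that term buys, here is a toy comparison of the context-dependent part of the memory fetch; the per-token KV size is assumed, and modeling the sparse fetch as proportional to the square root of context length is only one reading of the remark, not a description of any specific DeepSeek mechanism:

# Toy comparison of the context-dependent part of the memory fetch:
# dense attention reads KV for all L prior tokens, while a sparse scheme that
# effectively touches ~sqrt(L) of them reads far less at long context.
# The KV size and the sqrt(L) form are illustrative assumptions.

import math

KV_BYTES_PER_TOKEN = 328e3   # assumed KV-cache bytes per token of context

def dense_fetch_gb(L):
    return KV_BYTES_PER_TOKEN * L / 1e9

def sqrt_sparse_fetch_gb(L):
    return KV_BYTES_PER_TOKEN * math.sqrt(L) / 1e9

for L in (10_000, 100_000, 1_000_000):
    d, s = dense_fetch_gb(L), sqrt_sparse_fetch_gb(L)
    print(f"context {L:>9,}: dense ~{d:8.2f} GB, sqrt-sparse ~{s:6.3f} GB ({d/s:,.0f}x less)")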

OK, so far we've looked at the latency.

It's kind of hard to read off cost from this.

So if I think, what does cost mean?

To run this inference, I'm going to use the GPU for a certain number of seconds, like 1 millisecond or 20 milliseconds or something like that.

And I have to pay the rental cost for that time.

So it's $2 an hour per GPU or something like that.

So that's the cost of this inference. But how much value have I gotten? How many tokens have I processed during that inference? That is the batch size. And so what we actually want to plot is the cost versus batch size, which is T over B versus batch size. This is the cost per token.
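
A minimal sketch of that cost-per-token calculation, reusing the $2-an-hour rental figure from the conversation; the step time and batch sizes are assumptions for illustration:

# Cost per token = (accelerator seconds used for the step * rental price) / tokens processed,
# i.e. the T-over-B quantity above. Step time and batch sizes are assumed.

GPU_DOLLARS_PER_HOUR = 2.00
DOLLARS_PER_SECOND = GPU_DOLLARS_PER_HOUR / 3600

def cost_per_token(step_time_s, batch_size, n_gpus=1):
    return DOLLARS_PER_SECOND * n_gpus * step_time_s / batch_size

STEP_TIME_S = 0.020   # assume a 20 ms decode step, independent of batch, for illustration
for batch in (1, 16, 256):
    print(f"batch {batch:>3}: ${cost_per_token(STEP_TIME_S, batch):.2e} per token")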