Reiner Pope
Yeah, whenever we have balance points, it kind of says that you're getting it exactly right.
And so for the particular context length where the slopes match, that says I am equally memory bound and compute bound, which is a really desirable place to be.
That's right.
So that is true as modeled here.
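To make the balance point concrete, here is a minimal roofline-style sketch of a single decode step. Every number in it is an illustrative assumption (roughly H100-class bf16 compute and HBM bandwidth, a hypothetical 70B dense-attention model with grouped-query KV), not a figure from the episode; the step is compute bound when the matmul FLOPs dominate, memory bound when the weight and KV fetch dominates, and balanced where the two times match.

```python
# Assumed hardware and model numbers (illustrative only).
PEAK_FLOPS  = 1.0e15      # accelerator peak, FLOP/s
HBM_BW      = 3.3e12      # memory bandwidth, bytes/s
N_PARAMS    = 70e9        # parameter count
N_LAYERS    = 80
D_MODEL     = 8192
KV_DIM      = 1024        # assumed 8 KV heads x head_dim 128 (grouped-query)
DTYPE_BYTES = 2           # bf16

def decode_step_times(batch, context):
    """Compute time and memory-fetch time for one decode step, in seconds."""
    # Compute: ~2 FLOPs per parameter per token, plus attention FLOPs
    # (scores and weighted values) that grow linearly with context length.
    flops = batch * (2 * N_PARAMS + 4 * N_LAYERS * context * D_MODEL)
    # Memory: fetch all weights once per step, plus every sequence's KV cache,
    # which is the term that grows linearly with context length.
    kv_bytes_per_token = 2 * N_LAYERS * KV_DIM * DTYPE_BYTES
    bytes_moved = N_PARAMS * DTYPE_BYTES + batch * context * kv_bytes_per_token
    return flops / PEAK_FLOPS, bytes_moved / HBM_BW

# The step is compute bound where the first number dominates and memory bound
# where the second does; the context length where they match is the balance point.
for ctx in (256, 1_024, 4_096, 16_384):
    c, m = decode_step_times(batch=512, context=ctx)
    print(f"context={ctx:>6}: compute={c*1e3:6.1f} ms, memory-fetch={m*1e3:6.1f} ms")
```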
There's a key point here: I'm modeling the memory fetch as linear in context length.
That actually depends on model architecture.
It is true for all of the model architectures with dense attention.
Yeah.
But sparse attention actually scales much better than that.
I'm pretty excited about sparse attention.
It's hard to know what the labs are using.
DeepSeek has published a sparse attention mechanism.
I'll just put in a plug for sparse attention: some of the DeepSeek papers that have published sparse attention end up putting a square root in this term.
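To make that scaling contrast concrete, here is a small sketch reusing the assumed KV-cache footprint from the sketch above. The square-root term is only a stand-in for the behaviour described, not DeepSeek's actual selection mechanism.

```python
import math

# Assumed KV-cache bytes per token: K and V, layers, kv_dim, bf16 (illustrative).
KV_BYTES_PER_TOKEN = 2 * 80 * 1024 * 2

def kv_fetch_dense(context):
    # Dense attention reads the whole KV cache every decode step: linear in context.
    return context * KV_BYTES_PER_TOKEN

def kv_fetch_sparse(context):
    # A mechanism that attends to roughly sqrt(context) tokens reads far less.
    return math.isqrt(context) * KV_BYTES_PER_TOKEN

for ctx in (1_000, 16_000, 256_000):
    print(f"context={ctx:>7}: dense={kv_fetch_dense(ctx)/1e9:7.2f} GB/step, "
          f"sparse~{kv_fetch_sparse(ctx)/1e9:5.2f} GB/step")
```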
OK, so far we've looked at the latency.
It's kind of hard to read off cost from this.
So if I think, what does cost mean?
To run this inference, I'm going to use the GPU for a certain number of seconds, like 1 millisecond or 20 milliseconds or something like that.
And I have to pay the rental price for that time.
So it's $2 an hour per GPU or something like that.
So that's the cost of this inference. But how much value have I gotten, how many tokens have I processed during that inference? That is the batch size. And so what we actually want to plot is the cost versus batch size, which is T over B versus batch size. This is the cost per token.
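As a minimal illustration of that arithmetic, here is a sketch with an assumed $2 per GPU-hour rental price and a hypothetical 20 ms decode step.

```python
# Cost accounting as just described, with assumed numbers: the inference
# occupies the GPU for T seconds, you pay the rental rate for that time,
# and you spread the bill over the B tokens in the batch.
GPU_PRICE_PER_HOUR = 2.00        # assumed $/GPU-hour

def cost_per_token(step_latency_s, batch_size, n_gpus=1):
    """Dollar cost per token for one decode step: (T * price) / B."""
    dollars = step_latency_s * n_gpus * GPU_PRICE_PER_HOUR / 3600
    return dollars / batch_size

# Larger batches amortize the same GPU-seconds over more tokens, though in
# practice the step latency T itself also grows with batch size.
for b in (1, 16, 256):
    print(f"batch={b:>3}: ${cost_per_token(0.020, b):.8f} per token")
```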