Reiner Pope

Reiner Pope – The math behind how LLMs are trained and served

I mean, sparse attention gives you an out for sure, because you get this square root; it gives you a big improvement.

Dwarkesh Podcast

But I think it's like, if you look at the history of context lengths of models,

From earlier models like GPT-3, maybe to GPT-4, I don't remember when the transition happened exactly.

They shot up from about 8K to 100K, 200K.

And then for the last year or two, they've all been hovering around there.

I think that actually indicates that that's sort of the reasonably balanced cost point.

And going massively beyond that would be cost prohibitive.

Because of the memory bandwidth cost, yeah.
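
As a rough illustration of the memory bandwidth cost being discussed, the sketch below estimates how many bytes of KV cache a dense-attention model has to read from HBM for each decoded token, and how long that read takes at a given bandwidth. The model shape and the bandwidth figure are made-up assumptions for illustration, not numbers from the conversation.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache held for one sequence (keys + values, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len


def decode_read_seconds(context_len, hbm_bytes_per_sec, **shape):
    """Lower bound on per-token decode time if the whole cache is read once."""
    return kv_cache_bytes(context_len, **shape) / hbm_bytes_per_sec


# Hypothetical model shape and HBM bandwidth (assumptions, not from the source).
SHAPE = dict(n_layers=64, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
HBM_BW = 3e12  # bytes per second

for n in (8_000, 100_000, 1_000_000):
    gb = kv_cache_bytes(n, **SHAPE) / 1e9
    ms = decode_read_seconds(n, HBM_BW, **SHAPE) * 1e3
    print(f"{n:>9} tokens: {gb:7.2f} GB of KV cache, {ms:6.3f} ms to read")
```

The point of the sketch is only that the bytes read grow linearly with context length while the bandwidth stays fixed, so the per-token cost scales with the context.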

So I actually don't see a very good path to solving that.

HBM is where it is.

It's not getting hugely better.

Sparse attention is a big improvement.

Maybe that is priced in already, perhaps.

It's not an infinite improvement because if you go too sparse, you lose too much quality.

But yeah, I mean, the empirical result is that context lengths haven't been increasing that much.

And I think it's because there is no solution to the memory wall here.

Interesting.

So going too sparse just means you're attending to a very small subset of the tokens and the quality will get worse.
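
To make the "very small subset" concrete, here is a minimal sketch that compares how many cached tokens one decode step attends to under dense attention and under a sparse scheme that attends to roughly the square root of the context. Reading the "square root" remark this way is an assumption, and the function names are hypothetical.

```python
import math


def tokens_attended_dense(context_len):
    """Dense attention reads every cached token."""
    return context_len


def tokens_attended_sqrt_sparse(context_len):
    """Assumed sparse scheme: attend to about sqrt(context_len) tokens."""
    return math.isqrt(context_len)  # integer floor of the square root


for n in (8_000, 100_000, 1_000_000):
    dense = tokens_attended_dense(n)
    sparse = tokens_attended_sqrt_sparse(n)
    print(f"{n:>9} context: dense {dense:>9}, sparse {sparse:>5}, "
          f"fraction attended {sparse / n:.4%}")
```

Under this assumption the attended fraction shrinks as the context grows, which is the trade-off being described: the memory traffic drops sharply, but each step sees a smaller share of the tokens.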

So what is the cost of these different ways of producing, resynthesizing the KV cache?

Computing it from scratch is basically GPU time.
