Reiner Pope
And so what is the minimum work you can do per batch after amortizing everything else away?
You can just solve for that, actually.
And it's not even particularly sensitive to model architecture.
So let's go ahead and do that.
So what we're talking about is we're going to say when the memory time is equal to the compute time.
That's what that question is.
For now, because we're focused on what the batch size is, and really the question is when the weights are amortized over the multiplies, I'm going to focus on comparing the weight-fetch time to the weight-multiply time.
I'm going to disregard the KV-fetch term just to simplify the analysis, so we can get a clean answer out.
So we're going to equate these two times.
So writing that out, we get the number of total parameters over the memory bandwidth is equal to the batch size times the number of active parameters divided by the compute performance.
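Written out as an equation (a sketch of the balance being described, where N_total is the total parameter count, N_active the active parameter count, B the batch size, BW the memory bandwidth, and C the compute performance):

```latex
\underbrace{\frac{N_{\text{total}}}{BW}}_{\text{weight-fetch time}}
\;=\;
\underbrace{\frac{B \cdot N_{\text{active}}}{C}}_{\text{weight-multiply time}}
```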
So looking over here, everything on the top, these are model parameters.
Everything on the bottom, these are hardware parameters.
It turns out to be nice to rearrange them such that we have the hardware parameters on one side.
So this is equivalent to the compute performance divided by the memory bandwidth being equal to the batch size times the number of active parameters divided by the number of total parameters.
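In symbols (same sketch as before: B is batch size, N_total and N_active the total and active parameter counts, BW the memory bandwidth, C the compute performance), the rearrangement puts the hardware terms on the left, and solving for B gives the critical batch size:

```latex
\frac{C}{BW} \;=\; \frac{B \cdot N_{\text{active}}}{N_{\text{total}}}
\qquad\Longrightarrow\qquad
B \;=\; \frac{C}{BW} \cdot \frac{N_{\text{total}}}{N_{\text{active}}}
```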
So this is a hardware parameter.
Actually, this ends up being a dimensionless constant.
If you look in terms of flops, what are the dimensions of this?
This is multiplies per second.
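The solve-for-the-batch-size step can be sketched in a few lines. The hardware numbers below are purely illustrative assumptions (not measurements of any particular chip), and compute is measured in multiplies per second and bandwidth in parameters per second so the ratio comes out dimensionless, as described above:

```python
def critical_batch_size(mults_per_s, params_per_s, n_total, n_active):
    """Batch size B at which weight-fetch time equals weight-multiply time:
    n_total / params_per_s == B * n_active / mults_per_s."""
    return (mults_per_s / params_per_s) * (n_total / n_active)

# Dense model: every parameter is active, so the model terms cancel and
# B* is just the hardware ratio -- insensitive to model architecture.
b_dense = critical_batch_size(mults_per_s=500e12, params_per_s=1.5e12,
                              n_total=70e9, n_active=70e9)

# Sparse (MoE-style) model with 1/8 of parameters active per token:
# the critical batch size grows by the sparsity factor, 8x here.
b_sparse = critical_batch_size(mults_per_s=500e12, params_per_s=1.5e12,
                               n_total=70e9, n_active=70e9 / 8)
```

Note that for the dense case the parameter counts cancel entirely, which matches the earlier point that the answer isn't particularly sensitive to model architecture.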