Reiner Pope

👤 Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

…is equal to scale-up size times memory bandwidth per GPU. The per-GPU bandwidth term doesn't increase a lot, maybe 1.5x or 2x per generation, but the scale-up size increased by something like a factor of eight. So the reason the bigger scale-up matters is not the memory capacity of the whole scale-up domain, but really the memory bandwidth. Pipelining totally solves the capacity problem, but scale-up size helps solve the bandwidth problem.
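
A back-of-the-envelope version of that formula, as a rough sketch. The hardware numbers below are approximate public specs chosen for illustration, not figures from the episode:

# Aggregate scale-up memory bandwidth = scale-up size x memory bandwidth per GPU.
def aggregate_bandwidth_tbps(scale_up_size: int, bw_per_gpu_tbps: float) -> float:
    return scale_up_size * bw_per_gpu_tbps

# Approximate specs: 8x H100 (HGX box, ~3.35 TB/s HBM3 each)
# vs. 72x B200 (NVL72 rack, ~8 TB/s HBM3e each).
hgx_h100 = aggregate_bandwidth_tbps(8, 3.35)  # ~27 TB/s
nvl72 = aggregate_bandwidth_tbps(72, 8.0)     # ~576 TB/s

print(f"per-GPU bandwidth grew ~{8.0 / 3.35:.1f}x per generation")  # ~2.4x
print(f"scale-up size grew {72 // 8}x")                             # 9x
print(f"aggregate bandwidth grew ~{nvl72 / hgx_h100:.0f}x")         # ~21x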

Yeah. It lets you just run the model at lower latency, as a first thing. If I just do a very sparse model and it's on a little H100 box, the latency will be really high.

Yeah.
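
To make the latency point concrete, here is a toy lower bound for batch-1 decoding: per-token latency is at least the bytes of active weights streamed from HBM divided by aggregate memory bandwidth. This sketch ignores compute, interconnect, and KV-cache traffic, and the 40B-active-parameter model is a made-up example:

# Toy decode-latency floor: each generated token must stream the active weights
# from HBM, so latency >= active_bytes / aggregate memory bandwidth.
def decode_latency_ms(active_params_billions: float, bytes_per_param: int,
                      num_gpus: int, bw_per_gpu_tbps: float) -> float:
    active_bytes = active_params_billions * 1e9 * bytes_per_param
    aggregate_bw = num_gpus * bw_per_gpu_tbps * 1e12  # bytes/second
    return 1e3 * active_bytes / aggregate_bw

# Hypothetical sparse model, 40B active parameters in bf16 (2 bytes/param):
print(decode_latency_ms(40, 2, num_gpus=8, bw_per_gpu_tbps=3.35))  # ~3.0 ms/token on an 8x H100 box
print(decode_latency_ms(40, 2, num_gpus=72, bw_per_gpu_tbps=8.0))  # ~0.14 ms/token on an NVL72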

This is a place where we have to do a bit of guesswork, because the updated scaling laws and the model traffic figures are not reported. And so we have to guess there.

But one way to look at it: let me first make a general heuristic claim. Suppose I have a total cost which is the sum of two costs, A and B, where maybe A is the training cost and B is the inference cost, and I want to minimize this sum.

For many curves that end up being the case, the minimum tends to be where the two costs are equalized. That's something of a heuristic claim, but there are many examples where it's true: where one cost is 1/x and the other is x, for example, the sum is minimized at the point where they equal each other. It's also true for e^x and e^(-x), and all kinds of other things. So basically, I've got some curve that's going down and some other curve that's going up, and the sum tends to be minimized at the point where they're equal.
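
A quick worked version of that claim for the 1/x-plus-x style example (my sketch, not from the episode):

\[
f(x) = \frac{a}{x} + bx, \qquad f'(x) = -\frac{a}{x^2} + b = 0 \;\Longrightarrow\; x^* = \sqrt{a/b},
\]
\[
\frac{a}{x^*} = \sqrt{ab} = b\,x^*,
\]

so the two terms are exactly equal at the minimum. Likewise, e^x + e^{-x} has derivative zero at x = 0, where both terms equal 1.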

Heuristically, I will conjecture that that is true for the setup you described as well. Actually showing it would require looking at the scaling laws and fitting these weird exponents. But things that do follow power laws tend to have this property.
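
For general power laws the heuristic holds up to a constant factor (again my sketch):

\[
f(x) = A x^{-p} + B x^{q}, \qquad f'(x) = 0 \;\Longrightarrow\; p\,A x^{-p} = q\,B x^{q},
\]

so at the minimum the two terms differ only by the fixed ratio q/p. Setting the costs equal therefore lands within a constant factor of the true optimum, whatever the fitted exponents turn out to be.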