Reiner Pope

Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Let's do that.

Okay.

So this is the inference diagram, and I'll call this "forward," so we don't have the wrong thing showing up there.

So let's do the same thing for training now.

We've got a forwards pass, but at some stage, we're going to have to transition to a backwards pass.

So we'll do some number of batches in the forwards pass.

And then we're going to transition to the backwards pass for everyone all in one go.

So the inference part is the same here, but then we do a hard stop at this point and then transition everyone to the backwards pass.
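To make this concrete, here is a minimal sketch (not from the talk; the stage count, microbatch count, and the assumption of equal forward and backward times are all illustrative) of the flush-style schedule being described: every stage runs all of its forward microbatches, hits the hard stop, then runs all of its backwards, idling in between.

```python
# Toy timeline for the flush-style pipeline schedule described above.
# P pipeline stages, M microbatches; "." marks an idle slot (the bubble).
# Assumes a forward and a backward each take one time unit (illustrative).

P, M = 4, 8

def naive_schedule(stage: int) -> list[str]:
    """One stage's timeline: all forwards, hard stop, then all backwards."""
    timeline = ["."] * stage                    # wait for forwards to reach us
    timeline += [f"F{i}" for i in range(M)]     # all forward microbatches
    timeline += ["."] * (2 * (P - 1 - stage))   # hard stop: wait for backwards
    timeline += [f"B{i}" for i in range(M)]     # all backward microbatches
    timeline += ["."] * stage                   # trailing idle
    return timeline

for s in range(P):
    print(f"stage {s}: " + " ".join(f"{t:>2}" for t in naive_schedule(s)))
```

Each stage's row is 2(M + P - 1) slots long, and the dots are the idle time that comes up again below.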

Similar numbering, like this.

Yeah, I mean, "smaller is always better" is actually one way to put it.

But from an ML convergence rate perspective, smaller is always better, because you're getting the freshest information from gradient descent.

But from a total training time perspective?

From a total training time perspective, smaller is worse, from a systems perspective, and so the optimum is the trade-off between those two.
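As a toy illustration of that trade-off (the functional form and all numbers here are my assumptions, not from the talk): suppose the number of optimizer steps to reach a target loss grows as the batch shrinks, in the spirit of the critical-batch-size picture, while each step carries a fixed systems overhead that large batches amortize.

```python
# Toy model of the batch-size trade-off; all numbers are illustrative.
# Convergence side: steps to target ~ S_min * (1 + B_crit / B), so smaller
# batches (fresher gradients) need more steps to the same loss.
# Systems side: time per step = t_fixed + B * t_ex, a fixed overhead per
# step (e.g. the pipeline flush) plus per-example compute.

S_min, B_crit = 10_000, 512   # steps at very large batch; critical batch size
t_fixed, t_ex = 0.5, 0.001    # per-step overhead (s); per-example time (s)

def total_time(B: int) -> float:
    steps = S_min * (1 + B_crit / B)
    return steps * (t_fixed + B * t_ex)

for B in (64, 128, 256, 512, 1024, 2048, 4096):
    print(f"B={B:5d}  total={total_time(B) / 3600:6.2f} h")
```

The curve is U-shaped: tiny batches lose on per-step overhead, huge batches lose on step count, and the optimum sits in between.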

So you pick a batch size, and then for that batch size, you do some amount forwards and then some amount backwards.

Yep.

You asked, why is there even a hard stop?

Pipeline parallelism: because you've got this idle time here, which is the bubble.
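Under the equal forward/backward time assumption from the earlier sketch, the size of that bubble has a standard closed form: with P stages and M microbatches, each stage idles 2(P - 1) of its 2(M + P - 1) slots.

```python
# Bubble fraction of the flush-style schedule: idle slots / total slots.
# 2*(P-1) idle out of 2*(M+P-1) total per stage, so the factors of 2 cancel.

def bubble_fraction(P: int, M: int) -> float:
    return (P - 1) / (M + P - 1)

print(bubble_fraction(P=4, M=8))    # ~0.27
print(bubble_fraction(P=4, M=32))   # ~0.09: more microbatches shrink the bubble
```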

There are so many techniques in the literature for how to lay this out differently and avoid that.

There are more complicated schemes, like zero bubble or one-forward-one-backward (1F1B), which interleave the forwards and the backwards in intricate ways.
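As a rough sketch of the 1F1B idea (simplified; real schedules, and the zero-bubble variants that further split the backward pass, are more involved): each stage warms up with a few forwards, then alternates one forward with one backward, which frees activations early and shrinks the bubble.

```python
# Simplified per-stage 1F1B ("one forward, one backward") schedule.
# Deeper stages need less warmup; steady state alternates F and B,
# then the stage drains the remaining backwards.

def one_f_one_b(stage: int, P: int, M: int) -> list[str]:
    warmup = P - 1 - stage
    ops = [f"F{i}" for i in range(warmup)]   # warmup forwards
    f, b = warmup, 0
    while b < M:                             # steady state + cooldown
        if f < M:
            ops.append(f"F{f}")
            f += 1
        ops.append(f"B{b}")
        b += 1
    return ops

print(one_f_one_b(stage=0, P=4, M=8))
# ['F0', 'F1', 'F2', 'F3', 'B0', 'F4', 'B1', ..., 'B6', 'B7']
```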

You can mine Bitcoin in that.