Dwarkesh Patel

Speaker
15267 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So there's a talk by Ilya where he says, today we know not to do pipeline parallelism.

And Horace He gave my friends and me... I hate that it sounds like a Dr. Seuss quote.

But he gave us a lecture on these different kinds of parallelisms.

And he said the problem with pipeline parallelism is that, other than the bubbles, it creates these architectural constraints on parallelism.

Like Kimi, for example, has these residuals where attention attends to the... a few layers back or something.

Yeah, to layers a few back, and so that becomes hard to implement in this way.

I guess the opposite connotation to this, which actually, before this interview, I was chatting with them,

Axel, who's a GPU performance engineer at Jane Street, he was explaining, well, to do pipeline, you had to do micro-batches rather than full batches.
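
As a rough illustration of why pipelining leans on micro-batches, the idle "bubble" time can be estimated under a simplifying assumption that is not stated in the conversation: a GPipe-style schedule in which every stage takes the same time per micro-batch. The function name `bubble_fraction` and its parameters are hypothetical and used only for this sketch.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of pipeline time spent idle.

    Assumes a GPipe-style schedule with equal-cost stages: the pipeline
    needs (num_microbatches + num_stages - 1) time slots in total, and
    each stage sits idle for (num_stages - 1) of them.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # One full batch pushed through 4 stages leaves most of the pipeline idle.
    print(bubble_fraction(num_stages=4, num_microbatches=1))
    # Splitting the same batch into 16 micro-batches shrinks the bubble.
    print(bubble_fraction(num_stages=4, num_microbatches=16))
```

Under these assumptions, a single full batch keeps only one stage busy at a time, which is why the batch is split into micro-batches in the first place.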

And if you do micro-batches, then you're, by definition, not able to amortize

loading the weights across all the users or all the sequences.

And so the positive connotation of that is you don't have to use the memory.

The negative connotation of that is that we can't amortize loading the weights across all those users.
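
The trade-off described here can be put into back-of-the-envelope numbers. The sketch below assumes a deliberately simplified cost model that is not from the conversation: each micro-batch reads a stage's weights from memory once, and only one micro-batch's activations are live at a time. The function names and the example sizes are hypothetical.

```python
def weight_bytes_read_per_sequence(weight_bytes: float, batch_size: int,
                                   num_microbatches: int) -> float:
    """Weight traffic attributed to each sequence for one forward pass.

    Assumes the weights are read once per micro-batch, so the read is
    shared only among the sequences inside that micro-batch.
    """
    if batch_size % num_microbatches != 0:
        raise ValueError("batch_size must be divisible by num_microbatches")
    microbatch_size = batch_size // num_microbatches
    return weight_bytes / microbatch_size


def live_activation_bytes(activation_bytes_per_sequence: float, batch_size: int,
                          num_microbatches: int) -> float:
    """Activation memory held at once, assuming one live micro-batch."""
    if batch_size % num_microbatches != 0:
        raise ValueError("batch_size must be divisible by num_microbatches")
    microbatch_size = batch_size // num_microbatches
    return activation_bytes_per_sequence * microbatch_size


if __name__ == "__main__":
    weights = 1e9      # hypothetical: 1 GB of weights on this stage
    act_per_seq = 1e6  # hypothetical: 1 MB of activations per sequence
    for m in (1, 8):
        print(m,
              weight_bytes_read_per_sequence(weights, 64, m),
              live_activation_bytes(act_per_seq, 64, m))
```

In this toy model, going from 1 to 8 micro-batches cuts live activation memory by 8x while making each sequence pay for 8x more weight traffic, which is the positive and negative side of the same choice.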

Maybe it's worth explaining why you had to do micro-batches because you can't.

Oh, interesting.

So this is sort of obvious, but the difference between micro-batch and batch doesn't matter at all in inference because...

You can just call it whatever you want, whatever.

Yeah.

It only matters in training because there is an optimal batch size.

Yes.

And before you do the backward step, you want to have accumulated... Before you do a full backward step, you want to have accumulated all the sequences in that batch.
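
The idea of accumulating over all the sequences in a batch before updating can be sketched with gradient accumulation on a toy model. This is a minimal illustration under assumptions that are not from the conversation: a one-parameter model `y = w * x`, a mean-squared-error loss, and hypothetical helper names. It shows that summing micro-batch gradients and normalizing by the full batch size reproduces the full-batch gradient.

```python
from typing import Sequence


def grad_sum(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sum over examples of d/dw (w*x - y)^2 = 2 * (w*x - y) * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def full_batch_grad(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Gradient of the mean loss computed over the whole batch at once."""
    return grad_sum(w, xs, ys) / len(xs)


def accumulated_grad(w: float, xs: Sequence[float], ys: Sequence[float],
                     microbatch_size: int) -> float:
    """Same gradient, built up one micro-batch at a time."""
    total = 0.0
    for start in range(0, len(xs), microbatch_size):
        stop = start + microbatch_size
        total += grad_sum(w, xs[start:stop], ys[start:stop])
    # Normalize by the full batch size only after every micro-batch is in.
    return total / len(xs)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]
    w = 0.5
    print(full_batch_grad(w, xs, ys))
    print(accumulated_grad(w, xs, ys, microbatch_size=2))
    # A single weight update would then use the accumulated gradient.
```

The weight update happens once per full batch, so the micro-batch size changes how the work is scheduled but not the gradient that the update sees.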