Dwarkesh Patel

👤 Speaker

15267 total appearances

Podcast Appearances

Dwarkesh Podcast

Reiner Pope – The math behind how LLMs are trained and served

So there's a talk by Ilya where he says, today we know not to do pipeline parallelism.

3266.761

And Horace He gave my friends and me... I hate that it sounds like a Dr. Seuss quote.

3273.691

But he gave us a lecture on these different kinds of parallelisms.

3283.125

And he said, the problem with pipeline parallelism is that, other than the bubbles, it creates these architectural constraints on parallelism.

3285.649

Like Kimi, for example, has these residuals where attention attends to the... a few layers back or something.

3294.453

Yeah, it attends to layers a few back, and so that becomes hard to implement in this way.

3301.12

I guess the opposite connotation to this, which actually, before this interview, I was chatting with them,

3367.863

Axel, who's a GPU performance engineer at Jane Street, he was explaining, well, to do pipeline, you had to do micro-batches rather than full batches.

3376.955

And if you do micro-batches, then you're, by definition, not able to amortize

3386.035

loading the weights across all the users or all the sequences.

3392.227

And so the positive connotation of that is you don't have to use the memory.

3398.354

The negative connotation of that is that we can't amortize loading the weights across all those users.

3401.998
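
The amortization point above can be made concrete with a rough back-of-the-envelope sketch. It assumes that each pass over the weights streams them from memory once and that this cost is shared equally by every sequence in the pass. The function name, the model size, the bandwidth, and the batch sizes are all hypothetical values chosen for illustration, not figures from the episode.

```python
def weight_load_seconds_per_sequence(weight_bytes, mem_bandwidth_bytes_per_s, sequences_per_pass):
    """Time spent streaming the weights once, split across the sequences sharing that pass."""
    if sequences_per_pass <= 0:
        raise ValueError("sequences_per_pass must be positive")
    return weight_bytes / mem_bandwidth_bytes_per_s / sequences_per_pass


# Hypothetical numbers (assumptions, not from the source):
WEIGHT_BYTES = 100e9   # 100 GB of weights
BANDWIDTH = 2e12       # 2 TB/s memory bandwidth

# One pass over the weights shared by a full batch of 64 sequences...
full = weight_load_seconds_per_sequence(WEIGHT_BYTES, BANDWIDTH, sequences_per_pass=64)
# ...versus one pass per micro-batch of 8 sequences.
micro = weight_load_seconds_per_sequence(WEIGHT_BYTES, BANDWIDTH, sequences_per_pass=8)

print(f"full batch:  {full:.6f} s of weight loading per sequence")
print(f"micro-batch: {micro:.6f} s of weight loading per sequence")
print(f"ratio: {micro / full:.1f}x")
```

Under these assumptions the per-sequence weight-loading cost grows in proportion to how much smaller the micro-batch is than the full batch, which is the "can't amortize" side of the trade-off described above.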

Maybe it's worth explaining why you had to do micro-batches because you can't.

3407.425

Oh, interesting.

3499.242

So this is sort of obvious, but the difference between micro-batch and batch doesn't matter at all in inference because...

3499.663

You can just call it whatever you want.

3508.917

Yeah.

3510.999

It only matters in training because there is an optimal batch size.

3511.64

Yes.

3517.125

And before you do the backward step, you want to have accumulated... Before you do a full backward step, you want to have accumulated all the sequences in that batch.

3517.825
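
One way to read the remark about accumulating all the sequences in a batch is as gradient accumulation: gradients from each micro-batch are summed, and the weights are updated once after the whole batch has been seen. The toy sketch below assumes a single scalar weight and a squared-error loss. The helper names, data, and learning rate are hypothetical and only meant to show that the accumulated gradient matches the full-batch gradient.

```python
def grad_sum(w, xs, ys):
    """Sum (not mean) of d/dw (w*x - y)^2 over the given examples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def full_batch_grad(w, xs, ys):
    """Mean gradient computed over the whole batch in one go."""
    return grad_sum(w, xs, ys) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Mean gradient built by summing per-micro-batch gradients before any update."""
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        end = start + micro_batch_size
        total += grad_sum(w, xs[start:end], ys[start:end])
    return total / len(xs)


# Hypothetical toy data: the target relationship is y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

g_full = full_batch_grad(w, xs, ys)
g_acc = accumulated_grad(w, xs, ys, micro_batch_size=2)

# A single update is applied only after every micro-batch has contributed.
w_next = w - 0.01 * g_acc

print(g_full, g_acc, w_next)
```

In this sketch the weight is held fixed while the micro-batches are processed, so the update behaves as if the full batch had been processed at once.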
