Dwarkesh Patel
So there's a talk by Ilya where he says, today we know not to do pipeline parallelism.
And Horace He gave my friends and me, I hate that it sounds like a Dr. Seuss quote.
But he gave us a lecture on these different kinds of parallelisms.
And he said, the problem with pipeline parallelism is that, other than the bubbles, it creates these architectural constraints on parallelism.
Like Kimi, for example, has these residuals where attention attends to the... A few layers back or something.
Yeah, to layers a few back, and so that becomes hard to implement in this way.
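To put a rough number on the "bubbles" mentioned above, here is a minimal sketch. It assumes a simple GPipe-style schedule in which every stage takes the same time per micro-batch; the function name and the example sizes are illustrative and do not come from the conversation.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of time a pipeline stage sits idle.

    Assumes a simple GPipe-style schedule with equal-cost stages:
    each stage is busy for `num_microbatches` time slots out of
    `num_microbatches + num_stages - 1` total slots.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the trend.
    for m in (1, 4, 16, 64):
        print(f"4 stages, {m:>2} micro-batches: bubble = {bubble_fraction(4, m):.3f}")
```

Under these assumptions, more micro-batches shrink the bubble, which is why pipelining pushes toward splitting the batch.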
I guess the opposite connotation to this, which actually, before this interview, I was chatting about with Axel, who's a GPU performance engineer at Jane Street. He was explaining, well, to do pipeline, you had to do micro-batches rather than full batches.
And if you do micro-batches, then you're, by definition, not able to amortize loading the weights across all the users or all the sequences.
And so the positive connotation of that is you don't have to use the memory.
The negative connotation of that is that we can't amortize loading the weights across all those users.
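The amortization point can be sketched with some back-of-the-envelope arithmetic. This is only an illustration: it assumes a stage's weights are re-read from memory once per micro-batch pass, and the function name and all numbers are hypothetical.

```python
import math


def weight_bytes_per_sequence(weight_bytes: float, batch_size: int,
                              microbatch_size: int) -> float:
    """Weight traffic charged to each sequence in the batch.

    Assumption: the stage's weights are loaded once per micro-batch pass,
    so splitting the batch into more passes means more weight loads.
    """
    passes = math.ceil(batch_size / microbatch_size)
    return weight_bytes * passes / batch_size


if __name__ == "__main__":
    W = 1000.0  # hypothetical weight size, arbitrary units
    B = 64      # hypothetical number of sequences in the full batch
    for mb in (64, 16, 8):
        per_seq = weight_bytes_per_sequence(W, B, mb)
        print(f"micro-batch size {mb:>2}: {per_seq:.3f} units of weight traffic per sequence")
```

With one full batch the weights are loaded once and shared by all 64 sequences; with micro-batches of 8 they are loaded eight times, so each sequence carries eight times the weight traffic.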
Maybe it's worth explaining why you had to do micro-batches because you can't.
Oh, interesting.
So this is sort of obvious, but the difference between micro-batch and batch doesn't matter at all in inference because...
You can just call it whatever you want.
Yeah.
It only matters in training because there is an optimal batch size.
Yes.
And before you do the backward step, you want to have accumulated... Before you do a full backward step, you want to have accumulated all the sequences in that batch.
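The idea of accumulating over the whole batch before taking a step can be sketched with a toy example. It assumes a scalar model `y_hat = w * x` with mean squared error, which is not anything discussed in the conversation; the point is only that summing suitably weighted micro-batch gradients reproduces the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error w.r.t. w for the toy model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, microbatch_size):
    """Accumulate micro-batch gradients into one full-batch gradient."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, microbatch_size):
        mx = xs[start:start + microbatch_size]
        my = ys[start:start + microbatch_size]
        # Weight each micro-batch gradient by its share of the full batch,
        # so the sum equals the mean gradient over all sequences.
        total += grad_mse(w, mx, my) * len(mx) / n
    return total


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]  # toy inputs
    ys = [2.0, 4.0, 6.0, 8.0]  # toy targets
    w = 0.5
    print("full batch gradient:   ", grad_mse(w, xs, ys))
    print("accumulated (size 2):  ", accumulated_grad(w, xs, ys, 2))
```

In this sketch the optimizer step would only happen after `accumulated_grad` has seen every micro-batch, which is the sense in which the batch, not the micro-batch, is the unit that matters in training.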