Dwarkesh Patel
When you're operating within a single scale-up domain, is that a consideration specifically for either forward or backward?
Or specifically for pre-fill versus decode?
Or is it preferred to always be within a scale-up domain, whatever kind of workload you have, whether you're doing a pre-training run, or whether you're doing RL generation, or whether you're doing inference for users?
Can I try to guess?
Just out of curiosity to see if I'm actually understanding.
It seems like you're sending batch size into the rack.
In here?
Yes.
But the communication within a rack is sort of batch size times number of GPUs.
And there's a need to multiply the whole thing by two for the up and down.
And there's a factor of two.
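The guess above can be written out as a back-of-envelope estimate. This is only a sketch of the scaling as stated in the question, not a confirmed cost model: the function name `rack_traffic`, the `bytes_per_item` parameter, and the example figures (a batch of 1024 and 72 GPUs per rack) are illustrative assumptions.

```python
def rack_traffic(batch_size: int, gpus_per_rack: int, bytes_per_item: int = 1):
    """Return (into_rack, within_rack) traffic under the guessed scaling.

    Assumption: traffic entering the rack scales with batch size, while
    traffic inside the rack scales with batch size times the number of GPUs.
    """
    into_rack = batch_size * bytes_per_item
    # Factor of 2: data travels "up" to the other GPUs and back "down".
    within_rack = 2 * batch_size * gpus_per_rack * bytes_per_item
    return into_rack, within_rack


# Hypothetical numbers, chosen only to show the ratio.
into_rack, within_rack = rack_traffic(batch_size=1024, gpus_per_rack=72)
ratio = within_rack / into_rack  # equals 2 * gpus_per_rack
```

Under these assumptions the ratio of in-rack traffic to traffic entering the rack is simply twice the number of GPUs, independent of batch size.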
It's interesting to me that the best parallelism strategy in practice ends up being one which physically resembles the actual architecture.
It's not some galaxy brain thing.
You know, it's like, oh, we have experts.
We're going to put them on different GPUs.
Oh, we have different layers.
We're going to put them on different racks.
I mean, it could have been something wackier with tensor parallelism and whatever.
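The "experts on different GPUs, layers on different racks" mapping can be sketched as a simple placement table. This is a minimal illustration under assumed sizes, not a description of any real system: the function `place` and all of its parameters are hypothetical, and experts and layers are split into contiguous, evenly sized groups purely for simplicity.

```python
def place(num_layers: int, num_experts: int, num_racks: int, gpus_per_rack: int):
    """Map each (layer, expert) pair to a (rack, gpu) pair.

    Layers are split across racks and experts are split across the GPUs
    inside a rack, mirroring the model's own structure.
    """
    layers_per_rack = -(-num_layers // num_racks)      # ceiling division
    experts_per_gpu = -(-num_experts // gpus_per_rack)  # ceiling division
    placement = {}
    for layer in range(num_layers):
        rack = layer // layers_per_rack
        for expert in range(num_experts):
            gpu = expert // experts_per_gpu
            placement[(layer, expert)] = (rack, gpu)
    return placement


# Hypothetical sizes: 8 layers over 2 racks, 16 experts over 4 GPUs per rack.
placement = place(num_layers=8, num_experts=16, num_racks=2, gpus_per_rack=4)
example = placement[(5, 9)]  # layer 5 lands on rack 1, expert 9 on GPU 2
```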