With a small number of pipeline stages, this is not a huge latency impact.
Yeah, if it goes from 20 to 30, right?
Or something like that, yeah.
So just to chart the path that it goes through: here you're going from your GPU or TPU or whatever to a network card, which then goes to a top-of-rack switch, and then hops over to the other rack and does the same thing in reverse.
So you sort of have to sum up the latencies of these different things.
So this is the same thing as the DC switch?
It may, in fact, go up to a DC switch and back.
Depends on deployment configuration.
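To make "sum up the latencies" concrete, here is a minimal Python sketch of that path. The function name and every latency figure are illustrative assumptions for this example, not numbers from the conversation or from any particular hardware.

```python
def cross_rack_latency_us(via_dc_switch: bool = True) -> float:
    """One-way latency of a single cross-rack hop, summing the path
    described above: GPU -> NIC -> top-of-rack switch [-> DC switch]
    -> ToR -> NIC -> GPU. All figures are illustrative placeholders."""
    path = [1.0, 1.0]        # GPU -> NIC, NIC -> top-of-rack switch
    if via_dc_switch:
        path += [2.0, 2.0]   # up to the datacenter switch and back down
    else:
        path += [1.0]        # assumed direct ToR -> ToR link instead
    path += [1.0, 1.0]       # far side: ToR -> NIC, NIC -> GPU
    return sum(path)

# With p pipeline stages, an activation crosses p - 1 stage boundaries,
# so the added end-to-end latency is roughly (p - 1) * per-hop latency.
p = 8
print(cross_rack_latency_us(), "us per hop;",
      (p - 1) * cross_rack_latency_us(), "us added across the pipeline")
```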
Yeah, so we talked about the latency of this hop.
There is also the T_mem latency, the memory time latency, which is actually massively improved by larger scale-up domains.
So...
I'll recall T_mem down here.
T_mem for the weights.
This was equal to the total number of parameters divided by the memory bandwidth.
Which memory bandwidth are we talking about here? Is it just one GPU's?
In fact, it is the number of GPUs that I can use in parallel to load these weights.
So I can't use different pipeline stages in parallel because they're not running at the same time, but I can use all the GPUs in my scale-up domain in parallel to load the weights.
And so this is actually extremely effective.
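As a rough sketch of the formula being written out: if each of the N chips in the scale-up domain streams a disjoint 1/N shard of the weights, their memory bandwidths add, so T_mem shrinks linearly with the domain size. The chip count and bandwidth below are illustrative assumptions, not figures from the conversation.

```python
def t_mem_weights_s(param_bytes: float, n_chips: int,
                    hbm_bw_bytes_per_s: float) -> float:
    """T_mem for the weights: total parameter bytes divided by the
    aggregate memory bandwidth of the scale-up domain. Assumes each
    chip loads a disjoint 1/n_chips shard of the weights in parallel."""
    return param_bytes / (n_chips * hbm_bw_bytes_per_s)

# Illustrative: 70B parameters in bf16 (2 bytes each), 8 chips at an
# assumed 3.35 TB/s of HBM bandwidth each -> about 5.2 ms per weight pass.
print(t_mem_weights_s(70e9 * 2, n_chips=8, hbm_bw_bytes_per_s=3.35e12))
```

Doubling the scale-up domain doubles the aggregate bandwidth and halves this term, which is the "extremely effective" improvement being described.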
So basically I end up with a term here, this memory bandwidth term itself,