Reiner Pope
So this actually gives you a ballpark that is remarkably accurate in practice.
Generally, people will go a little bit larger than this.
They don't really want to be exactly at the balance point, because real-world efficiencies aren't as good as a roofline analysis would say.
But like take this and maybe double it or triple it.
So we solve for the equivalence between when compute time is equal to memory time.
If I add in more memory bandwidth, like something that consumes more memory bandwidth, then I have less available for the weight loads, and so I need to grow the memory bandwidth more, and therefore the batch size more.
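The balance point described above can be sketched numerically. This is a minimal sketch, not the speaker's actual calculation: all the hardware and model numbers below are illustrative assumptions, and the 2-FLOPs-per-parameter-per-token rule of thumb is the standard approximation for a transformer forward pass.

```python
# Hedged sketch: solve for the batch size where per-step compute time
# equals weight-load time. All constants are illustrative assumptions.

PEAK_FLOPS = 1.0e15      # assumed peak compute, FLOP/s
MEM_BW = 3.3e12          # assumed HBM bandwidth, bytes/s
BYTES_PER_PARAM = 2      # fp16/bf16 weights

def critical_batch_size(peak_flops, mem_bw, bytes_per_param):
    """Batch size at which compute time equals weight-load time.

    Compute time per step:  B * 2 * P / peak_flops   (2 FLOPs/param/token)
    Memory time per step:   P * bytes_per_param / mem_bw
    Setting them equal, the parameter count P cancels out.
    """
    return (peak_flops / mem_bw) * bytes_per_param / 2

b_star = critical_batch_size(PEAK_FLOPS, MEM_BW, BYTES_PER_PARAM)
print(f"balance-point batch size ~ {b_star:.0f} sequences")
```

With these assumed numbers the balance point lands around 300 sequences; doubling or tripling it, as suggested above, gives a practical operating batch size. Note that adding KV-cache traffic, as just described, raises the effective memory time and pushes the balance point higher still.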
Yeah, okay.
So keep in mind that I'm talking about the number of sequences that I'm generating one more token for.
So it's actually 2,000 unique sequences.
Okay, we're just talking about...
Yeah.
The way to think about this: we think of it as a "when does the train depart?" model.
So let's say I've picked a batch size that I'm going to run at.
Maybe I pick this batch size.
And by the way, this intersection point is the same intersection point here.
So I picked this batch size.
I know how long it's going to take; for example, something like 20 milliseconds is a common place to end up landing.
What I'm going to produce is a timeline of what is running on the GPU.
It's going to start a new batch every 20 milliseconds, regardless.
And so each tick here is 20 milliseconds: this is 20, this is 40.
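The fixed-tick scheduling just described can be sketched as a tiny simulation. This is an assumed formalization, not the speaker's code: the 20 ms tick matches the example above, and the function name is hypothetical.

```python
# Hedged sketch of the "when does the train depart?" model:
# the GPU starts a new batch on a fixed tick (e.g. every 20 ms),
# regardless of how many requests have arrived by then.

TICK_MS = 20  # assumed per-step time at the chosen batch size

def departure_time_ms(arrival_ms, tick_ms=TICK_MS):
    """A request that arrives at arrival_ms boards the next train:
    the first batch boundary at or after its arrival."""
    ticks_waited = -(-arrival_ms // tick_ms)  # ceiling division
    return ticks_waited * tick_ms

# Requests arriving mid-tick wait for the next boundary.
for t in [5, 19, 20, 33]:
    print(f"arrives at {t} ms -> departs at {departure_time_ms(t)} ms")
```

A request arriving at 5 ms or 19 ms both depart at the 20 ms boundary, while one arriving at 33 ms waits for the 40 ms train, which is why the timeline is just evenly spaced batch starts at 20, 40, and so on.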