Reiner Pope
And so there is some time to fetch all of the parameters, the total number of parameters, not just the active parameters.
So there's weight fetch time.
And then in addition, there's a KV cache fetch time.
So this actually depends on batch size.
So for every element of the batch, we have to fetch an entire context length's worth of tokens, and then there's a size per token.
So like, the bytes for one token.
And so that's a model parameter.
Yeah.
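(To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The specific byte counts and bandwidth figure are illustrative assumptions, not numbers from the conversation.)

```python
# Back-of-the-envelope memory traffic for one decode step, per the discussion above.
# Weights are fetched once per step regardless of batch size; the KV cache is fetched
# once per batch element. All concrete numbers below are illustrative assumptions.

def decode_step_fetch_time(
    total_param_bytes,      # bytes to fetch for ALL weights (total, not just active, params)
    kv_bytes_per_token,     # KV cache bytes per token (a property of the model)
    context_len,            # tokens of context held in the KV cache
    batch_size,             # KV cache is fetched for every element of the batch
    hbm_bandwidth_bytes_s,  # accelerator memory bandwidth
):
    weight_fetch = total_param_bytes
    kv_fetch = batch_size * context_len * kv_bytes_per_token
    return (weight_fetch + kv_fetch) / hbm_bandwidth_bytes_s

# Illustrative example: 70B total params in bf16, 160 KiB of KV per token,
# 4k context, batch 8, ~3 TB/s of memory bandwidth.
t = decode_step_fetch_time(
    total_param_bytes=70e9 * 2,
    kv_bytes_per_token=160 * 1024,
    context_len=4096,
    batch_size=8,
    hbm_bandwidth_bytes_s=3e12,
)
print(f"~{t * 1e3:.1f} ms of memory fetch per decode step")
```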
So when I do a forward pass, let me actually draw how autoregressive inference works.
So this is during decode.
So let's say I have a bunch of tokens of text. I'm drawing a tensor because ultimately the tokens are represented as some tensor in some embedding dimension.
And then in this direction, I have the sequence length.
The work of running a decode is that I have to run each token through a whole bunch of matrix multiplies over a bunch of different layers.
And in general, I'm going to have to do that work over all of these tokens.
But then one step of decode is actually to produce just this one additional token out here.
And so what I'm going to do there is I'm going to run a full forwards pass of multiplying by all of the weight matrices in the entire model.
But then I've got this attention mechanism where this token is sort of looking at all of the past tokens in this way.
And what is it looking at specifically?
It is looking at some internal representation that the model has produced of the tokens.
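(As a rough illustration of that decode step, here is a minimal single-head attention sketch in Python/NumPy, where the "internal representation" of past tokens is the cached keys and values. The shapes, names, and numbers are assumptions for illustration, not the exact architecture being discussed.)

```python
import numpy as np

# Minimal sketch of one autoregressive decode step for a single attention head.
# The new token runs a full forward pass through the weight matrices (w_q, w_k, w_v here),
# but attention only needs the keys and values cached from past tokens, not the raw tokens.

d_model = 64  # illustrative embedding dimension

rng = np.random.default_rng(0)
w_q = rng.standard_normal((d_model, d_model))
w_k = rng.standard_normal((d_model, d_model))
w_v = rng.standard_normal((d_model, d_model))

# KV cache: the internal representations the model already produced for past tokens.
k_cache = np.empty((0, d_model))
v_cache = np.empty((0, d_model))

def decode_step(x_new, k_cache, v_cache):
    """x_new: (d_model,) embedding of the single new token being decoded."""
    q = x_new @ w_q                    # query for the new token only
    k = x_new @ w_k
    v = x_new @ w_v
    k_cache = np.vstack([k_cache, k])  # append this token's K and V to the cache
    v_cache = np.vstack([v_cache, v])
    scores = k_cache @ q / np.sqrt(d_model)   # attend over all past (cached) tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the context
    out = weights @ v_cache                   # weighted mix of cached value vectors
    return out, k_cache, v_cache

# Decode a few tokens one at a time; each step reads the whole cache built so far.
for _ in range(3):
    x_new = rng.standard_normal(d_model)
    out, k_cache, v_cache = decode_step(x_new, k_cache, v_cache)
print("cached tokens:", k_cache.shape[0])  # grows by one per decode step
```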