Reiner Pope
So you said around two kilobytes.
So let's just do a sanity check for what this could be.
There are two mechanisms people use to do attention with a small number of bytes per token.
One is dense attention with a lot of reuse across layers.
So Character AI has a blog post talking about that, alternating long and short context.
And in the Character AI kind of model, which also showed up in the Gemma models, the global context, which is really what we're talking about here, was shared across all the layers.
So to get this two kilobytes, you could get that, for example: a d_head of 128 is typical.
And then the number of bytes is typically the number of attention layers, times two, times d_head, times the number of Q heads.
So this is the number of unique contexts per layer.
Do you share the context across many layers or do you use it only once?
So in Character AI-like models, this number is one.
We said this is 128.
And this is a choice which typically ranges from one, sorry, this is KV heads, I meant.
The difference between a Q head and a KV head is that the KV heads are the heads that are stored in memory; they store the contents of the previous tokens.
The Q heads are the retrieval heads. They're only used temporarily, and they're used by the attending token.
So in this autoregressive context, I've got KV heads associated with all of the context, and then Q heads associated with this new token here.
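As a rough illustration of that distinction, here is a minimal Python sketch that tracks shapes only, with no real attention math. The sizes (8 KV heads, 32 Q heads, d_head of 128) and the names `kv_cache` and `decode_step` are hypothetical, picked just to show that KV entries accumulate across the context while the Q heads exist only for the new token.

```python
# Hypothetical sizes, chosen only for illustration.
N_KV_HEADS, N_Q_HEADS, D_HEAD = 8, 32, 128

kv_cache = []  # one entry per context token; persists across decode steps


def decode_step(cache):
    """Simulate one autoregressive step, tracking shapes only."""
    # The new token's keys and values are appended to the cache and kept.
    cache.append({"k": (N_KV_HEADS, D_HEAD), "v": (N_KV_HEADS, D_HEAD)})
    # Q heads belong to the new (attending) token and are discarded afterwards.
    return (N_Q_HEADS, D_HEAD)


for _ in range(4):
    q_shape = decode_step(kv_cache)

# Elements held in memory grow with the context length...
stored_elements = len(kv_cache) * 2 * N_KV_HEADS * D_HEAD
# ...while the query is a fixed-size temporary for the current token.
transient_elements = q_shape[0] * q_shape[1]
```

Only `stored_elements` counts toward the per-token memory cost being discussed; the Q heads never enter it.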
But this head, the 128.
Oh, this is... This number is actually the same for... Oh, sorry.
This d-head is the dimension of the vector.
And number of kv-heads is typically in the range of 1 to 8.
So, like, it is totally plausible to get this by, for example, having 8 kv-heads and a d-head of 128.
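To make that sanity check concrete, here is a small Python sketch of the arithmetic. The function name `kv_bytes_per_token` is made up for illustration, and the default of one byte per stored element (for example, 8-bit storage) is an assumption; the conversation does not state the element width.

```python
def kv_bytes_per_token(unique_kv_layers, n_kv_heads, d_head, bytes_per_element=1):
    """Bytes of KV cache stored per context token.

    unique_kv_layers: number of layers that keep their own (unshared) context.
    The factor of 2 accounts for storing both keys and values.
    bytes_per_element is an assumption, not something stated in the discussion.
    """
    return unique_kv_layers * 2 * n_kv_heads * d_head * bytes_per_element


# Global context shared across all layers, 8 KV heads, d_head of 128.
shared = kv_bytes_per_token(unique_kv_layers=1, n_kv_heads=8, d_head=128)

# The same budget at 2 bytes per element would need fewer KV heads.
shared_16bit = kv_bytes_per_token(1, 4, 128, bytes_per_element=2)
```

Under the one-byte assumption, the first case comes out to 2048 bytes, in line with the two-kilobyte figure.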