Aman Sanger
Podcast Appearances
For long context with large batch sizes, you're bottlenecked by how quickly you can read those cached keys and values. That's memory bandwidth. So how can we make this faster? We can try to compress the size of these keys and values, and multi-query attention is the most aggressive of these approaches.
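A rough back-of-the-envelope sketch of that bottleneck (my own illustration, not numbers from the conversation): every decoded token has to stream the whole KV cache from memory, so the per-token time is roughly cache size divided by memory bandwidth. All dimensions and the bandwidth figure below are made-up assumptions for a hypothetical 7B-class model.

```python
# Estimate how long it takes just to read the KV cache for one decoded token.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values; fp16 -> 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical model with full multi-head attention (32 KV heads), 32k context, batch 8.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32_000, batch=8)
bandwidth = 2e12  # assume roughly 2 TB/s of HBM bandwidth
print(f"KV cache: {cache / 1e9:.1f} GB")                       # ~134 GB
print(f"read time per token: {cache / bandwidth * 1e3:.1f} ms")  # ~67 ms, pure memory traffic
```

Even with these rough numbers, the read time alone dominates token latency, which is why shrinking the keys and values pays off directly.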
Where normally with multi-head attention you have some number of key/value heads alongside your query heads, multi-query keeps all the query heads and gets rid of all but one of the key/value heads. So there's just one key/value head, shared across all the remaining query heads.
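A minimal numpy sketch of that idea (an illustration, not code from the discussion, and with the causal mask omitted): all query heads attend against a single shared key/value head, so the cached K and V drop from `n_heads` copies to one.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(q, k, v):
    # q: (n_heads, seq, head_dim); k, v: (seq, head_dim) -- one shared KV head
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_heads, seq, seq), k broadcast to every query head
    return softmax(scores) @ v                # (n_heads, seq, head_dim)

n_heads, seq, d = 8, 16, 64
q = np.random.randn(n_heads, seq, d)
k = np.random.randn(seq, d)   # cached once, instead of once per head
v = np.random.randn(seq, d)
print(multi_query_attention(q, k, v).shape)   # (8, 16, 64)
```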
With grouped query, you instead preserve all the query heads, and for your keys and values there are fewer heads, but you're not reducing it all the way down to just one. But anyway, the whole point here is that you're reducing the size of your KV cache.
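Same toy setup for grouped-query attention (again my own sketch, not production code): the number of KV heads sits between 1 (multi-query) and the full query head count, and each group of query heads shares one KV head.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    # q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d), with n_heads % n_kv_heads == 0
    n_heads, n_kv = q.shape[0], k.shape[0]
    group = n_heads // n_kv
    k = np.repeat(k, group, axis=0)            # expand KV heads so each query group sees its shared head
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ v

n_heads, n_kv_heads, seq, d = 8, 2, 16, 64
q = np.random.randn(n_heads, seq, d)
k = np.random.randn(n_kv_heads, seq, d)   # KV cache here is 4x smaller than full multi-head
v = np.random.randn(n_kv_heads, seq, d)
print(grouped_query_attention(q, k, v).shape)   # (8, 16, 64)
```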
Yeah, multi-head latent attention. That's a little more complicated. The way it works is that it compresses the entirety of your keys and values, across all your heads, into one latent vector, which is then expanded at inference time.
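A very rough sketch of that compression idea as described above (not DeepSeek's actual implementation; all names and dimensions here are made up for illustration): only a small per-token latent vector is cached, and per-head keys and values are re-expanded from it at inference time.

```python
import numpy as np

n_heads, d_model, d_head, d_latent, seq = 8, 512, 64, 64, 16
rng = np.random.default_rng(0)

W_down  = rng.standard_normal((d_model, d_latent))           # compress hidden state to latent
W_up_k  = rng.standard_normal((d_latent, n_heads * d_head))  # expand latent to per-head keys
W_up_v  = rng.standard_normal((d_latent, n_heads * d_head))  # expand latent to per-head values

x = rng.standard_normal((seq, d_model))            # token hidden states
latent = x @ W_down                                # (seq, d_latent) -- this is all you cache
k = (latent @ W_up_k).reshape(seq, n_heads, d_head)   # reconstructed at inference time
v = (latent @ W_up_v).reshape(seq, n_heads, d_head)

full_kv = 2 * n_heads * d_head                     # per-token floats cached by plain multi-head attention
print(f"cached floats per token: {d_latent} vs {full_kv} for full multi-head KV")
```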
Ultimately, how does that map to the user experience?

Yeah, the two things it maps to are: one, you can now make your cache a lot larger, because you have less space allocated for the KV cache. You can maybe cache a lot more aggressively, and a lot more things, so you get more cache hits, which are helpful for reducing the time to first token for the reasons described earlier. And the second is that when you start doing inference with more and more requests and larger and larger batch sizes, you don't see much of a slowdown in the speed at which it generates tokens.
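A toy sketch of why more cache space translates into more cache hits (purely illustrative, not how Cursor's or any real serving stack implements prefix caching): if the KV cache for a prompt prefix is already stored, only the new suffix has to be prefilled, which is what cuts time to first token.

```python
cache = {}  # maps a prompt prefix (tuple of token ids) to its precomputed KV state

def tokens_to_prefill(prompt_tokens):
    # Find the longest cached prefix of this prompt; only the remainder needs prefill.
    for cut in range(len(prompt_tokens), 0, -1):
        if tuple(prompt_tokens[:cut]) in cache:
            return prompt_tokens[cut:]      # cache hit: prefill just the tail
    return prompt_tokens                    # cache miss: prefill everything

cache[tuple(range(1000))] = "kv-for-first-1000-tokens"    # placeholder for real KV tensors
print(len(tokens_to_prefill(list(range(1200)))))           # 200 tokens to prefill (hit)
print(len(tokens_to_prefill(list(range(500, 900)))))       # 400 tokens to prefill (no shared prefix)
```

The more entries the cache can hold, the more requests land on the first branch, and the shorter the prefill before the first token streams back.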
Yeah. So basically, the size of your KV cache is the size of all your prompts multiplied by the number of prompts being processed in parallel. So you could increase either of those dimensions, right? The batch size or the size of your prompts, without degrading the latency of generating tokens.
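A sketch of that trade-off with made-up model dimensions: the cache budget is roughly tokens per prompt times prompts in the batch, so shrinking the per-token KV state lets you grow either dimension for the same memory cost.

```python
def cache_gb(n_layers, n_kv_heads, head_dim, tokens_per_prompt, batch, bytes_per_elem=2):
    # 2x for keys and values; fp16 -> 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * tokens_per_prompt * batch * bytes_per_elem / 1e9

# Hypothetical model: full MHA (32 KV heads) vs grouped-query with 8 KV heads.
print(cache_gb(32, 32, 128, tokens_per_prompt=16_000, batch=8))   # ~67 GB baseline
print(cache_gb(32,  8, 128, tokens_per_prompt=16_000, batch=32))  # ~67 GB: same budget, 4x the batch
print(cache_gb(32,  8, 128, tokens_per_prompt=64_000, batch=8))   # ~67 GB: same budget, 4x the context
```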