Aman Sanger
So kind of back in the day, like all the way two years ago, people mainly used multi-head attention. And I think there's been a migration toward more efficient attention schemes like group-query or multi-query attention. And this is really helpful for then, with larger batch sizes, being able to generate the tokens much faster.
The interesting thing here is this has no effect on that time-to-first-token, pre-fill speed. The thing this matters for is generating tokens. And why is that? Because when you're generating tokens, instead of being bottlenecked by doing these super-parallelizable matrix multiplies across all your tokens...
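To make the two phases concrete, here's a minimal sketch (toy dimensions, all assumed for illustration) of why pre-fill parallelizes so well while decoding doesn't:

```python
import torch

d_model, n_prompt = 512, 1024
W_qkv = torch.randn(d_model, 3 * d_model)   # stand-in projection weights

# Pre-fill: one big matmul over all prompt tokens at once -- compute-bound,
# highly parallel across the sequence dimension.
prompt = torch.randn(n_prompt, d_model)
qkv = prompt @ W_qkv                         # (n_prompt, 3*d_model) in one shot

# Decode: one new token per step -- a tiny matmul, so each step is dominated
# by reading the ever-growing KV cache from memory, not by FLOPs.
new_token = torch.randn(1, d_model)
qkv_step = new_token @ W_qkv                 # (1, 3*d_model) per generated token
```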
You're bottlenecked, for long contexts with large batch sizes, by how quickly you can read those cached keys and values. That's memory bandwidth. So how can we make this faster? We can try to compress the size of these keys and values. Multi-query attention is the most aggressive of these.
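A back-of-envelope calculation (all model dimensions here are assumptions, roughly Llama-2-7B-like) shows why the decode step is memory-bandwidth bound:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq * batch * bytes
n_layers, n_kv_heads, head_dim = 32, 32, 128
seq_len, batch, bytes_per = 8192, 32, 2          # fp16

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per
print(f"KV cache: {kv_bytes / 1e9:.1f} GB")      # ~137 GB at this batch/context

# Every decode step has to stream the whole cache through the memory bus.
hbm_bandwidth = 2e12                             # ~2 TB/s, an assumed GPU figure
print(f"min step time: {kv_bytes / hbm_bandwidth * 1e3:.1f} ms/token")
```

Shrinking the keys and values shrinks `kv_bytes`, which directly shrinks that per-token floor.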
Where normally with multi-head attention you have some number of quote-unquote key-value heads and some number of query heads, multi-query just preserves the query heads and gets rid of all the key-value heads but one. So there's only one kind of key-value head, and there are all the remaining query heads.
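Here's a minimal multi-query attention sketch (dimensions assumed for illustration): many query heads, one shared key/value head broadcast across all of them.

```python
import torch

n_q_heads, head_dim, seq = 8, 64, 16
q = torch.randn(n_q_heads, seq, head_dim)        # one set of queries per head
k = torch.randn(1, seq, head_dim)                # ONE key head, shared by all
v = torch.randn(1, seq, head_dim)                # ONE value head, shared by all

# Broadcasting the single K/V head across all query heads: the KV cache is
# n_q_heads times smaller than with full multi-head attention.
scores = (q @ k.transpose(-1, -2)) / head_dim**0.5   # (n_q_heads, seq, seq)
out = scores.softmax(-1) @ v                         # (n_q_heads, seq, head_dim)
```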
With group-query, you instead preserve all the query heads, and then there are fewer heads for the keys and values, but you're not reducing it to just one. But anyways, the whole point here is you're just reducing the size of your KV cache.
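A sketch of the group-query variant (again, assumed dimensions): the number of KV heads sits between 1 (multi-query) and the number of query heads (multi-head), and each group of query heads shares one KV head.

```python
import torch

n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 64, 16
group = n_q_heads // n_kv_heads                   # 4 query heads per KV head

q = torch.randn(n_q_heads, seq, head_dim)
k = torch.randn(n_kv_heads, seq, head_dim)        # only 2 KV heads are cached
v = torch.randn(n_kv_heads, seq, head_dim)

# Expand each KV head across its query group, then do the usual attention math.
k_e = k.repeat_interleave(group, dim=0)           # (n_q_heads, seq, head_dim)
v_e = v.repeat_interleave(group, dim=0)
scores = (q @ k_e.transpose(-1, -2)) / head_dim**0.5
out = scores.softmax(-1) @ v_e                    # same shape as multi-head output
```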
Yeah, multi-head latent attention. That's a little more complicated. And the way that this works is it turns the entirety of your keys and values across all your heads into one latent vector that is then expanded at inference time.
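A heavily simplified sketch of the latent-attention idea (dimensions assumed; real implementations such as DeepSeek's also handle positional encodings separately, which is omitted here). The point is that only the small latent vector gets cached:

```python
import torch

d_model, d_latent, n_heads, head_dim = 512, 64, 8, 64
W_down = torch.randn(d_model, d_latent)                  # compress
W_up_k = torch.randn(d_latent, n_heads * head_dim)       # expand to keys
W_up_v = torch.randn(d_latent, n_heads * head_dim)       # expand to values

h = torch.randn(1, d_model)                              # one token's hidden state
c_kv = h @ W_down                                        # (1, d_latent) -> cached

# At inference time the latent is expanded back into per-head keys and values.
k = (c_kv @ W_up_k).view(n_heads, head_dim)
v = (c_kv @ W_up_v).view(n_heads, head_dim)
# Cache cost per token: d_latent floats instead of 2 * n_heads * head_dim.
```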
What does that mean, ultimately? How does that map to the user experience? Yeah, the two things that it maps to: you can now make your cache a lot larger, because you have less space allocated for the KV cache, so you can maybe cache a lot more aggressively, and a lot more things, so you get more cache hits, which are helpful for reducing the time to first token for the reasons that were kind of described earlier. And then the second being, when you...
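Rough arithmetic (all dimensions assumed for illustration) for that first benefit: with fewer bytes per cached token, the same memory budget holds many more tokens, so more requests hit the cache.

```python
n_layers, n_heads, head_dim, bytes_per = 32, 32, 128, 2

per_token_mha = 2 * n_layers * n_heads * head_dim * bytes_per   # all 32 KV heads
per_token_gqa = 2 * n_layers * 8 * head_dim * bytes_per         # 8 KV heads (assumed)
per_token_mla = n_layers * 512 * bytes_per                      # 512-dim latent (assumed)

budget = 40e9                                                   # 40 GB set aside for cache
for name, b in [("MHA", per_token_mha), ("GQA", per_token_gqa), ("MLA", per_token_mla)]:
    print(f"{name}: {b} B/token -> {budget / b / 1e6:.2f}M cacheable tokens")
```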