
Aman Sanger

👤 Person
1050 total appearances

Appearances Over Time

Podcast Appearances

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

For long context with large batch sizes, you're bottlenecked by how quickly you can read those cached keys and values. That's memory bandwidth, and how can we make this faster? We can try to compress the size of these keys and values. Multi-query attention is the most aggressive of these.
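
To make the memory-bandwidth framing concrete, here is a back-of-the-envelope sketch. The model dimensions (48 layers, 48 heads, head size 128, fp16 cache) are illustrative assumptions, not taken from any particular model; it compares the per-token KV-cache footprint of standard multi-head attention against multi-query attention, which keeps a single shared key/value head.

```python
# Hypothetical sketch: per-token KV-cache bytes that must be read on each decode step.
# All dimensions (layers, heads, head_dim) are illustrative assumptions.

def kv_bytes_per_token(num_kv_heads, head_dim, num_layers, bytes_per_elem=2):
    # factor of 2 for keys + values; bytes_per_elem=2 assumes an fp16/bf16 cache
    return 2 * num_kv_heads * head_dim * num_layers * bytes_per_elem

num_layers, n_heads, head_dim = 48, 48, 128

mha = kv_bytes_per_token(n_heads, head_dim, num_layers)  # one KV head per query head
mqa = kv_bytes_per_token(1, head_dim, num_layers)        # a single shared KV head

print(f"MHA: {mha / 1024:.0f} KiB of KV per token")       # 1152 KiB
print(f"MQA: {mqa / 1024:.0f} KiB of KV per token")       # 24 KiB
print(f"reduction: {mha // mqa}x less to read per step")  # 48x
```

The fewer bytes of keys and values there are per token, the less the decoder has to pull from memory for every generated token.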

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Where normally with multi-head attention, you have some number of quote-unquote key-value heads and some number of query heads, multi-query just preserves the query heads and gets rid of all but one of the key-value heads. So there's only one key-value head, and there's all the remaining query heads.
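
As a shape-level illustration (toy dimensions, causal masking omitted, nothing model-specific), a minimal NumPy sketch of multi-query attention: the queries keep all eight heads, while a single key/value head broadcasts across them, so only that one head's keys and values ever need to be cached.

```python
import numpy as np

# Multi-query attention (MQA) with toy shapes; dimensions are made up for illustration.
batch, seq, n_q_heads, head_dim = 2, 16, 8, 64

q = np.random.randn(batch, n_q_heads, seq, head_dim)  # all 8 query heads, as in MHA
k = np.random.randn(batch, 1, seq, head_dim)          # single shared key head (cached)
v = np.random.randn(batch, 1, seq, head_dim)          # single shared value head (cached)

# The lone K/V head broadcasts across all 8 query heads (causal mask omitted here).
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)   # (batch, 8, seq, seq)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)              # softmax over keys
out = weights @ v                                           # (batch, 8, seq, head_dim)

print(out.shape)            # (2, 8, 16, 64): same output shape as full multi-head attention
print(k.nbytes + v.nbytes)  # cached bytes are 1/8 of what 8 K/V heads would need
```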

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

With grouped query, you instead preserve all the query heads, and then there are fewer heads for the keys and values, but you're not reducing it to just one. But anyway, the whole point here is you're just reducing the size of your KV cache.
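
A similar sketch for the grouped variant, again with made-up head counts: the keys and values keep a small number of heads (two here), each shared by a group of query heads, so the cached K/V tensors are a quarter of the full multi-head size.

```python
import numpy as np

# Grouped-query attention (GQA), toy dimensions chosen for illustration only.
batch, seq, head_dim = 2, 16, 64
n_q_heads, n_kv_heads = 8, 2                  # each KV head serves a group of 4 query heads

q = np.random.randn(batch, n_q_heads, seq, head_dim)
k = np.random.randn(batch, n_kv_heads, seq, head_dim)   # only these two heads get cached
v = np.random.randn(batch, n_kv_heads, seq, head_dim)

# At compute time, each KV head is repeated so its group of query heads can attend to it.
group = n_q_heads // n_kv_heads
k_rep = np.repeat(k, group, axis=1)           # (batch, 8, seq, head_dim)
v_rep = np.repeat(v, group, axis=1)

scores = q @ k_rep.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_rep                         # same output shape as full multi-head attention
```

Going from one shared head (multi-query) to a few shared heads trades a somewhat larger cache for behavior closer to full multi-head attention.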

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Yeah, multi-latent. That's a little more complicated. The way that this works is it turns the entirety of your keys and values, across all your heads, into one latent vector that is then expanded at inference time.
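
A very rough sketch of that idea, with invented dimensions and projection names rather than anything from an actual implementation: the per-token keys and values for all heads are compressed into one small latent vector, the latent is what gets cached, and it is projected back up to per-head keys and values when attention is computed.

```python
import numpy as np

# Multi-head latent attention (MLA) idea, very roughly: cache one small latent per token
# and expand it into per-head keys/values when attention is computed.
# d_model, d_latent, head counts, and weight names are illustrative assumptions.
d_model, n_heads, head_dim, d_latent = 1024, 16, 64, 128

rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.02, size=(d_model, d_latent))             # compress
W_up_k = rng.normal(scale=0.02, size=(d_latent, n_heads * head_dim))  # expand to keys
W_up_v = rng.normal(scale=0.02, size=(d_latent, n_heads * head_dim))  # expand to values

hidden = rng.normal(size=(1, d_model))              # one token's hidden state

latent = hidden @ W_down                            # (1, 128): this is all that gets cached
k = (latent @ W_up_k).reshape(n_heads, head_dim)    # reconstructed at inference time
v = (latent @ W_up_v).reshape(n_heads, head_dim)

print("floats cached per token:", latent.size)                  # 128
print("floats for full per-head K+V:", 2 * n_heads * head_dim)  # 2048
```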

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

What I mean is, ultimately, how does that map to the user experience? Yeah, the two things that it maps to are: you can now make your cache a lot larger, because you have less space allocated for the KV cache, so you can maybe cache a lot more aggressively and cache a lot more things. You get more cache hits, which are helpful for reducing the time to first token, for the reasons that were described earlier. And then the second being, when you
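
One way to see the first benefit is a rough capacity calculation. The numbers below (a 40 GiB cache budget, 8,000-token prompts, the same per-token footprints as the earlier sketch) are assumptions for illustration only; the point is that shrinking per-token KV storage lets the same memory hold far more cached prompts, which raises the cache-hit rate and lets the server skip prefill for a cached prefix.

```python
# Back-of-the-envelope sketch; budget, prompt length, and model dims are assumptions.
budget_bytes = 40 * 2**30               # hypothetical 40 GiB set aside for the KV cache
prompt_tokens = 8_000                   # hypothetical average cached prompt length

def prompts_that_fit(kv_bytes_per_token):
    return budget_bytes // (kv_bytes_per_token * prompt_tokens)

mha_per_token = 2 * 48 * 48 * 128 * 2   # (K+V) * layers * KV heads * head_dim * fp16 bytes
mqa_per_token = 2 * 48 * 1 * 128 * 2    # same model, but a single KV head per layer

print("MHA:", prompts_that_fit(mha_per_token), "prompts cached")   # ~4
print("MQA:", prompts_that_fit(mqa_per_token), "prompts cached")   # ~218
```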

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

start doing inference with more and more requests and larger and larger batch sizes, you don't see much of a slowdown in the speed at which it's generating the tokens.
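
A crude way to see why decode speed holds up: each generated token requires reading the whole KV cache from memory, so per-step time is roughly cache bytes divided by memory bandwidth. The sketch below uses assumed numbers (2 TB/s of bandwidth, 8,000-token contexts, the same model dimensions as above) purely to show how that scales with batch size.

```python
# Rough, assumed numbers: how per-step decode time (KV reads only) grows with batch size.
bandwidth = 2e12                        # hypothetical 2 TB/s of memory bandwidth
context_tokens = 8_000                  # hypothetical context length per request

mha_per_token = 2 * 48 * 48 * 128 * 2   # bytes of KV per token, full multi-head
mqa_per_token = 2 * 48 * 1 * 128 * 2    # bytes of KV per token, multi-query

for batch in (1, 16, 64, 256):
    mha_ms = batch * context_tokens * mha_per_token / bandwidth * 1e3
    mqa_ms = batch * context_tokens * mqa_per_token / bandwidth * 1e3
    print(f"batch {batch:>3}: MHA ~{mha_ms:7.1f} ms/step, MQA ~{mqa_ms:5.2f} ms/step")
```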

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Yeah. So basically, the size of your KV cache is the size of all your prompts multiplied by the number of prompts being processed in parallel. So you could increase either of those dimensions, right? The batch size or the size of your prompts, without degrading the latency of generating tokens.
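
That product can be written out directly. Using the same assumed per-token footprint as the earlier sketches, a fixed memory budget buys a fixed number of token slots, which can be spent on longer prompts, on more parallel requests, or some mix of the two:

```python
# Sketch of the trade-off: total KV cache = prompt_tokens * batch_size * bytes_per_token.
# Budget and per-token bytes are assumptions carried over from the earlier sketches.
budget_bytes = 40 * 2**30
mqa_per_token = 2 * 48 * 1 * 128 * 2           # multi-query footprint from above

token_slots = budget_bytes // mqa_per_token    # total tokens the cache can hold (~1.7M)

for prompt_tokens in (4_000, 16_000, 64_000):
    max_batch = token_slots // prompt_tokens
    print(f"{prompt_tokens:>6}-token prompts -> batch size up to {max_batch}")
```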
