Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Pricing

Aman Sanger

๐Ÿ‘ค Speaker
1050 total appearances

Appearances Over Time

Podcast Appearances

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

So kind of back in the day, like all the way two years ago, people mainly use multi-head attention. And I think there's been a migration towards more efficient attention schemes like group query or multi-query attention. And this is really helpful for then with larger batch sizes, being able to generate the tokens much faster.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

So kind of back in the day, like all the way two years ago, people mainly use multi-head attention. And I think there's been a migration towards more efficient attention schemes like group query or multi-query attention. And this is really helpful for then with larger batch sizes, being able to generate the tokens much faster.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

The interesting thing here is this now has no effect on that time to first token pre-fill speed. The thing this matters for is now generating tokens. And why is that? Because when you're generating tokens, instead of... being bottlenecked by doing these super-paralyzable matrix multiplies across all your tokens.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

The interesting thing here is this now has no effect on that time to first token pre-fill speed. The thing this matters for is now generating tokens. And why is that? Because when you're generating tokens, instead of... being bottlenecked by doing these super-paralyzable matrix multiplies across all your tokens.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

The interesting thing here is this now has no effect on that time to first token pre-fill speed. The thing this matters for is now generating tokens. And why is that? Because when you're generating tokens, instead of... being bottlenecked by doing these super-paralyzable matrix multiplies across all your tokens.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

You're bottlenecked by how quickly, for long context with large batch sizes, by how quickly you can read those cache keys and values. That's memory bandwidth, and how can we make this faster? We can try to compress the size of these keys and values. Multi-query attention is the most aggressive of these.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

You're bottlenecked by how quickly, for long context with large batch sizes, by how quickly you can read those cache keys and values. That's memory bandwidth, and how can we make this faster? We can try to compress the size of these keys and values. Multi-query attention is the most aggressive of these.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

You're bottlenecked by how quickly, for long context with large batch sizes, by how quickly you can read those cache keys and values. That's memory bandwidth, and how can we make this faster? We can try to compress the size of these keys and values. Multi-query attention is the most aggressive of these.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

Where normally with multi-head attention, you have some number of quote-unquote attention heads and some number of query heads. Multi-query just preserves the query heads, gets rid of all the key value heads. So there's only one kind of key value head, and there's all the remaining query heads.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

Where normally with multi-head attention, you have some number of quote-unquote attention heads and some number of query heads. Multi-query just preserves the query heads, gets rid of all the key value heads. So there's only one kind of key value head, and there's all the remaining query heads.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

Where normally with multi-head attention, you have some number of quote-unquote attention heads and some number of query heads. Multi-query just preserves the query heads, gets rid of all the key value heads. So there's only one kind of key value head, and there's all the remaining query heads.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

With group query, you instead preserve all the query heads, and then your keys and values are kind of... There are fewer heads for the keys and values, but you're not reducing it to just one. But anyways, the whole point here is you're just reducing the size of your KV cache.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

With group query, you instead preserve all the query heads, and then your keys and values are kind of... There are fewer heads for the keys and values, but you're not reducing it to just one. But anyways, the whole point here is you're just reducing the size of your KV cache.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

With group query, you instead preserve all the query heads, and then your keys and values are kind of... There are fewer heads for the keys and values, but you're not reducing it to just one. But anyways, the whole point here is you're just reducing the size of your KV cache.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

Yeah, multi-latent. That's a little more complicated. And the way that this works is it kind of turns the entirety of your keys and values across all your heads into this kind of one latent vector that is then kind of expanded inference time.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

Yeah, multi-latent. That's a little more complicated. And the way that this works is it kind of turns the entirety of your keys and values across all your heads into this kind of one latent vector that is then kind of expanded inference time.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

Yeah, multi-latent. That's a little more complicated. And the way that this works is it kind of turns the entirety of your keys and values across all your heads into this kind of one latent vector that is then kind of expanded inference time.

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

what i mean ultimately how does that map to the user experience trying to get the yeah the two things that it maps to is you can now make your cash a lot larger because you've less space allocated for the kv cash you can maybe cash a lot more aggressively and a lot more things so you get more cash hits which are helpful for reducing the time to first token for the reasons that were kind of described earlier and then the second being when you

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

what i mean ultimately how does that map to the user experience trying to get the yeah the two things that it maps to is you can now make your cash a lot larger because you've less space allocated for the kv cash you can maybe cash a lot more aggressively and a lot more things so you get more cash hits which are helpful for reducing the time to first token for the reasons that were kind of described earlier and then the second being when you

Lex Fridman Podcast
#446 โ€“ Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

what i mean ultimately how does that map to the user experience trying to get the yeah the two things that it maps to is you can now make your cash a lot larger because you've less space allocated for the kv cash you can maybe cash a lot more aggressively and a lot more things so you get more cash hits which are helpful for reducing the time to first token for the reasons that were kind of described earlier and then the second being when you