Aman Sanger
So kind of back in the day, like all the way two years ago, people mainly used multi-head attention. And I think there's been a migration toward more efficient attention schemes like group-query or multi-query attention. And this is really helpful for then, with larger batch sizes, being able to generate the tokens much faster.
The interesting thing here is this has no effect on that time-to-first-token, pre-fill speed. The thing this matters for is generating tokens. And why is that? Because when you're generating tokens, instead of being bottlenecked by doing these super-parallelizable matrix multiplies across all your tokens...
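To make the two phases concrete, here's a minimal sketch (toy dimensions, all assumed for illustration) of why pre-fill parallelizes so well while decoding doesn't:

```python
import torch

d_model, n_prompt = 512, 1024
W_qkv = torch.randn(d_model, 3 * d_model)   # stand-in projection weights

# Pre-fill: one big matmul over all prompt tokens at once -- compute-bound,
# highly parallel across the sequence dimension.
prompt = torch.randn(n_prompt, d_model)
qkv = prompt @ W_qkv                         # (n_prompt, 3*d_model) in one shot

# Decode: one new token per step -- a tiny matmul, so each step is dominated
# by reading the ever-growing KV cache from memory, not by FLOPs.
new_token = torch.randn(1, d_model)
qkv_step = new_token @ W_qkv                 # (1, 3*d_model) per generated token
```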
You're bottlenecked, for long contexts with large batch sizes, by how quickly you can read those cached keys and values. That's memory bandwidth. So how can we make this faster? We can try to compress the size of these keys and values. Multi-query attention is the most aggressive of these.
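A back-of-envelope calculation (all model dimensions here are assumptions, roughly Llama-2-7B-like) shows why the decode step is memory-bandwidth bound:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq * batch * bytes
n_layers, n_kv_heads, head_dim = 32, 32, 128
seq_len, batch, bytes_per = 8192, 32, 2          # fp16

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per
print(f"KV cache: {kv_bytes / 1e9:.1f} GB")      # ~137 GB at this batch/context

# Every decode step has to stream the whole cache through the memory bus.
hbm_bandwidth = 2e12                             # ~2 TB/s, an assumed GPU figure
print(f"min step time: {kv_bytes / hbm_bandwidth * 1e3:.1f} ms/token")
```

Shrinking the keys and values shrinks `kv_bytes`, which directly shrinks that per-token floor.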
Where normally with multi-head attention you have some number of quote-unquote key-value heads and some number of query heads, multi-query just preserves the query heads and gets rid of all the key-value heads but one. So there's only one kind of key-value head, and there are all the remaining query heads.
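Here's a minimal multi-query attention sketch (dimensions assumed for illustration): many query heads, one shared key/value head broadcast across all of them.

```python
import torch

n_q_heads, head_dim, seq = 8, 64, 16
q = torch.randn(n_q_heads, seq, head_dim)        # one set of queries per head
k = torch.randn(1, seq, head_dim)                # ONE key head, shared by all
v = torch.randn(1, seq, head_dim)                # ONE value head, shared by all

# Broadcasting the single K/V head across all query heads: the KV cache is
# n_q_heads times smaller than with full multi-head attention.
scores = (q @ k.transpose(-1, -2)) / head_dim**0.5   # (n_q_heads, seq, seq)
out = scores.softmax(-1) @ v                         # (n_q_heads, seq, head_dim)
```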
With group-query, you instead preserve all the query heads, and then there are fewer heads for the keys and values, but you're not reducing it to just one. But anyways, the whole point here is you're just reducing the size of your KV cache.
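A sketch of the group-query variant (again, assumed dimensions): the number of KV heads sits between 1 (multi-query) and the number of query heads (multi-head), and each group of query heads shares one KV head.

```python
import torch

n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 64, 16
group = n_q_heads // n_kv_heads                   # 4 query heads per KV head

q = torch.randn(n_q_heads, seq, head_dim)
k = torch.randn(n_kv_heads, seq, head_dim)        # only 2 KV heads are cached
v = torch.randn(n_kv_heads, seq, head_dim)

# Expand each KV head across its query group, then do the usual attention math.
k_e = k.repeat_interleave(group, dim=0)           # (n_q_heads, seq, head_dim)
v_e = v.repeat_interleave(group, dim=0)
scores = (q @ k_e.transpose(-1, -2)) / head_dim**0.5
out = scores.softmax(-1) @ v_e                    # same shape as multi-head output
```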
Yeah, multi-head latent attention. That's a little more complicated. And the way that this works is it turns the entirety of your keys and values across all your heads into one latent vector that is then expanded at inference time.
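A heavily simplified sketch of the latent-attention idea (dimensions assumed; real implementations such as DeepSeek's also handle positional encodings separately, which is omitted here). The point is that only the small latent vector gets cached:

```python
import torch

d_model, d_latent, n_heads, head_dim = 512, 64, 8, 64
W_down = torch.randn(d_model, d_latent)                  # compress
W_up_k = torch.randn(d_latent, n_heads * head_dim)       # expand to keys
W_up_v = torch.randn(d_latent, n_heads * head_dim)       # expand to values

h = torch.randn(1, d_model)                              # one token's hidden state
c_kv = h @ W_down                                        # (1, d_latent) -> cached

# At inference time the latent is expanded back into per-head keys and values.
k = (c_kv @ W_up_k).view(n_heads, head_dim)
v = (c_kv @ W_up_v).view(n_heads, head_dim)
# Cache cost per token: d_latent floats instead of 2 * n_heads * head_dim.
```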
What does that mean, ultimately? How does that map to the user experience? Yeah, the two things that it maps to: you can now make your cache a lot larger, because you have less space allocated for the KV cache, so you can maybe cache a lot more aggressively, and a lot more things, so you get more cache hits, which are helpful for reducing the time to first token for the reasons that were kind of described earlier. And then the second being, when you...
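Rough arithmetic (all dimensions assumed for illustration) for that first benefit: with fewer bytes per cached token, the same memory budget holds many more tokens, so more requests hit the cache.

```python
n_layers, n_heads, head_dim, bytes_per = 32, 32, 128, 2

per_token_mha = 2 * n_layers * n_heads * head_dim * bytes_per   # all 32 KV heads
per_token_gqa = 2 * n_layers * 8 * head_dim * bytes_per         # 8 KV heads (assumed)
per_token_mla = n_layers * 512 * bytes_per                      # 512-dim latent (assumed)

budget = 40e9                                                   # 40 GB set aside for cache
for name, b in [("MHA", per_token_mha), ("GQA", per_token_gqa), ("MLA", per_token_mla)]:
    print(f"{name}: {b} B/token -> {budget / b / 1e6:.2f}M cacheable tokens")
```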