
Aman Sanger

👤 Speaker
1050 total appearances

Podcast Appearances

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

And generally the way attention works is you have, at your current token, some query, and then you have all the keys and values of all your previous tokens, which are some kind of representation that the model stores internally of all the previous tokens in the prompt. And by default, when you're doing a chat, the model has to, for every single token, do this forward pass through the entire model.
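
A minimal NumPy sketch of that picture (shapes and names are illustrative, not anyone's production code): the current token's query is scored against the cached keys of every previous token, and the resulting softmax weights mix their values.

```python
import numpy as np

def attend(query, keys, values):
    """Scaled dot-product attention for a single current-token query
    over the cached keys/values of all previous tokens."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)   # one score per previous token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over previous tokens
    return weights @ values              # weighted sum of cached values

# toy shapes: 5 cached tokens, head dimension 8
keys, values = np.random.randn(5, 8), np.random.randn(5, 8)
query = np.random.randn(8)
out = attend(query, keys, values)        # (8,) attention output
```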

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

That's a lot of matrix multiplies that happen, and that is really, really slow. Instead, if you have already done that, and you stored the keys and values, and you keep that in the GPU...

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Then, let's say I have stored the keys and values for the last n tokens. If I now want to compute the output for the n-plus-one token, I don't need to pass those first n tokens through the entire model, because I already have all those keys and values. And so you just need to do the forward pass through that last token.
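
A hedged sketch of that incremental decoding loop, again in NumPy with a single toy "layer" (Wq, Wk, Wv are hypothetical stand-ins for a real model's many layers and heads): each step runs only the newest token through the layer and appends its key and value to the cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# hypothetical per-layer projections; a real model has many layers and heads
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []   # keys/values of the first n tokens, kept around

def decode_step(x):
    """Run only the new token through the layer, reusing the cache."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k); v_cache.append(v)        # extend cache by one entry
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V                                 # output for token n+1 only

for _ in range(4):                               # each step is one token's
    out = decode_step(rng.standard_normal(d))    # forward pass, not n of them
```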

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

And then when you're doing attention, you're reusing those keys and values that have been computed, which is kind of the only sequentially dependent part of the transformer.
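
To see why attention is the lone cross-token dependency, a toy contrast (illustrative shapes): the MLP half of a transformer block transforms each token's vector independently, so only the attention step ever needs other tokens' cached state.

```python
import numpy as np

X = np.random.randn(6, 8)   # 6 tokens, model dimension 8
W1, W2 = np.random.randn(8, 32), np.random.randn(32, 8)

# The MLP acts position-wise: row i of the output depends only on row i
# of the input, so it never reads any other token's state.
mlp_out = np.maximum(X @ W1, 0) @ W2

# Attention is the one place rows interact: token i reads every j <= i,
# which is exactly why caching those keys/values pays off.
```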

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Yeah, there's other types of caching you can do. One interesting thing that you can do for Cursor Tab is you can basically predict ahead, as if the user had accepted the suggestion, and then trigger another request. And so then you've cached it, you've done this speculatively. It's a mix of speculation and caching, right?

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Because you're speculating what would happen if they accepted it. And then you have this value that is cached, this suggestion. And then when they press tab, the next one would be waiting for them immediately. It's a kind of clever heuristic slash trick that uses a higher level of caching, and it feels fast despite there not actually being any changes in the model.
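
A rough Python sketch of that speculate-and-cache loop; this is a guess at the shape of the idea, not Cursor's implementation, and model_suggest is a hypothetical stand-in for the real completion model.

```python
# Cache of prefetched suggestions, keyed by the buffer state the user
# would reach by accepting the current suggestion.
cache: dict[str, str] = {}

def model_suggest(buffer: str) -> str:
    """Hypothetical stand-in for the real completion model."""
    return " next_edit"

def show_suggestion(buffer: str) -> str:
    # use the prefetched result if we speculated correctly
    suggestion = cache.pop(buffer, None) or model_suggest(buffer)
    speculative = buffer + suggestion                 # pretend the user accepts
    cache[speculative] = model_suggest(speculative)   # prefetch the follow-up
    return suggestion

def on_tab(buffer: str, suggestion: str) -> str:
    accepted = buffer + suggestion
    # the next suggestion was already prefetched, so it appears immediately
    return show_suggestion(accepted)
```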

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Yeah, it is a little different than speed. But, I mean, technically you tie it back in, because you can get away with the smaller model if you RL your smaller model and it gets the same performance as the bigger one. And while I was mentioning stuff about reducing the size of your KV cache, there are other techniques there as well that are really helpful for speed.
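
For a sense of why KV-cache size matters for speed, a back-of-the-envelope calculation (all numbers illustrative, not from the episode): every token stores one key and one value per layer per head.

```python
# Rough KV-cache size arithmetic for a hypothetical model
layers, kv_heads, head_dim = 32, 32, 128
bytes_per = 2                                              # fp16
per_token = 2 * layers * kv_heads * head_dim * bytes_per   # key + value
print(per_token / 1e6, "MB per token")                     # ~0.5 MB
print(per_token * 8192 / 1e9, "GB for an 8k-token prompt") # ~4.3 GB
```

At that scale, the cache for even a modest batch of long prompts rivals the model weights themselves, which is why shrinking it (fewer KV heads, quantization) directly buys batch size and throughput.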

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

So kind of back in the day, like all the way two years ago, people mainly used multi-head attention. And I think there's been a migration toward more efficient attention schemes like grouped-query or multi-query attention. And this is really helpful, with larger batch sizes, for being able to generate the tokens much faster.
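
A small NumPy sketch of the grouped-query idea (shapes illustrative): several query heads share one cached key/value head, so the cache that has to sit in GPU memory shrinks by the grouping factor, which is what frees room for larger batches.

```python
import numpy as np

def gqa_scores(Q, K, n_groups):
    """Grouped-query attention scores: query heads share a smaller set of
    key heads, shrinking the KV cache by n_q_heads / n_groups."""
    n_q_heads, t, d = Q.shape[0], K.shape[1], Q.shape[-1]
    group = n_q_heads // n_groups        # query heads per cached KV head
    scores = np.empty((n_q_heads, t))
    for h in range(n_q_heads):
        scores[h] = K[h // group] @ Q[h] / np.sqrt(d)
    return scores

Q = np.random.randn(8, 16)     # 8 query heads, head_dim 16 (one new token)
K = np.random.randn(2, 5, 16)  # only 2 cached key heads for 5 past tokens
s = gqa_scores(Q, K, n_groups=2)   # (8, 5); multi-query is the n_groups=1 case
```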