
Aman Sanger

Speaker
350 total appearances

Podcast Appearances

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

So if you try it in all these benchmarks and things that are in the distribution of the benchmarks they're evaluated on, you know, they'll do really well. But when you push them a little bit outside of that, Sonnet's I think the one that kind of does best at kind of maintaining that same capability.

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Like you kind of have the same capability in the benchmark as when you try to instruct it to do anything with coding.

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Yeah, like in that case, it could be trained on the literal issues or pull requests themselves. And maybe the labs will start to do a better job, or they've already done a good job at decontaminating those things. But they're not going to omit the actual training data of the repository itself. Like these are all some of the most popular Python repositories, like SymPy is one example.

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

I don't think they're going to handicap their models on SymPy and all these popular Python repositories in order to get true evaluation scores in these benchmarks.

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Yeah, with Claude, there's an interesting take I heard where I think AWS has different chips. And I suspect they have slightly different numerics than NVIDIA GPUs. And someone speculated that Claude's degraded performance had to do with maybe using the quantized version that existed on AWS Bedrock versus whatever was running on Anthropic's GPUs.

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

That's amazing. And you can do, like, other fancy things where if you have lots of code blocks from the entire code base, you could use retrieval and things like embedding and re-ranking scores to add priorities for each of these components.
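The idea of attaching priorities to code blocks can be sketched as a budgeted packing step. This is only an illustration: the block names, the scores, and the `pack_context` helper are hypothetical, and in practice the priorities would come from embedding similarity or a re-ranker rather than hand-written numbers.

```python
def pack_context(blocks, token_budget):
    """Greedily keep the highest-priority blocks that fit in the budget.

    blocks: list of (name, priority, token_count) tuples (hypothetical shape).
    """
    chosen, used = [], 0
    for name, priority, tokens in sorted(blocks, key=lambda b: b[1], reverse=True):
        if used + tokens <= token_budget:
            chosen.append(name)
            used += tokens
    return chosen, used


# Made-up priorities standing in for retrieval / re-ranking scores.
blocks = [
    ("current_file", 1.00, 400),
    ("utils.py:parse", 0.82, 300),
    ("README", 0.10, 500),
    ("tests/test_parse.py", 0.67, 250),
]
chosen, used = pack_context(blocks, token_budget=1000)
print(chosen, used)
```

With this budget the low-priority README block is the one that gets dropped.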

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

I think even as the system gets closer to some level of perfection, often when you ask the model for something, not enough intent is conveyed to know what to do. And there are a few ways to resolve that intent. One is the simple thing of having the model just ask you: "I'm not sure how to do these parts based on your query. Could you clarify that?" I think the other could be maybe...

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

If there are five or six possible generations given the uncertainty present in your query so far, why don't we just actually show you all of those and let you pick them?

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Yeah, I mean, so we can go over a lot of the strategies that we use. One interesting thing is cache warming. And so what you can do is, as the user is typing, you know you're probably going to use some piece of context. And you can know that before the user's done typing. So, you know, as we discussed before, reusing the KV cache results in lower latency, lower costs, across requests.

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

So as the user starts typing, you can immediately warm the cache with, let's say, the current file contents. And then when they press enter, there's very few tokens it actually has to pre-fill and compute before starting the generation. This will significantly lower TTFT.
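A minimal sketch of the cache-warming idea, assuming a toy `PrefixCache` class (hypothetical, not any real serving API) that only counts how many prompt tokens would still need pre-fill once a shared prefix is already cached:

```python
from dataclasses import dataclass, field


@dataclass
class PrefixCache:
    """Toy stand-in for a server-side KV cache keyed by token prefix."""
    cached: list = field(default_factory=list)

    def warm(self, tokens):
        # Pretend to run pre-fill over `tokens` and keep the keys/values.
        self.cached = list(tokens)

    def tokens_to_prefill(self, prompt):
        # Only tokens after the longest shared prefix need pre-fill.
        shared = 0
        for a, b in zip(self.cached, prompt):
            if a != b:
                break
            shared += 1
        return len(prompt) - shared


file_tokens = "def add ( a , b ) : return a + b".split()
query_tokens = "rename add to sum".split()

cache = PrefixCache()
cold = cache.tokens_to_prefill(file_tokens + query_tokens)
cache.warm(file_tokens)  # happens while the user is still typing
warm = cache.tokens_to_prefill(file_tokens + query_tokens)
print(cold, warm)
```

In this toy run the cold request has to pre-fill the whole prompt, while the warmed one only pre-fills the short query typed after the file contents.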

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Yeah. So the way transformers work, I mean, the mechanism that allows transformers to not just independently look at each token, but see previous tokens, is the keys and values in attention.

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

And generally the way attention works is you have at your current token some query, and then you have all the keys and values of all your previous tokens, which are some kind of representation that the model stores internally of all the previous tokens in the prompt. And by default, when you're doing a chat, the model has to, for every single token, do this forward pass through the entire model.

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

That's a lot of matrix multiplies that happen, and that is really, really slow. Instead, if you have already done that, and you stored the keys and values, and you keep that in the GPU...

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Then, let's say I have stored it for the last n tokens. If I now want to compute the output token for the n plus one token, I don't need to pass those first n tokens through the entire model because I already have all those keys and values. And so you just need to do the forward pass through that last token.

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

And then when you're doing attention, you're reusing those keys and values that have been computed, which is the only kind of sequential part or sequentially dependent part of the transformer.
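The KV-cache argument above can be illustrated with a deliberately tiny, single-head, single-layer attention in pure Python. The weight matrices and token vectors are made-up values and the `KVCache` class is a hypothetical helper; the point is only that projecting the new token and reusing stored keys and values gives the same output as recomputing everything.

```python
import math


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def attend(q, keys, values):
    # Scaled dot-product attention of one query over all stored keys/values.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


# Made-up projection matrices for illustration.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.5], [-0.5, 0.5]]
W_V = [[2.0, 0.0], [0.0, 3.0]]


def last_output_no_cache(tokens):
    # Recomputes keys and values for every token in the prompt on each call.
    keys = [matvec(W_K, t) for t in tokens]
    values = [matvec(W_V, t) for t in tokens]
    return attend(matvec(W_Q, tokens[-1]), keys, values)


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        # Only the new token is projected; earlier keys/values are reused.
        self.keys.append(matvec(W_K, token))
        self.values.append(matvec(W_V, token))
        return attend(matvec(W_Q, token), self.keys, self.values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(t) for t in tokens]
full_output = last_output_no_cache(tokens)
print(cached_outputs[-1], full_output)
```

A real model has many layers and heads, but the same reuse applies per layer because earlier tokens never attend to later ones.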

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Yeah, there's other types of caching you can kind of do. One interesting thing that you can do for Cursor Tab is you can basically predict ahead as if the user would have accepted the suggestion and then trigger another request. And so then you've cached, you've done this speculative, it's a mix of speculation and caching, right?

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Because you're speculating what would happen if they accepted it. And then you have this value that is cached, this suggestion. And then when they press tab, the next one would be waiting for them immediately. It's a kind of clever heuristic slash trick that uses a higher-level caching, and it feels fast despite there not actually being any changes in the model.
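A rough sketch of that speculate-then-cache pattern, assuming a hypothetical `SpeculativeTab` wrapper and a toy `toy_model` function in place of a real completion model. The prefetch here runs synchronously for simplicity; a real editor would presumably issue it in the background.

```python
class SpeculativeTab:
    def __init__(self, model):
        self.model = model      # hypothetical callable: text -> suggestion
        self.prefetched = {}    # document text -> precomputed suggestion
        self.model_calls = 0
        self.hits = 0

    def _predict(self, text):
        self.model_calls += 1
        return self.model(text)

    def suggest(self, text):
        # Serve from the speculative cache when the user did accept.
        suggestion = self.prefetched.pop(text, None)
        if suggestion is None:
            suggestion = self._predict(text)
        else:
            self.hits += 1
        # Speculate: assume this suggestion is accepted and precompute the next.
        accepted = text + suggestion
        self.prefetched[accepted] = self._predict(accepted)
        return suggestion


def toy_model(text):
    # Stand-in for a completion model; output depends only on input length.
    return f"<{len(text)}>"


tab = SpeculativeTab(toy_model)
text = "x"
first = tab.suggest(text)
text += first                # user presses tab and accepts
second = tab.suggest(text)   # already waiting in the cache
print(first, second, tab.hits, tab.model_calls)
```

The second suggestion is returned without a fresh model call on the critical path, which is why the interaction feels faster even though the model itself is unchanged.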

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

Yeah, it is a little different than speed. But, I mean, like, technically you tie it back in because you can get away with the smaller model if you RL your smaller model and it gets the same performance as the bigger one. And while I was mentioning stuff about reducing the size of your KV cache, there are other techniques there as well that are really helpful for speed.

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

So kind of back in the day, like all the way two years ago, people mainly used multi-head attention. And I think there's been a migration towards more efficient attention schemes like grouped-query or multi-query attention. And this is really helpful, with larger batch sizes, for being able to generate the tokens much faster.
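One way to see why fewer key/value heads help is to estimate KV cache size. The formula below is the standard back-of-the-envelope count (keys plus values, per layer, per KV head); the model dimensions are hypothetical round numbers, not the configuration of any particular model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    # Factor of 2: one tensor for keys and one for values in each layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical model shape, 16-bit values, one sequence of 8192 tokens.
config = dict(layers=32, head_dim=128, seq_len=8192, batch=1)
mha = kv_cache_bytes(kv_heads=32, **config)  # multi-head: one KV head per query head
gqa = kv_cache_bytes(kv_heads=8, **config)   # grouped-query: query heads share KV heads
mqa = kv_cache_bytes(kv_heads=1, **config)   # multi-query: a single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(name, size / 2**20, "MiB")
```

Under these assumptions the cache shrinks from 4 GiB to 1 GiB to 128 MiB per sequence, which leaves room for larger batches and means less memory to read per generated token.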

Lex Fridman Podcast
#447 – Cursor Team: Future of Programming with AI

The interesting thing here is this now has no effect on that time-to-first-token pre-fill speed. The thing this matters for is now generating tokens. And why is that? Because when you're generating tokens, instead of... being bottlenecked by doing these super-parallelizable matrix multiplies across all your tokens.