Aman Sanger
Podcast Appearances
Yeah, so... I think a lot of us, well, all of us were originally Vim users.
Pure Vim, yeah. No Neovim, just pure Vim in a terminal. And at least for myself, it was around the time that Copilot came out, so 2021, that I really wanted to try it. So I went into VS Code, the only code editor in which it was available. And even though I really enjoyed using Vim, just the experience of Copilot with VS Code was more than good enough to convince me to switch.
And so that kind of was the default until we started working on Cursor.
Yeah, I distinctly remember there was this one conversation I had with Michael. Before that, I hadn't thought super deeply and critically about scaling laws. And he kind of posed the question: why isn't scaling all you need, or why isn't scaling going to result in massive gains in progress? And I think I went through, like, the stages of grief.
There is anger, denial, and then finally at the end, just thinking about it, acceptance. And I think I've been quite hopeful and optimistic about progress since. One thing I'll caveat is that I think it also depends on which domains you're going to see progress in.
Math is a great domain, especially formal theorem proving, because you get this fantastic signal of actually verifying whether the thing was correct. And so this means something like RL can work really, really well. And I think you could have systems that are perhaps very superhuman in math and still not technically have AGI.
Yeah, I mean, I think this is a space that is quite interesting, perhaps quite unique, where if you look at previous tech waves, maybe there's kind of one major thing that happened and it unlocked a new wave of companies.
But every single year, every single model capability or jump you get in model capabilities, you now unlock this new wave of features, things that are possible, especially in programming. And so I think in AI programming, being even just a few months ahead, let alone a year ahead, makes your product much, much, much more useful.
I think the Cursor a year from now will need to make the Cursor of today look obsolete. And I think, you know, Microsoft has done a number of fantastic things, but I don't think they're in a great place to really keep innovating and pushing on this in the way that a startup can. Just rapidly implementing features.
And kind of doing the research experimentation necessary to really push the ceiling.
Often the same person even.
There's this interesting thing where if you look at language model loss on different domains, I believe the bits per byte, which is kind of character normalized loss for code is lower than language, which means in general, there are a lot of tokens in code that are super predictable, a lot of characters that are super predictable.
And this is, I think, even magnified when you're not just trying to autocomplete code, but predicting what the user is going to do next in their editing of existing code. And so, you know, the goal of Cursor Tab is: let's eliminate all the low-entropy actions you take inside of the editor. When the intent is effectively determined, let's just jump you forward in time, skip you forward.
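As a rough illustration of the bits-per-byte measure mentioned above: it takes the model's total negative log-likelihood over a text, converts it to bits, and divides by the text's length in bytes, so that losses are comparable across tokenizers. The sketch below assumes per-token losses are given in nats; the function name and the numbers in the usage note are made up for illustration and are not measurements from any real model.

```python
import math
from typing import Sequence


def bits_per_byte(token_nll_nats: Sequence[float], text: str) -> float:
    """Byte-normalized loss: total bits the model needs, per UTF-8 byte of text.

    token_nll_nats: negative log-likelihood of each token, in nats
                    (assumed to come from some language model; hypothetical input).
    text:           the exact text those tokens encode.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes


# Illustrative only: two tokens covering 10 bytes, each costing ln(2) nats (1 bit).
example = bits_per_byte([math.log(2), math.log(2)], "return a+b")
```

Here `example` works out to 2 bits over 10 bytes. A lower value means more of the characters are predictable, which is the sense in which code is described as lower-entropy than natural language.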
Yeah. I think I can speak to a few of the details on how to make these things work. They need to be incredibly low latency, so you need to train small models on this task. In particular, they're incredibly prefill-token hungry. What that means is they have these really, really long prompts where they see a lot of your code, and they're not actually generating that many tokens.
And so the perfect fit for that is using a sparse model, meaning an MoE model. So that was one breakthrough we made that substantially improved performance at longer context. The other being a variant of speculative decoding that we built out, called speculative edits.
These are two, I think, important pieces of what makes it quite high quality and very fast.
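The passage names speculative edits but does not describe how they work, so the following is only a toy sketch under a stated assumption: that the existing code serves as the draft, the model checks a chunk of it at a time, and single tokens are generated only where the model disagrees. Every name here (`speculative_edit`, `make_toy_model`, `chunk_size`, `window`) is hypothetical, the "model" is a stand-in that replays a fixed target, and the inner loop stands in for what would be one batched forward pass in a real system.

```python
from typing import Callable, List, Sequence, Tuple

EOS = "<eos>"


def speculative_edit(
    original: Sequence[str],
    next_token: Callable[[List[str]], str],  # hypothetical greedy model: prefix -> next token
    chunk_size: int = 4,
    window: int = 8,
    max_tokens: int = 1000,
) -> Tuple[List[str], int]:
    """Greedy decoding that uses the original code as the draft (toy sketch)."""
    out: List[str] = []
    pos = 0      # where in `original` the next draft chunk starts
    passes = 0   # number of verification passes (one batched call each, in principle)
    while len(out) < max_tokens:
        draft = list(original[pos:pos + chunk_size])
        passes += 1
        accepted = 0
        correction = EOS
        # Check each draft token against the model; the first mismatch (or the
        # position just past the chunk) yields the model's own token for free.
        for i in range(len(draft) + 1):
            predicted = next_token(out + draft[:i])
            if i < len(draft) and predicted == draft[i]:
                accepted += 1
            else:
                correction = predicted
                break
        out.extend(draft[:accepted])
        pos += accepted
        if correction == EOS:
            break
        out.append(correction)
        # Heuristic re-alignment (an assumption): if the model's token appears
        # shortly ahead in the original, resume drafting right after it.
        lookahead = list(original[pos:pos + window])
        if correction in lookahead:
            pos += lookahead.index(correction) + 1
    return out, passes


def make_toy_model(target: Sequence[str]) -> Callable[[List[str]], str]:
    """Stand-in model that always continues toward a fixed target sequence."""
    def next_token(prefix: List[str]) -> str:
        return target[len(prefix)] if len(prefix) < len(target) else EOS
    return next_token


original = "def add ( a , b ) : return a + b".split()
target = "def add ( x , b ) : return x + b".split()
edited, passes = speculative_edit(original, make_toy_model(target))
```

Because every accepted token is checked against the model's own prediction, the result matches plain greedy decoding; a bad alignment guess only costs extra passes. In this toy run the rename needs fewer passes than one per output token, since unchanged stretches of the original are accepted a chunk at a time.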
Caching plays a huge role. Because you're dealing with this many input tokens, if on every single keystroke you type in a given line you had to rerun the model on all of those tokens passed in, you're going to, one, significantly degrade latency and, two, kill your GPUs with load. So you need to design the actual prompts used for the model such that they're caching-aware.
And then, yeah, you need to reuse the KV cache across requests just so that you're spending less work, less compute.
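To make the caching point concrete, here is a minimal sketch of why prompt layout matters for KV-cache reuse. It assumes a cache that can only be reused for the longest prefix shared with the previous request, and it counts how many tokens would have to be recomputed per keystroke under two layouts. The class, the two layout functions, and the token counts are illustrative assumptions, not a description of any real serving system.

```python
from typing import List, Sequence


class PrefixCache:
    """Toy stand-in for a KV cache that is reusable only for a shared prefix."""

    def __init__(self) -> None:
        self.cached: List[str] = []

    def process(self, tokens: Sequence[str]) -> int:
        """Return how many tokens must be recomputed for this request."""
        shared = 0
        for old, new in zip(self.cached, tokens):
            if old != new:
                break
            shared += 1
        self.cached = list(tokens)
        return len(tokens) - shared


def caching_aware_prompt(file_context: List[str], line: str) -> List[str]:
    # Stable context first, the line being typed last: keystrokes only touch the tail.
    return file_context + list(line)


def naive_prompt(file_context: List[str], line: str) -> List[str]:
    # Volatile text first: every keystroke shifts the whole context and defeats reuse.
    return list(line) + file_context


def total_recomputed(layout, file_context: List[str], keystrokes: Sequence[str]) -> int:
    cache = PrefixCache()
    return sum(cache.process(layout(file_context, line)) for line in keystrokes)


file_context = [f"tok{i}" for i in range(1000)]   # hypothetical 1000-token file
keystrokes = ["r", "re", "ret", "retu"]            # the line after each keystroke

aware_cost = total_recomputed(caching_aware_prompt, file_context, keystrokes)
naive_cost = total_recomputed(naive_prompt, file_context, keystrokes)
```

With the stable context in front, only the first request pays for the full prompt and each later keystroke recomputes a single token; with the typed line in front, nearly the whole prompt is recomputed every time.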
This is what we're talking about.
And there's a chance this is also not the final version of it.