Aman Sanger
I think for the more aggressive things, where you're making larger changes that take longer periods of time, you'll probably want to do this in some sandboxed remote environment. And that's another incredibly tricky problem: how do you exactly reproduce, or mostly reproduce to the point of it being effectively equivalent for running code, the user's environment with this remote sandbox?

I'm curious what kind of agency you want for coding. Do you want them to find bugs? Do you want them to implement new features? What agency do you want?
Yeah. I mean, it's really interesting that these models are so bad at bug finding when just naively prompted to find a bug. They're incredibly poorly calibrated.
Exactly. Even o1. How do you explain that?
I think these models are a really strong reflection of the pre-training distribution. I do think they generalize as the loss gets lower and lower, but I don't think the loss is low enough, or the scale large enough, that they're really fully generalizing in code. The things that we use the frontier models for,
the things they're quite good at, are really code generation and question answering. And these things exist in massive quantities in pre-training, with all of the code on GitHub on the scale of many, many trillions of tokens, and questions and answers on things like Stack Overflow and maybe GitHub issues. And so when you try to push into these things that really don't exist
very much online, like, for example, the Cursor Tab objective of predicting the next edit given the edits done so far, the brittleness kind of shows. And then bug detection is another great example, where there aren't really that many examples of actually detecting real bugs and then proposing fixes, and the models just really struggle at it. But I think it's a question of transferring the model.
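To make the next-edit-prediction objective concrete, here is a minimal sketch of how edit history plus the current cursor context might be serialized into a single training prompt, with the next edit as the target. This is purely illustrative: the field names, tag format, and `build_prompt` function are invented for this example and are not Cursor's actual data format.

```python
def build_prompt(edit_history, current_snippet):
    """Serialize prior edits plus the current code context into one
    prompt string; the training target would be the *next* edit.

    Hypothetical format: each past edit becomes an <edit> block with
    the removed (-) and added (+) text, followed by a <current> block
    holding the code around the cursor.
    """
    parts = []
    for e in edit_history:
        parts.append(
            f"<edit file={e['file']}>\n-{e['before']}\n+{e['after']}\n</edit>"
        )
    parts.append(f"<current>\n{current_snippet}\n</current>")
    return "\n".join(parts)


# Example: one prior edit fixed a subtraction bug; the model would be
# trained to predict an analogous fix in the code it is now looking at.
history = [
    {"file": "utils.py", "before": "return a - b", "after": "return a + b"},
]
prompt = build_prompt(history, "def mul(a, b):\n    return a - b")
print(prompt)
```

The point of the framing above is that this kind of (edit history, next edit) pair barely exists in web-scale pre-training data, which is why such objectives need dedicated data or nudging.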
In the same way that you get this fantastic transfer from models pre-trained just on code in general to the Cursor Tab objective, you'll see a very, very similar thing with generalized models that are really good at code transferring to bug detection. It just takes a little bit of nudging in that direction.