Aman Sanger
Podcast Appearances
One interesting proof of concept for learning this knowledge directly in the weights is VS Code. So we're in a VS Code fork, and the VS Code code is all public. So these models in pre-training have seen all the code. They've probably also seen questions and answers about it, and then they've been fine-tuned and RLHFed to be able to answer questions about code in general.
So when you ask it a question about VS Code, sometimes it'll hallucinate, but sometimes it actually does a pretty good job at answering the question. And I think it just happens to be okay at it. But what if you could specifically train or post-train a model such that it really was built to understand this code base?
It's an open research question, one that we're quite interested in. And then there's also uncertainty of, do you want the model to be the thing that end to end is doing everything, i.e. it's doing the retrieval in its internals and then answering the question, creating the code?
Or do you want to separate the retrieval from the frontier model where maybe, you know, you'll get some really capable models that are much better than like the best open source ones in a handful of months? Yeah. And then you'll want to separately train a really good open source model to be the retriever, to be the thing that feeds in the context to these larger models.
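The split described here, a separate retriever that feeds context to a larger model, can be sketched roughly as follows. This is only an illustration: the token-overlap scorer stands in for a trained retriever model, and the names `retrieve` and `build_prompt` are hypothetical, not anything from Cursor.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into identifier-like tokens.
    return re.findall(r"[a-z0-9_]+", text.lower())


def score(query, chunk):
    # Toy relevance score: number of overlapping tokens.
    # Assumption: a real system would use a trained retriever model here.
    q = Counter(tokenize(query))
    c = Counter(tokenize(chunk))
    return sum(min(q[t], c[t]) for t in q)


def retrieve(query, chunks, k=2):
    # Return the k chunks that score highest against the query.
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks, k=2):
    # The retrieved context is what gets handed to the larger model.
    context = "\n\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The point of the split is that the retriever can be swapped or retrained on one repository without touching the larger model that answers the question.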
Is this... Yeah, I mean, there are many possible ways you could try doing it. There's certainly no shortage of ideas. It's just a question of going in and trying all of them and being empirical about which one works best. One very naive thing is to try to replicate what's done with VS Code and these frontier models.
So let's continue pre-training, some kind of continued pre-training that includes general code data, but also throws in a lot of the data of some particular repository that you care about.
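As a rough sketch of that data mix: the function below samples a continued pre-training corpus from general code data and from one repository. The name `mix_corpus` and the `repo_fraction` value are assumptions made for illustration; the right ratio is exactly the kind of thing that would have to be found empirically.

```python
import random


def mix_corpus(general_docs, repo_docs, repo_fraction=0.3, n_samples=1000, seed=0):
    # Build a training mix where roughly `repo_fraction` of the samples
    # come from the repository you care about and the rest from general code.
    # Assumption: 0.3 is an arbitrary placeholder, not a recommended value.
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_samples):
        source = repo_docs if rng.random() < repo_fraction else general_docs
        mixed.append(rng.choice(source))
    return mixed
```

Sampling with replacement means a small repository gets repeated many times in the mix, which is one simple way to "throw in a lot" of its data.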
And then in post-training, meaning in, let's just start with instruction fine-tuning, you have like a normal instruction fine-tuning data set about code, but you throw in a lot of questions about code in that repository. So you could either get ground truth ones, which might be difficult, or you could do what you kind of hinted at or suggested using synthetic data, i.e.,