Aman Sanger

👤 Speaker
1050 total appearances

Podcast Appearances

Lex Fridman Podcast
#446 – Ed Barnhart: Maya, Aztec, Inca, and Lost Civilizations of South America

One interesting proof of concept for learning this knowledge directly in the weights is VS Code. We're in a VS Code fork, and VS Code's code is all public, so these models have seen all of it in pre-training. They've probably also seen questions and answers about it, and then they've been fine-tuned and RLHF'd to be able to answer questions about code in general.

So when you ask it a question about VS Code, sometimes it'll hallucinate, but sometimes it actually does a pretty good job of answering the question. And I think that's just incidental; it happens to be okay at it. But what if you could specifically train or post-train a model so that it was really built to understand this codebase?

It's an open research question, and one that we're quite interested in. There's also uncertainty about whether you want the model to do everything end to end, i.e., performing the retrieval in its internals and then answering the question and creating the code.

Or do you want to separate the retrieval from the frontier model, where maybe in a handful of months you'll get some really capable models that are much better than the best open-source ones? Then you'd want to separately train a really good open-source model to be the retriever, the thing that feeds context into these larger models.
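The split described here, a smaller retriever feeding context into a larger frontier model, can be sketched as a two-stage pipeline. This is a toy illustration, not Cursor's actual system: the bag-of-words "embedding" stands in for a trained retrieval model, and the prompt format is invented.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a trained retriever would replace this."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank codebase chunks by similarity to the query; the top-k become context."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble retrieved context into the prompt sent to the larger model."""
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The design point is that `retrieve` and the downstream model are independent components, so the retriever can be a small open-source model trained separately from whatever frontier model consumes `build_prompt`'s output.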

Yeah, there are many possible ways you could try doing it. There's certainly no shortage of ideas; it's just a question of going in, trying all of them, and being empirical about which one works best. One very naive thing is to try to replicate what's done with VS Code and these frontier models.

So let's do some kind of continued pre-training that includes general code data, but also throws in a lot of the data of some particular repository that you care about.

And then in post-training (let's just start with instruction fine-tuning), you have a normal instruction fine-tuning dataset about code, but you throw in a lot of questions about code in that repository. You could either get ground-truth question-answer pairs, which might be difficult, or you could do what you hinted at and use synthetic data.
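The synthetic-data idea can be sketched as template-generated question-answer pairs over a repository's own code. The sketch below just pairs each documented function with its docstring using Python's `ast` module; the question template and output format are invented here, and a real pipeline would use a model to write richer questions and answers.

```python
import ast


def synthetic_qa(source: str) -> list[dict]:
    """Turn each documented function in a source file into a (question, answer) pair."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # only functions with docstrings yield a ground-truth answer
                pairs.append({
                    "question": f"What does the function `{node.name}` do in this repository?",
                    "answer": doc,
                })
    return pairs
```

Run over every file in the target repository, this yields repo-specific examples to mix into the general instruction fine-tuning dataset.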
