Arvind Narayanan
So think about the kind of shock the AI community had back in the day with, I think, GPT-2. It was trained primarily on English text, and they had actually tried to filter out text in other languages to keep it clean, but a tiny amount of text in other languages had gotten in anyway, and it turned out that was enough for the model to pick up a reasonable level of competence at conversing in various other languages.
So these are the kinds of emergent capabilities that really spooked people and that have led to both a lot of hype and a lot of fear about what bigger and bigger models will be able to do. But I think that has pretty much run out, because we're already training on all of the capabilities that humans have expressed and put out there in the form of text, like translating between languages.
So if you make the data set a little more diverse with YouTube video, I don't think that's fundamentally going to change. Multimodal capabilities, yes, there's a lot of room there. But new emergent text capabilities, I'm not sure.

MARK BLYTH: What about synthetic data?
Yeah, let's talk about synthetic data. There are two ways to look at this, right? One is the way synthetic data is being used today, which is not to increase the volume of training data but to overcome limitations in the quality of the training data that we do have.
So for instance, if there's too little data in a particular language, you can try to augment it, or you can have a model solve a bunch of mathematical equations and throw those solutions into the training data. Then, for the next training run, that becomes part of the pre-training, and the model will get better at doing that.
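To make that augmentation idea concrete, here is a minimal sketch in Python. It is not anything from the conversation: programmatically generated arithmetic stands in for a real teacher model, and the file name and JSONL layout are assumptions made purely for illustration.

```python
# Minimal sketch of synthetic-data augmentation for pre-training (illustrative only).
# Simple arithmetic problems are generated and solved programmatically here; in
# practice a strong model would produce the solutions, and a quality filter would
# discard incorrect ones before they enter the corpus.
import json
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def make_arithmetic_example(rng: random.Random) -> dict:
    """Generate one synthetic question/answer pair as a training document."""
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    sym = rng.choice(list(OPS))
    answer = OPS[sym](a, b)  # ground truth; stands in for a verified model answer
    return {
        "text": f"Question: What is {a} {sym} {b}?\nAnswer: {answer}",
        "source": "synthetic-math",  # tag so synthetic examples can be traced or down-weighted
    }

def augment_corpus(path: str, n_examples: int, seed: int = 0) -> None:
    """Append synthetic examples to a JSONL pre-training corpus file."""
    rng = random.Random(seed)
    with open(path, "a", encoding="utf-8") as f:
        for _ in range(n_examples):
            f.write(json.dumps(make_arithmetic_example(rng)) + "\n")

if __name__ == "__main__":
    # "pretraining_corpus.jsonl" is a hypothetical file name for this sketch.
    augment_corpus("pretraining_corpus.jsonl", n_examples=1000)
```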
And the other way to look at synthetic data is, okay, you take 1 trillion tokens, you train a model on it, and then you output 10 trillion tokens, so you get to the next bigger model, and then you use that to output 100 trillion tokens. I'll bet that that's just not going to happen. That's just a snake eating its own tail, and...
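The "snake eating its own tail" intuition can be shown with a toy numerical experiment; this is an editorial illustration, not something from the conversation. A simple "model" (a Gaussian fit) is repeatedly retrained on samples drawn from its own previous fit, with no fresh real data, and the distribution it has learned tends to drift and collapse.

```python
# Toy illustration of recursive training on purely synthetic data (illustrative only).
# Each generation fits a Gaussian to samples produced by the previous generation's fit.
# Because no new real data ever enters, estimation noise compounds and the learned
# distribution typically collapses over time.
import random
import statistics

def fit_gaussian(samples):
    """'Train' the model: estimate mean and standard deviation from data."""
    return statistics.mean(samples), statistics.stdev(samples)

def sample_gaussian(mean, std, n, rng):
    """'Generate synthetic data' from the current model."""
    return [rng.gauss(mean, std) for _ in range(n)]

rng = random.Random(0)
n = 20                                      # samples per generation (small, to make the effect visible)
data = sample_gaussian(0.0, 1.0, n, rng)    # generation 0: "real" data from N(0, 1)

for generation in range(1, 51):
    mean, std = fit_gaussian(data)              # train on the previous generation's output
    data = sample_gaussian(mean, std, n, rng)   # then use only that output as the next "corpus"
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mean={mean:+.3f}  std={std:.3f}")

# The estimated std typically drifts well below 1.0: the chain forgets the original
# distribution because it never learns anything that was not already in its own output.
```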
What we've learned in the last two years is that the quality of data matters a lot more than the quantity of data. So if you're using synthetic data to try to augment the quantity, I think it's just coming at the expense of quality. You're not learning new things from the data. You're only learning things that are already there.