
Arvind Narayanan

👤 Person
528 total appearances


Podcast Appearances

The Twenty Minute VC (20VC): Venture Capital | Startup Funding | The Pitch
20VC: AI Scaling Myths: More Compute is not the Answer | The Core Bottlenecks in AI Today: Data, Algorithms and Compute | The Future of Models: Open vs Closed, Small vs Large with Arvind Narayanan, Professor of Computer Science @ Princeton

So the kind of shock that the AI community had back in the day, I think it was GPT-2: it was trained primarily on English text, and they had actually tried to filter out text in other languages to keep it clean, but a tiny amount of text from other languages had gotten into it, and it turned out that that was enough for the model to pick up a reasonable level of competence for conversing in various other languages.

So these are the kinds of emergent capabilities that really spooked people, and that have led to both a lot of hype and a lot of fear about what bigger and bigger models are going to be able to do. But I think that has pretty much run out, because we're already training on all of the capabilities that humans have expressed and put out there in the form of text, like translating between languages.

So if you make the data set a little bit more diverse with YouTube video, I don't think that's fundamentally going to change. Multimodal capabilities, yes, there's a lot of room there. But new emergent text capabilities, I'm not sure.

[Interviewer] What about synthetic data?

Yeah, let's talk about synthetic data. There are two ways to look at this, right? One is the way in which synthetic data is being used today, which is not to increase the volume of training data, but actually to overcome limitations in the quality of the training data that we do have.

So for instance, if in a particular language there's too little data, you can try to augment that, or you can have a model solve a bunch of mathematical problems and throw those solutions into the training data. For the next training run, that's going to be part of the pre-training, and so the model will get better at doing that.
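The augmentation idea described here can be sketched as a toy. Because arithmetic problems can be generated programmatically with guaranteed-correct answers, a short script can produce quality-controlled worked examples to mix into a pre-training corpus. This is only an illustrative sketch; the function name and the Q/A text format are my own, not anything described in the episode.

```python
import random

def make_synthetic_math_examples(n, seed=0):
    """Generate simple arithmetic worked examples as training text.

    A toy illustration of quality-oriented synthetic data: the
    problems are constructed programmatically, so every answer
    added to the corpus is correct by construction.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        a, b = rng.randint(2, 99), rng.randint(2, 99)
        op = rng.choice(["+", "*"])
        answer = a + b if op == "+" else a * b
        # One pre-training document per problem, in a Q/A format.
        examples.append(f"Q: What is {a} {op} {b}?\nA: {answer}")
    return examples

for ex in make_synthetic_math_examples(3):
    print(ex)
```

The point of the sketch is the quality guarantee: unlike scraped text, every generated document is verifiably correct before it enters the training mix.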

And the other way to look at synthetic data is: okay, you take 1 trillion tokens, you train a model on it, and then it outputs 10 trillion tokens, so you get to the next, bigger model, and then you use that to output 100 trillion tokens. I'll bet that that's just not going to happen. That's just a snake eating its own tail, and...
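The "snake eating its own tail" worry can be illustrated with a toy simulation, which is my own construction and not something from the episode. Here "training" means fitting a normal distribution to data and "generating" means sampling from the fit; each generation trains only on the previous generation's samples. No new information ever enters the loop, and with finite samples the fitted spread drifts, in expectation downward, so tail behavior is gradually lost. Any single run is noisy, so the code asserts nothing about monotonic decline.

```python
import random
import statistics

def collapse_demo(generations=8, n_samples=200, seed=1):
    """Toy illustration of recursively training on a model's own output.

    'Train' = fit a normal distribution; 'generate' = sample from the
    fit. Returns the fitted standard deviation at each generation.
    """
    rng = random.Random(seed)
    # Generation 0 trains on "real" data drawn from N(0, 1).
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    spreads = []
    for _ in range(generations):
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        spreads.append(sigma)
        # The next generation sees only the current model's samples.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
    return spreads

print(collapse_demo())
```

The design choice to track only the standard deviation mirrors the argument in the quote: recursive generation can at best preserve what the data already contains, and in practice it slowly forgets the tails.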

What we've learned in the last two years is that the quality of data matters a lot more than the quantity of data. So if you're using synthetic data to try to augment the quantity, I think it's just coming at the expense of quality. You're not learning new things from the data. You're only learning things that are already there.