
Short Wave

When AI Cannibalizes Its Data

Tue, 18 Feb 2025

Description

Asked ChatGPT anything lately? Talked with a customer service chatbot? Read the results of Google's "AI Overviews" summary feature? If you've used the Internet lately, chances are, you've consumed content created by a large language model. These models, like DeepSeek-R1 or OpenAI's ChatGPT, are kind of like the predictive text feature in your phone on steroids. In order for them to "learn" how to write, the models are trained on millions of examples of human-written text. Thanks in part to these same large language models, a lot of content on the Internet today is written by generative AI. That means that AI models trained nowadays may be consuming their own synthetic content ... and suffering the consequences.

View the AI-generated images mentioned in this episode.

Have another topic in artificial intelligence you want us to cover? Let us know by emailing [email protected]!

Listen to every episode of Short Wave sponsor-free and support our work at NPR by signing up for Short Wave+ at plus.npr.org/shortwave.

Learn more about sponsor message choices: podcastchoices.com/adchoices

NPR Privacy Policy

Transcription

Chapter 1: What is the main topic of this episode?

Chapter 2: How do large language models learn?

59.105 - 74.131 Ilya Shumailov

Large language models are statistical beasts that learn from examples of human-written text and learn to produce text that is similar to the text the model was taught on.
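
For readers who want something concrete, here is a toy sketch in Python (not from the episode; the corpus and function names are made up for illustration) of that "statistical beast" idea: a bigram model that counts which word tends to follow which in human-written text, then samples new text from those counts. Real large language models are enormously more capable, but the basic loop of learning the statistics of the training text and then generating text with similar statistics is the same.

```python
import random
from collections import defaultdict, Counter

def train_bigram(corpus_words):
    """Count, for each word, how often each possible next word follows it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_words, corpus_words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length=10):
    """Sample a continuation by repeatedly drawing the next word in
    proportion to how often it followed the current word in training."""
    word, out = start, [start]
    for _ in range(length):
        followers = counts.get(word)
        if not followers:
            break
        words, freqs = zip(*followers.items())
        word = random.choices(words, weights=freqs)[0]
        out.append(word)
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug".split()
model = train_bigram(corpus)
print(generate(model, "the"))   # e.g. "the cat sat on the rug"
```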

74.551 - 96.41 Regina Barber

That's Ilya Shumailov. He's a computer scientist and he says in order to teach these models, scientists have to train them on a lot of human written examples. Like, they basically make the models read the entire internet. which works for a while. But nowadays, thanks in part to these same large language models, a lot of the content on our internet is written by generative AI.

97.791 - 115.674 Ilya Shumailov

If you were to randomly sample data from the internet today, I'm sure you'll find that a bigger proportion of it is generated by machines. But this is not to say that the data itself is bad. The main question is how much of this data is generated ...

120.31 - 135.866 Regina Barber

In the spring of 2023, Ilya was a research fellow at the University of Oxford. And he and his brother were talking over lunch. They were like, OK, if the Internet is full of machine-generated content and that machine-generated content goes into future machines, what's going to happen?

136.718 - 158.526 Ilya Shumailov

Quite a lot of these models, especially back at the time, were relatively low quality. So there are errors and there are biases. There are systematic biases inside of those models. And thus, you can kind of imagine the case where, rather than learning useful contexts and useful concepts, you can actually learn things that don't exist. They are purely hallucinations.

Chapter 3: What are the consequences of AI consuming its own content?

158.646 - 171.575 Regina Barber

Ilya and his team did this research study indicating that eventually, any large language model that learns from its own synthetic data would start to degrade over time, producing results that got worse and worse and worse.

176.363 - 181.248 Ilya Shumailov

In the simple theoretical setups we considered, you're guaranteed to collapse.

182.21 - 197.166 Regina Barber

So today on the show, AI model collapse. What happens when a large language model reads too much of its own content? And could it limit the future of generative AI? I'm Regina Barber, and you're listening to Short Wave, the science podcast from NPR.

201.448 - 217.695 Advertisement Narrator

This message comes from WISE, the app for doing things in other currencies. With WISE, you can send, spend, or receive money across borders, all at a fair exchange rate. No markups or hidden fees. Join millions of customers and visit WISE.com. T's and C's apply.

219.679 - 234.818 Regina Barber

OK, Ilya, before we get into the big problem of like model collapse, I think we need to understand why these errors are actually happening. So can you explain to me what kinds of errors do you get from a large language model and like how do they happen? Why do they happen?

235.583 - 263.356 Ilya Shumailov

So there are three sources, three primary sources of error that we still have. So the very first one is basically just data-associated errors. And usually those are questions along the lines of, do we have enough data to approximate a given process? So if some things happen very infrequently in your underlying distribution, your model may get a wrong perception that, like,

263.939 - 288.978 Ilya Shumailov

that some things are impossible. [Regina Barber: Wait, what do you mean, they are impossible?] Like, an example I've seen on Twitter was, if you google for a baby peacock, you'll discover pictures of birds that look relatively realistic, but they are not peacocks at all. They are completely generated, and you will not find a real picture. But if you try learning anything from it, of course you're

289.558 - 291.38 Ilya Shumailov

going to be absorbing this bias.

291.42 - 304.455 Regina Barber

Right. You're like telling me now that there's like a lot of fake baby peacock images, but machines don't know that. Right. They're just going to think, great, this is a baby peacock. And also there's not that many like real baby peacock images to compare it to.

305.337 - 319.708 Ilya Shumailov

Exactly. And those are the kinds of errors that you don't normally see that often because they are so improbable, right? And if people are going to start reporting things to you and saying, oh, your model is wrong here, they're likely to notice things that on average are wrong.
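
A tiny numerical sketch of that data problem (illustrative only; the categories and probabilities are made up): estimate a distribution from a finite sample. A genuinely rare category often never appears in the sample at all, so the fitted model gives it probability zero and, in effect, treats it as impossible.

```python
import random
from collections import Counter

random.seed(0)

# The "true" world: real baby peacock photos exist, but they are very rare.
true_probs = {"cat": 0.600, "dog": 0.399, "baby_peacock": 0.001}

# Scrape a finite training set from that world.
sample = random.choices(list(true_probs), weights=list(true_probs.values()), k=200)

# Fit a model by empirical frequencies (maximum likelihood estimation).
counts = Counter(sample)
model = {category: counts[category] / len(sample) for category in true_probs}
print(model)

# With only 200 examples, "baby_peacock" very likely never shows up,
# so the fitted model assigns it probability 0: to the model, the rare
# thing has become impossible.
```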

Chapter 4: What types of errors occur in large language models?

320.068 - 337.603 Ilya Shumailov

But if they're wrong in some small part of the internet that nobody really cares about, then it's very unlikely that you will even notice that you're making a mistake. And usually this is the problem, because as the number of dimensions grows, you will discover that the volume in the tails is going to grow disproportionately.

337.803 - 342.168 Regina Barber

Not just babies, but baby birds. Not just baby birds, but baby peacocks.

342.688 - 347.273 Ilya Shumailov

Yeah, exactly. So as a result, you'll discover that you need to capture quite a bit.
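
One way to get a feel for that "tails grow with the number of dimensions" point (a back-of-the-envelope sketch, not from the study): if each feature of a data point is standard normal and independent, the chance that the point is "typical" in every single dimension, here meaning within two standard deviations, shrinks exponentially as dimensions are added, so almost every point ends up in a tail along some dimension.

```python
from math import erf, sqrt

# Probability that one standard-normal feature lies within 2 standard deviations.
p_central = erf(2 / sqrt(2))        # about 0.954

for d in (1, 10, 100, 1000):
    p_all_central = p_central ** d  # typical in every one of d independent dimensions
    print(f"d = {d:4d}: P(in a tail for at least one dimension) = {1 - p_all_central:.3f}")

# d =    1: 0.046
# d =   10: 0.372
# d =  100: 0.990
# d = 1000: 1.000
```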

347.333 - 351.358 Regina Barber

Okay, so that's one kind of problem, a data problem. What are the other two?

352.222 - 375.914 Ilya Shumailov

On top of it, we have errors that come from learning regimes and from the models themselves. So on learning regimes: we are all training our models, and all of them are structurally biased. So basically, that is to say, your model is going to be good, but it's unlikely to be optimal. So it's likely to have some errors somewhere. And this was error source number two.

376.454 - 388.985 Ilya Shumailov

And error source number three is that the actual model design, what shape and form your model should be taking, is very much alchemy. Nobody really knows why stuff works. We kind of just know empirically stuff works.

389.025 - 397.552 Regina Barber

It's like a black box. We don't know how it's making these decisions. We don't know where, like you said, in that architecture, it's making those decisions. Yeah.

398.365 - 412.876 Ilya Shumailov

Yeah, which parts of the model are responsible for what? We don't know the fundamental underlying bias of a given model architecture. What we observe is that there is always some sort of an error that is introduced by those architectures.

412.916 - 424.205 Regina Barber

Right, right. Okay, so the three places errors could come from are, like: one, the model itself; two, the way it's trained, right? And three, the data, or the lack of data, that it's trained on.

Chapter 5: What are the three sources of error in language models?

506.331 - 516.941 Ilya Shumailov

Originally improbable events are even more improbable for the subsequent model, and it kind of like snowballs out of control until the whole thing just collapses fully to near zero variance.

516.961 - 522.586 Regina Barber

So instead of this bell curve, you just have like a point in the middle. You just have a whole bunch of stuff in the middle.

523.446 - 545.008 Ilya Shumailov

Exactly. And the thing is, you can theoretically describe this. It's actually very simple. And you can run these experiments however many times you want. And you'll discover that, even if you have a lot of data, if you keep on repeating this process (and the rate at which this collapses, you can also bound), you always end up in a state where your improbable events kind of disappear.
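
That simple theoretical setup can be simulated in a few lines (a sketch of the general idea, not the team's code; the sample size and generation count are arbitrary): fit a one-dimensional Gaussian to data, sample a fresh "synthetic" dataset from the fitted model, refit on that, and repeat. Over many generations the estimated spread drifts toward zero, which is exactly the tails disappearing.

```python
import random
import statistics

random.seed(0)

n_samples, n_generations = 100, 1000
mu, sigma = 0.0, 1.0                        # generation 0: the original "human" data

for gen in range(n_generations + 1):
    if gen % 200 == 0:
        print(f"generation {gen:4d}: sigma = {sigma:.6f}")
    # Generate a synthetic dataset from the current model ...
    data = [random.gauss(mu, sigma) for _ in range(n_samples)]
    # ... then fit the next generation's model to that synthetic data.
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data, mu)

# sigma shrinks toward zero across generations: each model's tails are a
# little thinner than the last, until improbable events vanish entirely.
```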

545.469 - 563.005 Ilya Shumailov

In practice, when we grab large language models, we observe that they become more confident in the predictions that they are making. So basically, the improbable events here are going to be things that the model is not very confident about, and normally it would not make predictions about it.

563.465 - 577.511 Ilya Shumailov

So when you're trying to generate more data out of a language model in order for another language model to learn from it, over time, basically, it becomes more and more confident. And then, during the generation setup, it basically gets stuck very often in these repetitive loops.
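
The same dynamic can be sketched for discrete, text-like data (again an illustration with made-up numbers, not the study's code): a small "vocabulary" distribution is refit each generation from text sampled out of the previous generation's model. Once a rare word fails to appear in a sample, its probability drops to zero and it can never return, so the vocabulary only shrinks; run long enough, typically a single word is left, and a model like that can only repeat itself, much like the repetitive loops described above.

```python
import random
from collections import Counter

random.seed(1)

vocab = ["the", "cat", "sat", "on", "mat", "peacock"]
probs = {"the": 0.35, "cat": 0.20, "sat": 0.20, "on": 0.15, "mat": 0.09, "peacock": 0.01}

n_samples = 100
for gen in range(501):
    if gen % 100 == 0:
        surviving = {w: round(p, 2) for w, p in probs.items() if p > 0}
        print(f"generation {gen:3d}: {surviving}")
    # Generate synthetic "text" from the current model ...
    sample = random.choices(vocab, weights=[probs[w] for w in vocab], k=n_samples)
    # ... and refit the next model from the synthetic sample's word frequencies.
    counts = Counter(sample)
    probs = {w: counts[w] / n_samples for w in vocab}

# Rare words disappear first and never come back; the distribution keeps
# concentrating until (typically) one word has probability 1, at which
# point the model can only generate that word over and over.
```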

577.651 - 587.075 Regina Barber

I know this isn't exactly the same, but it makes me think of the telephone game. You know, when you tell somebody a phrase or a couple of sentences, and then the next person tells a person the same two sentences, and then, like, the next person says the same two sentences, and it usually gets, like, more and more garbled as it goes down the line.

588.056 - 611.898 Ilya Shumailov

I think this comparison kind of works, yes. So this is the first thing, it's the improbable events. And then the second thing that happens is your models are going to produce errors, so misunderstandings of the underlying phenomenon, right? And as a result,

612.779 - 631.051 Ilya Shumailov

what you will see is that those errors start propagating as well. And they are relatively correlated. If all of your models are using the same architecture, then they're likely to be wrong in the same correlated kinds of ways. So whenever a model sees errors, it may amplify the same errors that it's observing.

Chapter 6: How do biases affect AI-generated content?

Chapter 7: What can we expect for the future of generative AI?

