Ilya Shumailov
Appearances
Short Wave
When AI Cannibalizes Its Data
Quite a lot of these models, especially back at the time, were relatively low quality. So there are errors and there are biases, systematic biases, inside of those models. And thus you can kind of imagine the case where, rather than learning useful contexts and useful concepts, you actually learn things that don't exist. They are purely hallucinations.
In simple theoretical setups we consider, you're guaranteed to collapse.
So there are three sources, three primary sources, of error that we still have. The very first one is basically just data-associated errors. And usually those are questions along the lines of: do we have enough data to approximate a given process? So if some things happen very infrequently in your underlying distribution, your model may get a wrong perception that, like,
that some things are impossible. Wait, what do you mean by they are impossible? Like, an example I've seen on Twitter was, if you Google for a baby peacock, you'll discover pictures of birds that look relatively realistic, but they are not peacocks at all. They are completely generated, and you will not find a real picture. But if you try learning anything from it, of course you're...
Exactly. And those are the kinds of errors that you don't normally see that often because they are so improbable, right? And if people are going to start reporting things to you and saying, oh, your model is wrong here, they're likely to notice things that on average are wrong.
But if they're wrong in some small part of the internet that nobody really cares about, then it's very unlikely that you will even notice that you're making a mistake. And usually this is the problem, because as the number of dimensions grows, you will discover that the volume in the tails is going to grow disproportionately.
Yeah, exactly. So as a result, you'll discover that you need to capture quite a bit.
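To make the tail-volume point concrete, here is a minimal sketch, not from the episode: for a standard Gaussian, the probability that at least one coordinate falls more than two standard deviations from the mean grows rapidly with the number of dimensions, so tail behaviour quickly stops being negligible.

```python
# Probability that a standard Gaussian vector in d dimensions has at least
# one coordinate more than 2 standard deviations from its mean.
import math

p_inside = math.erf(2 / math.sqrt(2))  # P(|Z| <= 2) for a single coordinate, ~0.954

for d in (1, 10, 100, 1000):
    p_tail = 1 - p_inside ** d         # P(at least one coordinate is in the tail)
    print(f"d = {d:4d}: tail probability ~ {p_tail:.3f}")
```

Already at a hundred dimensions, almost every sample is "in the tail" along some coordinate, which is why rare events are so hard to cover with finite data.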
On top of it, we have errors that come from learning regimes and from the models themselves. So, on learning regimes: we are all training our models, and all of those training procedures are structurally biased. So basically, your model is going to be good, but it's unlikely to be optimal. So it's likely to have some errors somewhere. And this was error source number two.
And error source number three is that the actual model design, what shape and form your model should be taking, is very much alchemy. Nobody really knows why stuff works. We kind of just know empirically stuff works.
Yeah, which parts of the model are responsible for what? We don't know the fundamental underlying bias of a given model architecture. What we observe is that there is always some sort of an error that is introduced by those architectures.
Exactly. And then we also have empirical errors from, for example, hardware. So we also have practical limitations of hardware with which we work. And those errors also exist.
Yes, certainly. So what we observe in simple theoretical models is that two main phenomena happen. The very first phenomenon is that it's really hard to approximate improbable events, in part because you don't encounter them very often. So you may discover that you're collecting more and more data, and a lot of this data looks very similar to what you already possess.
So you're not discovering too much information. But importantly, you're not discovering those infrequent data points. So those tail events, they kind of disappear. And then the other thing that happens is that the first time you made this error and underestimated your improbable events... When you fit the model on top of this, it's unlikely to recover from this taking place.
Events that were originally improbable become even more improbable for the subsequent model, and it kind of snowballs out of control until the whole thing just collapses to near-zero variance.
Exactly. And the thing is, you can theoretically describe this. It's actually very simple. And you can run these experiments however many times you want. And you'll discover that even if you have a lot of data, if you keep on repeating this process, you always end up in a state where your improbable events kind of disappear. And the rate at which this collapses, you can also bound.
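A minimal sketch of the kind of simple setup being described, assuming the "model" is just a Gaussian refit on its own samples each generation; the sample sizes and generation counts here are illustrative, not taken from the paper.

```python
# Each generation fits a Gaussian to samples drawn from the previous
# generation's fit. With small sample sizes, the estimated spread tends to
# drift toward zero over generations: the tails disappear first.
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0      # the "true" distribution seen by generation 0
n_samples = 20            # small samples make the effect visible quickly

for generation in range(201):
    samples = [random.gauss(mu, sigma) for _ in range(n_samples)]
    mu = statistics.fmean(samples)      # refit on purely synthetic samples
    sigma = statistics.stdev(samples)
    if generation % 40 == 0:
        print(f"generation {generation:3d}: sigma ~ {sigma:.4f}")
```

The estimated spread wanders downward generation after generation, which is the "near-zero variance" endpoint described above.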
In practice, when we grab large language models, we observe that they become more confident in the predictions that they are making. So basically, the improbable events here are going to be things that the model is not very confident about, and normally it would not make predictions about them.
So when you're trying to generate more data out of a language model in order for another language model to learn from it, over time it basically becomes more and more confident. And then, during generation, it very often gets stuck in these repetitive loops.
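A toy discrete analogue of this growing over-confidence, again just a sketch: re-estimate a token distribution from its own samples over and over, and rare tokens drop out while the probability mass piles onto a few survivors.

```python
# Start from a uniform distribution over tokens, then repeatedly sample from
# the current estimate and refit it from the counts. Low-probability tokens
# tend to vanish and the distribution concentrates on a few tokens.
import random
from collections import Counter

random.seed(0)
tokens = list("abcdefgh")
probs = {t: 1 / len(tokens) for t in tokens}   # generation 0: uniform
n_samples = 50

for _ in range(40):
    draws = random.choices(tokens, weights=[probs[t] for t in tokens], k=n_samples)
    counts = Counter(draws)
    probs = {t: counts[t] / n_samples for t in tokens}

print({t: round(p, 2) for t, p in probs.items() if p > 0})  # only a few survive
```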
...a person the same two sentences, and then the next person says the same two sentences, and it usually gets more and more garbled as it goes down the line. I think this is a comparison that kind of works. Yes. So this is the first thing: it's the improbable events. And then the second thing that happens is your models are going to produce errors, so misunderstandings of the underlying phenomenon, right? And as a result...
Large language models are statistical beasts that learn from examples of human-written text and learn to produce text that is similar to the text the model was taught on.
What you will see is that those errors start propagating as well. And they are relatively correlated: if all of your models are using the same architecture, then they're likely to be wrong in the same correlated ways. So whenever a model sees errors, it may amplify the same errors that it's observing.
Yeah, so approximations of approximations of approximations end up being very imprecise. As long as you can bound the errors of your approximations, it's okay, I guess. But yeah, in practice, because machine learning is very empirical, quite often we can't.
Yeah. So an important thing to say here is that the settings we talk about here are relatively hypothetical, in the sense that we are not in a world in which, you know, today we can build a model and tomorrow it disappears. That is not going to happen. We already have very good models, and the way forward is having even better models. And there is no doubt about this.
I mean, there are many different solutions. You'll find a lot of different papers exploring what the most effective mitigations are. And it's mostly data filtering of different kinds: basically making sure that the data that ends up being ingested by the models is representative of the underlying data distribution.
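As a rough sketch of what such a mitigation could look like in code (the names here are hypothetical, not any particular pipeline's API): keep every generation's training set anchored to a pool of known human-written data, so the ingested mix stays closer to the underlying distribution instead of being dominated by model output.

```python
# Hypothetical helper: build a training mix that always reserves a fixed
# share for human-written records, whatever volume of synthetic data exists.
import random

def build_training_mix(human_pool, synthetic_pool, size, human_fraction=0.5):
    """Sample `size` records, reserving roughly `human_fraction` for human data."""
    n_human = min(int(size * human_fraction), len(human_pool))
    n_synth = min(size - n_human, len(synthetic_pool))
    mix = random.sample(human_pool, n_human) + random.sample(synthetic_pool, n_synth)
    random.shuffle(mix)
    return mix

# e.g. build_training_mix(human_texts, generated_texts, size=10_000)
```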
And whenever we hit this limit, and we see that our model diverges into some sort of a training
trajectory that is making the model worse, I promise you people will stop training the models, retract back a couple of steps, maybe add additional data of a certain kind, and keep on training, right? Because we can always go back to previous models; nothing is stopping us. And then we can always spend more effort getting high-quality data, paying more people to create high-quality data.
Yeah, so model collapse is not going to magically kill the models tomorrow. We just need to change the way we build stuff. So this is not all doom and gloom. I am quite confident we'll solve this problem.
Thank you very much for having me. It was a pleasure. Thank you.
If you were today to sample data from the internet randomly, I'm sure you'd find that a bigger proportion of it is generated by machines. But this is not to say that the data itself is bad. The main question is how much of this data is generated...