Andrej Karpathy:
And so there's no equivalent of any of this.
This is all research.
There are some very subtle reasons, reasons I think are hard to understand, why it's not trivial.
So let me just describe one.
Why can't we just synthetically generate data and train on it?
Well, take any synthetic example: if I just have the model synthetically generate its thinking about a book, you look at it and you're like, this looks great.
Why can't I train on it?
Well, you could try, but the model will actually get much worse if you continue trying.
And that's because all of the samples you get from models are silently collapsed.
They're silently collapsed in a way that's not obvious if you look at any individual example: the samples occupy a very tiny manifold of the possible space of thoughts about the content.
So the LLMs, when they come out, are what we call collapsed.
They have a collapsed data distribution.
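To make that concrete, here is a toy sketch of the underlying statistics (my illustration, not an example from the talk): treat the "model" as a one-dimensional Gaussian, and at each generation refit it to a finite sample drawn from the previous generation's model. The fitted spread shrinks in expectation every round, so the distribution quietly narrows even though each individual sample looks perfectly plausible.

```python
# Toy model-collapse simulation: recursively "train" a model on its own
# samples. The model here is just a Gaussian fit by mean and std; each
# generation draws a finite synthetic dataset from the current model and
# refits. The fitted std shrinks in expectation each round, so over many
# generations the distribution tends to collapse toward a point.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0  # generation 0: the "human" data distribution
for gen in range(1, 101):
    samples = rng.normal(mu, sigma, size=25)   # synthetic data from current model
    mu, sigma = samples.mean(), samples.std()  # fit the next model to it
    if gen % 20 == 0:
        print(f"gen {gen:3d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
# No single generation looks broken, but entropy is lost every round;
# that one-directional loss is the "silent collapse".
```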
One easy way to see it is to go to ChatGPT and ask it, tell me a joke.
It only has like three jokes.
It's not giving you the whole breadth of possible jokes; it effectively knows like three jokes.
They're silently collapsed.
So basically, you're not getting the richness and the diversity and the entropy from these models that you would get from humans.
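A minimal sketch of how you could quantify that yourself (the metrics and code are my illustration, not from the talk): collect, say, 100 responses to the same prompt at temperature 1.0 and measure how many distinct outputs you get and how much empirical entropy the output distribution carries.

```python
import math
from collections import Counter

def diversity_report(samples: list[str]) -> None:
    """Summarize how collapsed a set of repeated samples from one prompt is."""
    n = len(samples)
    counts = Counter(s.strip().lower() for s in samples)
    # Empirical entropy of the output distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Distinct-2: unique bigrams as a fraction of all bigram occurrences.
    bigrams = Counter()
    for s in samples:
        toks = s.split()
        bigrams.update(zip(toks, toks[1:]))
    distinct2 = len(bigrams) / max(sum(bigrams.values()), 1)
    print(f"{len(counts)} distinct outputs out of {n} samples")
    print(f"empirical entropy: {entropy:.2f} bits (max {math.log2(n):.2f})")
    print(f"distinct-2 ratio: {distinct2:.3f}")
```

A collapsed model concentrates its probability mass on a handful of near-duplicate outputs, so the measured entropy sits far below the maximum; a hundred humans asked for a joke would land much closer to a hundred distinct jokes.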
So humans are a lot noisier, but at least they're not biased, in the statistical sense.