Dwarkesh Podcast

The data black hole at the center of AI

19 Jun 2026

Transcription

Transcript generated automatically by AI and may contain errors.

Chapter 1: What is really driving AI progress?

0.031 - 13.205 Dwarkesh Patel

So one definition of intelligence is sample efficiency. That is to say, how much data do you need in a given domain to operate fluently and competently? And it's actually not clear that we've made that much progress in training sample efficiency over the last few years.

Chapter 2: How do human and AI sample efficiency compare?

13.745 - 28.465 Dwarkesh Patel

It seems like more so we've just dramatically widened and improved the data distribution. The main way that AIs have been getting better is from adding more and better data and scaling the compute required to develop that data in the first place. Obviously, RL is the main way that this has happened.

28.485 - 40.865 Dwarkesh Patel

You can think of RL as basically a kind of synthetic data generation where you dump a ton of compute against a verifier, or against a rubric if you have another LLM as a judge, and you do this in order to find out what the good data is in the first place.

40.885 - 47.976 Dwarkesh Patel

Then you train your model to predict these correct rollouts much in the same way that you might train that model to predict the next word in internet text.
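
As a rough illustration of the loop described here (sample many rollouts, keep the ones a verifier accepts, then train on the survivors), here is a minimal sketch. Everything in it is a toy assumption rather than any lab's actual pipeline: `sample_rollout` stands in for the model, `prior_correct` stands in for the model's prior probability of producing a correct solution, and `verifier` is a trivial checker.

```python
import random

def sample_rollout(rng, prior_correct):
    """Toy stand-in for a model: emits a correct rollout with probability prior_correct."""
    return "correct" if rng.random() < prior_correct else "wrong"

def verifier(rollout):
    """Hypothetical verifier: accepts only correct rollouts."""
    return rollout == "correct"

def collect_training_data(num_rollouts, prior_correct, seed=0):
    """Spend compute on rollouts and keep only the verified ones as training data."""
    rng = random.Random(seed)
    rollouts = [sample_rollout(rng, prior_correct) for _ in range(num_rollouts)]
    return [r for r in rollouts if verifier(r)]

if __name__ == "__main__":
    for prior in (0.0, 0.01, 0.1):
        kept = collect_training_data(num_rollouts=1000, prior_correct=prior)
        print(f"prior={prior}: kept {len(kept)} of 1000 rollouts")
```

In this toy setup, a prior of zero leaves nothing to train on no matter how many rollouts are sampled.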

Chapter 3: Why is sample efficiency important for AI development?

47.956 - 65.897 Dwarkesh Patel

For this process to work, the model must have at least some prior probability to anticipate the correct solution in the first place, which is why you need mind-stretching amounts of human expert trajectories in every single field and skill that you want the model to eventually be competent in. It's hard to overstate how task-specific and bespoke this human expert data is.

65.877 - 83.126 Dwarkesh Patel

If you want some intuition, I recommend checking out the job descriptions on Mercor's or Surge's websites. There are listings for Word specialists who will convert legacy documents into polished Word files, and legal experts who will write realistic M&A diligences or securities filings, and management consultants who will write up template market research.

83.727 - 95.731 Dwarkesh Patel

And it is not only that the data have to be so domain-specific, but there has to be so much of it. Each skill corresponds to at least hundreds of human experts who are generating example completions, writing rubrics, and explaining their chain of thought.

96.373 - 115.33 Dwarkesh Patel

There's a reason that the data industry that is producing these expert labels and the RL environments in which these meticulously cataloged skills can congeal is earning billions a year in revenue, soon to be deca-billions. Now imagine if it took a couple decades' worth of courses with hundreds of concurrent professors and millions of practice tasks for you to learn how to polish a Word file.

115.811 - 124.423 Dwarkesh Patel

Even the task count difference here understates the gap, because the models have to grind each of their far more numerous tasks far harder, whereas a human student might practice a textbook problem once or twice.

Chapter 4: What role does data play in AI training and performance?

124.884 - 137.222 Dwarkesh Patel

With GRPO, these models are generating hundreds to thousands of rollouts per task, and they need to solve the credit assignment problem. The correct way to think about these models is not like a human who has learned all these different skills that you see these models displaying.
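
For a sense of why GRPO needs many rollouts per task, here is a minimal sketch of its group-relative advantage: each rollout's reward is scored against the mean and standard deviation of the other rollouts sampled for the same task. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for illustration; real implementations differ in these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (one group per task)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # Four rollouts for one task, scored 1.0 if the verifier accepted them, else 0.0.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # If every rollout gets the same reward, no rollout stands out from the group.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

In the simplest outcome-reward setup, that single per-rollout number is shared by every token in the rollout, which is a coarse answer to the credit assignment problem.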

137.603 - 159.013 Dwarkesh Patel

It's more like a Frankenstein's monster which has been built out of a billion grafts of carefully constructed examples all sewn together. Epoch recently reported that open models lag state-of-the-art frontier models by four months. I think the reason it is relatively easy for open source and previous laggards to catch up to within months of the frontier is that data is the real driver of progress.

158.993 - 179.888 Dwarkesh Patel

And data can be easily distilled from public APIs, whereas hyperparameters and training tricks and architectural optimizations cannot. And if the latter were driving most of the progress, then catching up would be far harder than we are observing it to be. It is easy to forget how much data these models are trained on and how much more it is than what we humans see in our lifetimes.

179.868 - 196.468 Dwarkesh Patel

We see these AIs as a galaxy glittering with capabilities, but at their center, invisible to the naked eye, holding all the constellations together is an unimaginably massive black hole of data. Just a couple of points of comparison to help drive home how big this difference is. Here's one.

197.109 - 216.157 Dwarkesh Patel

If a person sees and hears on average, let's say generously, 2,000 words an hour, then between the time they're born and the time they're an adult, they'll see about 200 million tokens. Now, by contrast, these frontier models are trained on somewhere between tens and hundreds of trillions of tokens. That is close to a million-fold difference. Here's another point of comparison.
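
A back-of-the-envelope check of the token comparison above, as a small sketch. Only the 2,000 words an hour and the tens-to-hundreds-of-trillions range come from the transcript. The 18 years to adulthood, 16 waking hours a day, and one token per word are assumptions chosen for illustration.

```python
WORDS_PER_HOUR = 2_000       # from the transcript
YEARS_TO_ADULTHOOD = 18      # assumption
WAKING_HOURS_PER_DAY = 16    # assumption
TOKENS_PER_WORD = 1.0        # assumption: words and tokens treated as interchangeable

def lifetime_tokens(years=YEARS_TO_ADULTHOOD,
                    waking_hours=WAKING_HOURS_PER_DAY,
                    words_per_hour=WORDS_PER_HOUR,
                    tokens_per_word=TOKENS_PER_WORD):
    """Tokens a person sees and hears between birth and adulthood."""
    hours = years * 365 * waking_hours
    return hours * words_per_hour * tokens_per_word

def fold_difference(model_tokens, human_tokens):
    """How many times more tokens the model sees than the human."""
    return model_tokens / human_tokens

if __name__ == "__main__":
    human = lifetime_tokens()
    print(f"human: {human:,.0f} tokens")
    for model_tokens in (10e12, 100e12, 200e12):
        ratio = fold_difference(model_tokens, human)
        print(f"model at {model_tokens:.0e} tokens: {ratio:,.0f}x")
```

Under these assumptions the human figure comes out at roughly 210 million tokens, and the ratio approaches a million only toward the top of the hundreds-of-trillions range.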

216.778 - 235.608 Dwarkesh Patel

If you wanted to, you could learn to teleoperate any random humanoid or robot arm within hours. And if we could get AIs to learn just as fast, robotics would be a deca-trillion-dollar industry, and you'd have an endless army of Unitree G1s doing all kinds of useful work in the world. But the reason we can't do this is that our AIs learn much less efficiently than we do.

236.229 - 248.029 Dwarkesh Patel

And even with the millions of hours of demonstrations that we collected, this is not enough to allow them to perform complex open-ended tasks. And a final point of comparison, a teenager can learn to drive a car with about 20 hours of practice.

248.009 - 264.592 Dwarkesh Patel

And even if we include their 16 years of growing up and understanding how the world works and building physical intuition, that's still three to four orders of magnitude less data than Waymo and Tesla are using to train their self-driving car models. Now I want to deal with a couple of common responses and objections that people have to these kinds of comparisons.

264.612 - 284.097 Dwarkesh Patel

One thing people will say, and I think Karpathy said this when he came onto my podcast, is that for humans, many billions of years of evolution had to go into basically pre-training us. And so we're being unfair when we're comparing how little data we see within our lifetimes to what these cold-started LLMs, which are just starting off with a totally random initialization, have to learn from.
