Jonathan Ross
And they'll have some of their own data, and that'll make them subtly better at one thing or another. But they're largely all the same. The more GPUs, the better the model, because you can train on more tokens. It's the scaling law. This model was supposedly trained on a smaller number of GPUs and a much, much tighter budget.
I think the way that it's been put is less than the salary of many of the executives at Meta, and that's not true. There's an element of marketing involved in the DeepSeek release. It is true that they trained the model for approximately $6 million of GPU time, right? They claim 2,000
GPUs for, I think it was, 60 days, which, by the way, also don't forget, was about the same amount of GPU time as the original Llama 70B, I believe: 4,000 GPUs for 30 days. Now, more recently, Meta has been training on more GPUs, but Meta hasn't been using data as good as DeepSeek's, because DeepSeek was doing reinforcement learning using OpenAI.
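A quick back-of-the-envelope check of that comparison, using the figures as recalled above (approximate numbers, not independently verified here):

```python
# GPU-time comparison using the numbers quoted in the conversation.
deepseek_gpu_days = 2_000 * 60  # claimed DeepSeek run: 2,000 GPUs for ~60 days
llama_gpu_days = 4_000 * 30     # original Llama 70B run as recalled: 4,000 GPUs for ~30 days
print(deepseek_gpu_days, llama_gpu_days)  # both come to 120,000 GPU-days
```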
Yes, exactly.
It's a little bit like speaking to someone who's smarter and getting tutored by someone who's smarter. You actually do better than if you're speaking to someone who's not as knowledgeable about the area or who's giving you wrong answers. First of all, before we get into any of this, I need to start with the scaling laws. These are like the physics of LLMs.
And there's a particular curve with tokens, which are sort of the syllables of an LLM; they don't match up exactly with human syllables, but kind of. The more tokens that you train on, the better the model gets. But there are these asymptotic, diminishing returns where it starts trailing off.
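For a concrete sense of what a token is, here's a small illustration using the open-source tiktoken library (an aside added for clarity; the library and tokenizer name are not part of the conversation):

```python
# Illustrative only: split a sentence into tokens with OpenAI's tiktoken library.
# Tokens are subword chunks, not human syllables, as noted above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common GPT tokenizer
text = "Scaling laws are like the physics of LLMs."
ids = enc.encode(text)
print(len(ids), "tokens:", [enc.decode([i]) for i in ids])
```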
The thing about the scaling law that everyone forgets (and that's why everyone was talking about how it's the end of the scaling law, we're out of data on the internet, there's nothing left): what most people don't realize is that it assumes the data quality is uniform. If the data quality is better, then you can actually get away with training on fewer tokens.
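For reference, one published form of the scaling law, the Chinchilla loss fit from Hoffmann et al. (2022), makes both points visible; this formula is an added aside, not something quoted in the conversation:

$$
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

Here $N$ is the parameter count, $D$ is the number of training tokens, and $E, A, B, \alpha, \beta$ are constants fitted to training runs. As $D$ grows, the $B/D^{\beta}$ term shrinks toward zero, which is the trailing-off; and the constants are fitted on a fixed data distribution, which is where the uniform-quality assumption comes in.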
So going back to my background, one of the fun things that I got to witness, I wasn't directly involved, was AlphaGo. Google beat the world champion, Lee Sedol, in Go. That model was trained on a bunch of existing games. But later on, they created a new one called AlphaGo Zero, which was trained on no existing games. It just played against itself. So how do you play against yourself and win?
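As a rough sketch of the self-play idea (a toy tic-tac-toe example with a random stand-in policy, not the actual AlphaGo Zero algorithm, which uses a neural network plus Monte Carlo tree search): one policy plays both sides, and every position gets labeled with the eventual outcome, producing training data with no human games at all.

```python
# Toy self-play data generation: one policy plays both sides of tic-tac-toe,
# and each position is labeled with the final result from the mover's perspective.
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_policy(board, player):
    # Stand-in for a learned policy network: pick any legal move uniformly at random.
    return random.choice([i for i, cell in enumerate(board) if cell == " "])

def self_play_game(policy):
    board, player, history = [" "] * 9, "X", []
    while winner(board) is None and " " in board:
        move = policy(board, player)
        history.append(("".join(board), player, move))
        board[move] = player
        player = "O" if player == "X" else "X"
    result = winner(board)  # None means a draw
    # Label: +1 if the side to move eventually won, -1 if it lost, 0 for a draw.
    return [(state, move, 0 if result is None else (1 if mover == result else -1))
            for state, mover, move in history]

if __name__ == "__main__":
    data = [ex for _ in range(1000) for ex in self_play_game(random_policy)]
    print(len(data), "training examples generated purely from self-play")
```

In the real system the policy is repeatedly retrained on this kind of self-generated data and then used to play better games, so it can bootstrap past the level of any fixed set of human games.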