Jonathan Ross
Podcast Appearances
And then it picks some small number, I'm forgetting which, maybe it's like eight of those or 16 of them, whatever it is. And so it only needs to do the compute for that. That means that you're getting to skip most of it, right? Sort of like your brain, like not every neuron in your brain fires when I say something to you about the stock market, right?
Like the neurons about, you know, playing football, right? Those don't kick off, right? That's the intuition there. Previously, it was famously reported that OpenAI's GPT-4 started off with something like 16 experts and they got it down to eight. I forget the numbers, but it started off larger and they shrunk it a little and they were smaller or whatever.
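To make the routing idea concrete, here is a minimal sketch of top-k mixture-of-experts routing in Python. The expert count, the top-k value, and the sizes are illustrative assumptions, not the actual GPT-4 or DeepSeek configurations.

```python
# Toy sketch of mixture-of-experts routing (illustrative only; the expert
# counts below are assumptions, not real model configs).
import numpy as np

rng = np.random.default_rng(0)

num_experts = 16   # total experts in the layer (assumed)
top_k = 2          # experts actually run per token (assumed)
d_model = 8        # toy hidden size

# Each "expert" is just a small weight matrix here.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts))  # gating weights

def moe_forward(x):
    """Route a single token vector x through only its top-k experts."""
    logits = x @ router                      # score every expert
    chosen = np.argsort(logits)[-top_k:]     # keep the k highest-scoring experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                 # softmax over the chosen experts
    # Only top_k of num_experts matrices are multiplied; the rest are skipped,
    # which is where the compute savings come from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (8,)
```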
And then what's happened with the DeepSeek model is they've gone the opposite way. They've gone to a very large number of experts. The more parameters you have, it's like having more neurons. It's easier to retain the information that comes in. And so by having more parameters, they're able to, on a smaller amount of data, get good.
However, because it's sparse, because it's a mixture of experts, they're not doing as much computation. And part of the cleverness was figuring out how they could have so many experts so it could be so sparse so they could skip so many of the parameters.
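A rough back-of-the-envelope sketch of the "many experts, few activated" point: total parameters grow with the number of experts, while per-token compute tracks only the activated ones. The numbers below are assumptions for illustration, not DeepSeek's real figures.

```python
# Why "more experts, fewer activated" keeps compute low while total
# parameters grow. All numbers are illustrative assumptions.

def moe_params(num_experts, active_experts, params_per_expert, shared_params):
    total = shared_params + num_experts * params_per_expert
    active = shared_params + active_experts * params_per_expert
    return total, active

# Coarse MoE: a few large experts, a couple used per token.
total_a, active_a = moe_params(num_experts=8, active_experts=2,
                               params_per_expert=10e9, shared_params=20e9)

# Fine-grained MoE: many small experts, only a handful routed per token.
total_b, active_b = moe_params(num_experts=256, active_experts=8,
                               params_per_expert=1e9, shared_params=20e9)

print(f"A: {total_a/1e9:.0f}B total, {active_a/1e9:.0f}B active per token")
print(f"B: {total_b/1e9:.0f}B total, {active_b/1e9:.0f}B active per token")
# B holds far more parameters (capacity to "retain" information) while the
# per-token compute, which tracks active parameters, stays modest.
```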
Their new 70B outperformed their 405. What was surprising to me, I thought they retrained it from scratch. It turns out, you read the paper and they talk about how they just fine-tuned. So they used a relatively small amount of data to make it much better. Again, this goes to the quality of the data. They have higher-quality data. They took their old model, trained it, and it got much better.
But that 70B, that new 70B outperforms their previous 405B. What you're going to see now is, now that everyone has seen this DeepSeek architecture, they're going to go, great, I have hundreds of thousands of GPUs. I'm now going to use a lot of them to create a lot of synthetic data. And then I'm going to train the bejesus out of this model.
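Structurally, the recipe being described is: use a large model (and lots of GPUs) to generate synthetic data, then fine-tune an existing smaller checkpoint on it. The sketch below is only an outline of that loop; `big_model`, `small_model`, and their methods are hypothetical placeholders, not a real API.

```python
# Outline of the "generate synthetic data, then fine-tune" loop described
# above. big_model, small_model, and their methods are hypothetical
# placeholders, not a real library API.

def synthesize_dataset(big_model, prompts):
    """Use a large (expensive) model to produce high-quality training pairs."""
    return [(p, big_model.generate(p)) for p in prompts]

def fine_tune(small_model, dataset, epochs=1):
    """Fine-tune an existing checkpoint on the pairs (no training from scratch)."""
    for _ in range(epochs):
        for prompt, target in dataset:
            small_model.training_step(prompt, target)
    return small_model

# The idea from the transcript: a relatively small amount of high-quality
# (here, synthetic) data applied to an existing checkpoint can lift a
# 70B-class model past an older, much larger one.
# synthetic = synthesize_dataset(big_model, prompts)
# improved_70b = fine_tune(existing_70b, synthetic)
```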
Because the other thing is, while it sort of asymptotes, the question is, on this curve, where do you stop? It depends on how many people you have doing inference. You can either make the model bigger, which makes it more expensive to run, and then you train it on less. Or you make it smaller, and it's cheaper to run, but you have to train it more. So DeepSeek didn't have a lot of users until recently.
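The train-more-versus-run-cheaper trade-off can be put into a tiny cost model: total cost is training cost plus per-query cost times query volume, so the break-even point shifts with how much inference you serve. The dollar figures below are made-up assumptions purely for illustration.

```python
# Back-of-the-envelope sketch of the trade-off: a bigger model can reach a
# given quality with less training but costs more per query; a smaller model
# costs more to train but is cheaper to serve. All numbers are made up.

def total_cost(train_cost, cost_per_query, num_queries):
    return train_cost + cost_per_query * num_queries

options = {
    # name: (training cost in $, inference cost in $/query)  -- assumptions
    "bigger model, trained less": (20e6, 0.004),
    "smaller model, trained more": (60e6, 0.001),
}

for num_queries in (1e9, 100e9):
    print(f"\nqueries served: {num_queries:.0e}")
    for name, (train, per_q) in options.items():
        print(f"  {name}: ${total_cost(train, per_q, num_queries)/1e6:,.0f}M")

# With few users, the cheaper-to-train (bigger) model wins; at large inference
# volume, the extra training spent on a smaller, cheaper-to-run model pays off.
```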