Jonathan Ross
Podcast Appearances
Today, there's this wonderful business selling mainframes with a pretty juicy margin because no one seems to want to enter that business. Training is a niche market with very high margins. And when I say niche, it's still going to be worth hundreds of billions a year. But inference is the larger market. And...
I don't know that NVIDIA will ever see it this way, but I do think that those of us focusing on inference and building stuff specifically for that are probably the best thing that's ever happened for NVIDIA stock, because we'll take on the low-margin, high-volume inference so that NVIDIA can keep its margins nice and high.
No. We raised some money in late 2024, and in that fundraise we still had to explain to people why inference was going to be a larger business than training. Remember, this was our thesis when we started eight years ago. So for me, I struggle with why people think that training is going to be bigger. It just doesn't make sense.
Training is where you create the model. Inference is where you use the model. You want to become a heart surgeon, you spend years training, and then you spend more years practicing. Practicing is inference.
What you're going to see is everyone else starting to use this MoE approach. Now, there's another thing that happens here.
Yeah, so MoE stands for mixture of experts. When you use Llama 70B, you actually use every single parameter in that model. When you use Mixtral 8x7B, you use two of the eight roughly 7B experts, so it's much smaller. And effectively, while it doesn't correlate exactly, it correlates very closely: the number of parameters you're using effectively tells you how much compute you're performing.
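A minimal back-of-the-envelope sketch of the point being made here, using the approximate figures from the quote (a 70B dense Llama versus Mixtral 8x7B with two of eight ~7B experts routed per token). Shared attention layers are ignored, so the numbers are illustrative only.

```python
# Per-token compute roughly tracks the parameters that are actually active.
# A dense model touches all of them; an MoE only runs the routed experts.
# Figures are the approximate ones mentioned in the quote.

def active_params_dense(total_params: float) -> float:
    """A dense model uses every parameter on every token."""
    return total_params

def active_params_moe(params_per_expert: float, experts_per_token: int,
                      shared_params: float = 0.0) -> float:
    """An MoE runs only the routed experts (plus any shared layers) per token."""
    return experts_per_token * params_per_expert + shared_params

llama_active   = active_params_dense(70e9)                          # ~70B per token
mixtral_active = active_params_moe(params_per_expert=7e9,
                                   experts_per_token=2)              # ~14B per token (ignoring shared layers)

print(f"Dense Llama 70B : ~{llama_active / 1e9:.0f}B active params/token")
print(f"Mixtral 8x7B    : ~{mixtral_active / 1e9:.0f}B active params/token")
```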
Now, let's take the R1 model. I believe it's about 671 billion parameters versus 70 billion for Llama. And there's a 405 billion dense model as well, right? But let's focus on 70 versus 671. I believe there are 256 experts, each of which is somewhere around 2 billion parameters.
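Another rough sketch using the figures quoted above. The quote doesn't say how many experts R1 routes per token; the value of 8 below is an assumption (the commonly reported figure for that architecture), and shared experts and attention layers are ignored, so the real active count is somewhat higher.

```python
# Back-of-the-envelope: total parameters vs. parameters active per token for R1,
# compared to a 70B dense Llama. Figures from the quote; routing count assumed.

total_r1       = 671e9
num_experts    = 256
per_expert     = total_r1 / num_experts      # ~2.6B, close to the "around 2 billion" figure
routed_per_tok = 8                           # assumption, not stated in the quote

active_r1   = routed_per_tok * per_expert    # ~21B routed params per token (shared layers ignored)
dense_llama = 70e9

print(f"Params per expert    : ~{per_expert / 1e9:.1f}B")
print(f"Active per token     : ~{active_r1 / 1e9:.0f}B (vs {dense_llama / 1e9:.0f}B for dense Llama 70B)")
print(f"Total-to-active ratio: ~{total_r1 / active_r1:.0f}x")
```

The point of the arithmetic: even though R1's total parameter count is almost ten times Llama 70B's, the compute spent per token is governed by the handful of experts that actually fire, which is why the MoE approach is so attractive for inference.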