Nathan Lambert
Podcast Appearances
And the model will learn which expert to route to for different tasks. And so this is a humongous innovation in terms of, hey, I can continue to grow the total embedding space of parameters. And so DeepSeek's model is, you know, 600-something billion parameters, right? Relative to Llama 405B, which is 405 billion parameters, or Llama 70B, which is 70 billion parameters, right?
So this model technically has more embedding space for information, right? To compress all of the world's knowledge that's on the internet down. But at the same time, it is only activating around 37 billion of the parameters. So only 37 billion of these parameters actually need to be computed every single time you're training on data or running inference.
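To make the routing idea concrete, here is a minimal sketch of top-k expert gating in PyTorch. The sizes, the 8 routed experts, and the top-2 selection are illustrative assumptions, not DeepSeek's actual configuration; the point is just that the router learns which experts each token goes to, and only those experts' parameters are computed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal mixture-of-experts layer: a learned router picks k experts per token,
    so only a fraction of the total parameters is computed for any given token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):  # illustrative sizes
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                              # x: [tokens, d_model]
        scores = self.router(x)                        # [tokens, n_experts]
        weights, idx = torch.topk(F.softmax(scores, dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # run only the chosen experts
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

The total parameter count grows with the number of experts, but the per-token compute only scales with the k experts the router selects, which is the distinction the passage is drawing.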
And so versus, again, the Llama models, where all 70 billion or all 405 billion parameters must be activated, you've dramatically reduced your compute cost when you're doing training and inference with this mixture-of-experts architecture.
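As a rough back-of-the-envelope check of that saving, a common approximation is that a transformer's forward pass costs about 2 FLOPs per active parameter per token. The snippet below just plugs in the parameter counts mentioned above; it is an illustration, not a measured benchmark.

```python
# Rough forward-pass cost per token ≈ 2 * (active parameters), a standard approximation.
def forward_flops_per_token(active_params):
    return 2 * active_params

llama_405b_dense = forward_flops_per_token(405e9)   # every parameter is activated
deepseek_moe     = forward_flops_per_token(37e9)    # only ~37B of the 600B+ are activated

print(f"Dense 405B:       {llama_405b_dense:.2e} FLOPs/token")
print(f"MoE, 37B active:  {deepseek_moe:.2e} FLOPs/token")
print(f"Ratio:            {llama_405b_dense / deepseek_moe:.1f}x fewer FLOPs per token")
```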
Effectively, NVIDIA builds this library called NCCL, pronounced "nickel," right? In which, you know, when you're training a model, you have all these communications between every single layer of the model, and you may have over 100 layers. What does NCCL stand for? NVIDIA Collective Communications Library. Nice. And so...
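For reference, this is roughly how NCCL gets used in practice through PyTorch's distributed API; the launcher and environment details are assumptions here, and most training frameworks hide this behind higher-level wrappers.

```python
import torch
import torch.distributed as dist

# Each GPU runs one process; a launcher like torchrun sets RANK, WORLD_SIZE, etc.
dist.init_process_group(backend="nccl")   # NCCL handles the GPU-to-GPU collectives
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
```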
When you're training a model, you're going to have all these all-reduces and all-gathers. Between each layer, between the multi-layer perceptron or feed-forward network and the attention mechanism, you'll basically have the model synchronize: you'll have an all-reduce or an all-gather. And this is a communication between all the GPUs in the network, whether it's in training or inference.
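As a concrete illustration of those two collectives, here is a minimal sketch using torch.distributed on top of the NCCL backend. The tensor contents are made up, and real training code usually issues these calls implicitly through data- or tensor-parallel wrappers rather than by hand.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(backend="nccl") has already run on every GPU process.
rank = dist.get_rank()
world = dist.get_world_size()

# All-reduce: every GPU contributes a tensor, every GPU gets back the sum
# (e.g. summing gradients or partial activations across GPUs).
grad = torch.full((4,), float(rank), device="cuda")
dist.all_reduce(grad, op=dist.ReduceOp.SUM)

# All-gather: every GPU contributes a shard, every GPU gets back all shards
# (e.g. reassembling activations that were split across GPUs).
shard = torch.full((4,), float(rank), device="cuda")
gathered = [torch.empty_like(shard) for _ in range(world)]
dist.all_gather(gathered, shard)
```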
So NVIDIA has a standard library. This is one of the reasons why it's really difficult to use anyone else's hardware for training: because no one's really built a standard communications library. And NVIDIA has done this at sort of a higher level, right?
DeepSeek, because they have certain limitations around the GPUs that they have access to, the interconnects are limited to some extent by the restrictions on the GPUs that were shipped into China legally, not the ones that are smuggled, but the legally shipped ones that they used to train this model. They had to figure out how to get efficiencies.
And one of those things is that instead of just calling the NVIDIA library, NCCL, they instead scheduled their own communications, which some of the labs do. Meta talked about in Llama 3 how they made their own custom version of NCCL. They didn't talk about the implementation details. This is some of what they did.
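A hedged sketch of what "scheduling your own communications" can look like in practice: launching a collective asynchronously and overlapping it with computation rather than letting the library call block. This is a generic illustration of the comm/compute overlap idea, not DeepSeek's or Meta's actual implementation.

```python
import torch
import torch.distributed as dist

def overlapped_step(compute_fn, tensor_to_sync):
    """Launch the all-reduce asynchronously, do useful compute while the
    network transfer is in flight, then wait only when the result is needed."""
    work = dist.all_reduce(tensor_to_sync, op=dist.ReduceOp.SUM, async_op=True)
    result = compute_fn()      # compute that doesn't depend on the synced tensor
    work.wait()                # block only once the synced tensor is actually needed
    return result, tensor_to_sync
```

Hand-scheduling where these collectives fall relative to the compute lets you hide communication latency, which matters more when your interconnect bandwidth is constrained.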