Nathan Lambert
๐ค PersonAppearances Over Time
Podcast Appearances
And one of those things is that instead of just calling the NVIDIA library NCCL, they instead scheduled their own communications, which some of the labs do. Meta talked about in Llama 3 how they made their own custom version of NCCL. They didn't talk about the implementation details, but this is some of what they did.
Probably not as well as DeepSeek, maybe, because for DeepSeek, necessity is the mother of invention and they had to do this, whereas OpenAI, Anthropic, et cetera, have people who do this sort of stuff anyway.
But, you know, DeepSeek certainly did it publicly, and they may have done it even better, because they were gimped on a certain aspect of the chips they have access to. And so they scheduled communications by scheduling specific SMs. You can think of an SM as like a core on a GPU, right? There are a bit over a hundred of these cores, these SMs, on a GPU.
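As a quick aside on the numbers: this is a standard CUDA runtime query, and a minimal sketch of how you would check the SM count on whatever GPU you're running on (an H100 SXM, for example, reports 132 SMs).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: ask the CUDA runtime how many SMs (streaming
// multiprocessors) device 0 exposes. Error handling omitted for brevity.
int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // assumes at least one GPU is present
  std::printf("%s has %d SMs\n", prop.name, prop.multiProcessorCount);
  return 0;
}
```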
And they were specifically scheduling: hey, which ones are running the model, which ones are doing all-reduce, which ones are doing all-gather, right? And they would flip back and forth between them. And this requires extremely low-level programming. This is what NCCL does automatically, or other NVIDIA libraries handle this automatically, usually. Yeah, exactly.
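For contrast, here is a rough sketch of what the off-the-shelf path looks like: a single process driving an all-reduce across two GPUs through NCCL, which decides internally which SMs and channels carry the communication. The device count and buffer size are made up for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <nccl.h>

// Minimal sketch of a single-process, multi-GPU all-reduce through NCCL.
// The user never schedules SMs here; NCCL handles that internally.
int main() {
  const int nDev = 2;            // assumed: two GPUs visible to this process
  const size_t count = 1 << 20;  // elements per GPU, arbitrary for the sketch

  int devs[nDev] = {0, 1};
  ncclComm_t comms[nDev];
  float* sendbuf[nDev];
  float* recvbuf[nDev];
  cudaStream_t streams[nDev];

  // One communicator per device, all owned by this process.
  ncclCommInitAll(comms, nDev, devs);

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    cudaMalloc((void**)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Group the per-device calls so NCCL treats them as one collective.
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i) {
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }
  ncclGroupEnd();

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```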
And so technically they're using PTX, which you could think of as sort of an assembly-type language, or an instruction set, right? Like coding directly against the assembly instruction set. It's not exactly that, but it's still technically part of CUDA. But it's like, do I want to write in Python, the PyTorch equivalent, and call NVIDIA libraries?
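To give a flavor of what dropping down to PTX means, here is a tiny sketch of inline PTX embedded in a CUDA C++ kernel; it only reads the %smid special register to report which SM each block landed on. This is purely illustrative of the level of control involved, not DeepSeek's actual code, which is not public.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block reads the PTX special register %smid via inline assembly
// to find out which streaming multiprocessor it is running on.
__global__ void report_sm() {
  unsigned int smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));  // PTX inside CUDA C++
  if (threadIdx.x == 0) {
    printf("block %d is running on SM %u\n", blockIdx.x, smid);
  }
}

int main() {
  report_sm<<<8, 32>>>();  // 8 blocks of 32 threads, arbitrary for the sketch
  cudaDeviceSynchronize();
  return 0;
}
```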
Do I want to go down to the C level, right? Or code at an even lower level? Or do I want to go all the way down to the assembly or ISA level? And there are cases where you go all the way down there at the very big labs, but most companies just do not do that, right? Because it's a waste of time and the efficiency gains you get are not worth it.
But DeepSeek's implementation is so complex, right? Especially with their mixture of experts, right? People have done mixture of experts, but they're generally eight or sixteen experts, and they activate two. So one of the words we like to use is sparsity factor, right? Or usage. So you might have a sparsity factor of four, you know, one fourth of your model activated, right?
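To pin down the arithmetic behind sparsity factor: with E routed experts and k of them active per token, the activated fraction is k/E, so activating two of eight experts works out to one fourth of the expert parameters per token.

```latex
% Activated fraction of a mixture-of-experts layer:
% E routed experts, k active per token.
\[
  \text{activated fraction} = \frac{k}{E},
  \qquad
  \frac{2}{8} = \frac{1}{4}.
\]
```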