Nathan Lambert
And that's Mixtral, right, Mistral's model? The model that really catapulted them to, like, oh my God, they're really, really good. OpenAI has also had models that are MoE, and so have all the other major closed labs. But what DeepSeek did, that maybe only the leading labs have only just recently started doing, is have such a high sparsity factor, right?
It's not one fourth of the model, right? It's not two out of eight experts activating every time you go through the model; it's eight out of 256.
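To make that arithmetic concrete, here's a quick Python sketch comparing the two sparsity ratios; the 2-of-8 and 8-of-256 figures are the ones mentioned above, and everything else is just arithmetic.

```python
# Sparsity comparison: what fraction of the expert pool a single token touches.

def active_fraction(active_experts: int, total_experts: int) -> float:
    return active_experts / total_experts

mixtral_style = active_fraction(2, 8)     # 2 of 8 experts   -> 0.25    (1/4)
deepseek_style = active_fraction(8, 256)  # 8 of 256 experts -> 0.03125 (1/32)

print(f"Mixtral-style:  {mixtral_style:.4f} active (sparsity factor {1 / mixtral_style:.0f}x)")
print(f"DeepSeek-style: {deepseek_style:.4f} active (sparsity factor {1 / deepseek_style:.0f}x)")
```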
Going back to the efficiency and complexity point, right? It's 32 versus four, right, for Mixtral and the other MoE models that have been publicly released. So this ratio is extremely high. And what Nathan was getting at there was, when you have such a different level of sparsity, you can't just have every GPU hold the entire model, right? The model's too big.
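As a rough illustration of why one GPU can't just hold the whole thing, here's a back-of-the-envelope sketch. The 671B total / 37B active parameter counts are DeepSeek-V3's published figures; the 80 GB is an assumption for an H100/A100-class accelerator.

```python
# Back-of-the-envelope: why the expert weights have to be sharded across GPUs.
TOTAL_PARAMS = 671e9     # all experts live in memory, even the ones sitting idle
ACTIVE_PARAMS = 37e9     # parameters actually used per token
BYTES_PER_PARAM = 2      # bf16/fp16 weights only, ignoring optimizer state
GPU_MEMORY_GB = 80       # assumed H100/A100-class accelerator

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
min_gpus = weights_gb / GPU_MEMORY_GB

print(f"Weights alone: ~{weights_gb:.0f} GB -> at least {min_gpus:.0f} GPUs "
      f"before activations, KV cache, or optimizer state.")
print(f"Per token, only {ACTIVE_PARAMS / TOTAL_PARAMS:.1%} of those weights do any work.")
```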
There's too much complexity there. So you have to split up the model with different types of parallelism, right? And so you might have different experts on different GPU nodes. But now what happens when the batch of data you get, hey, all of it looks one way and all of it should route to one part of the model, right?
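Here's a minimal single-process sketch of that dispatch step: a stand-in router picks experts for each token, and the assignments are grouped by the rank that (hypothetically) owns each expert. Real systems do this grouping with all-to-all collectives; the 256-expert / 32-rank layout is just an illustrative assumption.

```python
import numpy as np

# Minimal sketch of expert-parallel dispatch (single process, no real GPUs).
NUM_EXPERTS = 256
NUM_RANKS = 32                       # hypothetical expert-parallel group size
EXPERTS_PER_RANK = NUM_EXPERTS // NUM_RANKS
TOP_K = 8
NUM_TOKENS = 4096

rng = np.random.default_rng(0)
# Stand-in for the learned router: top-k expert ids per token.
expert_ids = rng.integers(0, NUM_EXPERTS, size=(NUM_TOKENS, TOP_K))

# Group assignments by the rank that owns the chosen expert.
owner_rank = expert_ids // EXPERTS_PER_RANK
tokens_per_rank = np.bincount(owner_rank.ravel(), minlength=NUM_RANKS)

print("token assignments per rank:", tokens_per_rank)
# In a real system this grouping feeds an all-to-all, so each rank receives
# exactly the tokens destined for the experts it holds.
```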
When all of it routes to one part of the model, you can overload a certain set of the GPUs, and then the rest of the training cluster sits idle because all of the tokens are just routing there. This is one of the biggest complexities with running a very sparse mixture-of-experts model.
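A quick way to see the failure mode is to compare uniform routing against a skewed batch where most tokens pick a few hot experts; the max-over-mean ratio below is one standard way to quantify the imbalance. It continues the hypothetical 32-rank setup from the previous sketch.

```python
import numpy as np

# How skewed routing starves most of the cluster (continuing the 32-rank setup).
NUM_RANKS = 32
NUM_ASSIGNMENTS = 4096 * 8  # tokens * top-k
rng = np.random.default_rng(0)

uniform = rng.integers(0, NUM_RANKS, size=NUM_ASSIGNMENTS)
# A skewed batch: three "hot" ranks soak up 80% of the traffic.
skewed = rng.choice(NUM_RANKS, size=NUM_ASSIGNMENTS,
                    p=[0.5, 0.2, 0.1] + [0.2 / 29] * 29)

for name, ranks in [("uniform", uniform), ("skewed", skewed)]:
    load = np.bincount(ranks, minlength=NUM_RANKS)
    # The busiest rank sets the step time; everyone else waits on it.
    print(f"{name:8s} max/mean load = {load.max() / load.mean():.1f}x")
```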
You know, the thing with this 32 ratio versus this four ratio is that you end up with so many of the experts just sitting there idle. So how do I load balance between them? How do I schedule the communications between them? This is a lot of the extremely low-level, detailed work that they figured out and published first, and were potentially second or third in the world to do, maybe even first in some cases.
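One common answer to the load-balancing question is an auxiliary loss on the router, in the style of Switch Transformer, sketched below. Note that DeepSeek-V3's paper actually describes an auxiliary-loss-free, bias-based balancing scheme, so treat this purely as an illustration of the general idea, not their exact method.

```python
import torch

# Illustration only: a Switch-Transformer-style auxiliary load-balancing loss,
# one common way to nudge the router toward spreading tokens over all experts.

def load_balancing_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """router_logits: [num_tokens, num_experts]; returns a scalar penalty."""
    num_experts = router_logits.shape[-1]
    probs = torch.softmax(router_logits, dim=-1)
    chosen = probs.topk(top_k, dim=-1).indices
    # f_i: fraction of routing slots actually dispatched to expert i.
    dispatch = torch.zeros_like(probs).scatter(-1, chosen, 1.0)
    f = dispatch.mean(dim=0) / top_k
    # p_i: mean router probability assigned to expert i.
    p = probs.mean(dim=0)
    # Minimized (~1.0) when traffic and probability are uniform across experts.
    return num_experts * torch.sum(f * p)

loss = load_balancing_loss(torch.randn(4096, 256), top_k=8)
print(f"auxiliary balance loss: {loss.item():.3f}")
```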
I think there is one aspect to note, though, right? Which is whether that ability transfers across different types of runs. You may write really, really high-quality code for one specific model architecture at one size, and then it's not transferable: hey, when I make this architecture tweak, everything's broken again, right?