
Nathan Lambert

👤 Person
1665 total appearances

Appearances Over Time

Podcast Appearances

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

And that's what Mistral's Mixtral model was, right? The model that really catapulted them to, like, oh my God, they're really, really good. OpenAI has also had models that are MoE, and so have all the other major closed labs. But what DeepSeek did, that maybe only the leading labs have just recently started doing, is have such a high sparsity factor, right?
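For context, a mixture-of-experts (MoE) layer routes each token through only a few of its many expert sub-networks, and the "sparsity factor" is roughly how small that active fraction is. A minimal PyTorch-style sketch of the idea, using the 8-of-256 numbers discussed later in the episode and toy dimensions; this is not any lab's actual code:

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Toy sparse MoE layer: each token is processed by only top_k of n_experts."""

    def __init__(self, d_model=64, n_experts=256, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)        # scores every expert for every token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: [n_tokens, d_model]
        scores = self.router(x).softmax(dim=-1)             # [n_tokens, n_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)      # only top_k experts fire per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e                    # tokens whose slot-th pick is expert e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(16, 64)
print(ToyMoE()(tokens).shape)   # torch.Size([16, 64]); only 8 of the 256 experts ran per token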

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

It's not one-fourth of the model, right, two out of eight experts activating every time you go through the model; it's eight out of 256.
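The arithmetic behind that comparison, spelled out as a quick sanity check using the expert counts stated above:

```python
# 2-of-8 experts (Mixtral-style) versus 8-of-256 experts (as described above)
mixtral_like = 2 / 8        # 0.25    -> a quarter of the experts fire per token
deepseek_like = 8 / 256     # 0.03125 -> ~3% of the experts fire per token
print(f"2/8   active fraction: {mixtral_like:.3f}")
print(f"8/256 active fraction: {deepseek_like:.5f}")
print(f"total-to-active expert ratio: {8 // 2} vs {256 // 8}")   # 4 vs 32
```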

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

Going back to, sort of, the efficiency and complexity point, right? It's 32 versus four, right, for, like, Mixtral and other MoE models that have been publicly released. So this ratio is extremely high. And sort of what Nathan was getting at there was, when you have such a different level of sparsity, you can't just have every GPU hold the entire model, right? The model's too big.
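A back-of-the-envelope sketch of why the full model can't sit on every GPU: memory scales with all 256 experts even though only eight run per token. The hidden sizes below are placeholder assumptions, not DeepSeek's real dimensions:

```python
# Hypothetical per-layer sizes purely for illustration; not DeepSeek's real dimensions.
d_model, d_ff = 4096, 2048
n_experts, n_active = 256, 8

params_per_expert = 2 * d_model * d_ff          # up-projection + down-projection weights
stored = n_experts * params_per_expert          # every expert must live somewhere in memory
used = n_active * params_per_expert             # only these parameters compute per token

print(f"stored per MoE layer: {stored / 1e9:.1f}B params, "
      f"used per token: {used / 1e9:.2f}B (ratio {stored // used}x)")
```

Multiply that stored count by the number of MoE layers in a full model and it quickly outgrows a single GPU's memory, which is what forces the model to be split across devices.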

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

There's too much complexity there. So you have to split up the model with different types of parallelism, right? And so you might have different experts on different GPU nodes. But now what happens when this set of data that you get, hey, all of it looks one way and all of it should route to one part of my model, right?
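A rough sketch of the dispatch step being described: once the router picks an expert for each token, tokens are grouped by the GPU that hosts that expert and exchanged over the network (an all-to-all in practice). The placement rule and data here are illustrative assumptions:

```python
from collections import defaultdict
import random

N_EXPERTS, N_GPUS = 256, 32

def owner(expert: int) -> int:
    """Which GPU hosts which expert (simple round-robin placement, assumed here)."""
    return expert % N_GPUS

def dispatch(expert_choice_per_token):
    """Group token indices by the GPU that owns their assigned expert."""
    outgoing = defaultdict(list)
    for tok, expert in enumerate(expert_choice_per_token):
        outgoing[owner(expert)].append(tok)
    return outgoing

# With well-mixed data, traffic spreads roughly evenly across the 32 GPUs.
choices = [random.randrange(N_EXPERTS) for _ in range(1_000)]
print({gpu: len(toks) for gpu, toks in sorted(dispatch(choices).items())})
```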

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

When all of it routes to one part of the model, then you can have this overloading of a certain set of the GPU resources or a certain set of the GPUs, and then the rest of the training network sits idle because all of the tokens are just routing to that. This is one of the biggest complexities with running a very sparse mixture of experts model.
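A tiny made-up illustration of that failure mode: if a batch's tokens collapse onto one expert, the GPU hosting it is swamped while the rest of the cluster effectively idles:

```python
from collections import Counter

N_EXPERTS, N_GPUS = 256, 32

# Pathological batch: 95% of 10,000 tokens route to expert 7; the rest spread evenly.
assignments = [7] * 9_500 + [e % N_EXPERTS for e in range(500)]
tokens_per_gpu = Counter(expert % N_GPUS for expert in assignments)

loads = sorted(tokens_per_gpu.values(), reverse=True)
print("tokens on the busiest GPU:", loads[0])                 # ~9,500: overloaded
print("tokens on a typical GPU:  ", loads[len(loads) // 2])   # ~16: effectively idle
```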

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

You know, with this 32 ratio versus this four ratio, you end up with so many of the experts just sitting there idle. So how do I load balance between them? How do I schedule the communications between them? This is a lot of the, like, extremely low-level, detailed work that they figured out in public first, and potentially, like, second or third in the world, and maybe even first in some cases.
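One standard mitigation, shown here only to illustrate the general idea (this is in the style of the Switch Transformer auxiliary loss, not a claim about DeepSeek's exact recipe), is to add a load-balancing term that penalizes the router when token traffic concentrates on a few experts:

```python
import torch

def load_balance_loss(router_probs: torch.Tensor, expert_idx: torch.Tensor, n_experts: int):
    """router_probs: [tokens, n_experts] softmax outputs; expert_idx: [tokens] top-1 picks (int64)."""
    # f_i: fraction of tokens actually dispatched to expert i
    dispatch_frac = torch.bincount(expert_idx, minlength=n_experts).float() / expert_idx.numel()
    # P_i: average router probability mass assigned to expert i
    prob_frac = router_probs.mean(dim=0)
    # Minimized when both distributions are uniform, i.e. every expert carries 1/n of the load.
    return n_experts * torch.sum(dispatch_frac * prob_frac)
```

The term is smallest when both the dispatched-token fraction and the router's average probability mass are spread evenly across experts, which keeps all GPUs busy rather than letting most of them sit idle.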

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

I think there is one aspect to note, though, right? It's the general ability for that to transfer across different types of runs. You may make really, really high-quality code for one specific model architecture at one size, and then that is not transferable to, hey, when I make this architecture tweak, everything's broken again, right?
