Nathan Lambert
And that's Mixtral, right, Mistral's model? The model that really catapulted them to, like, oh my God, they're really, really good. OpenAI has also had models that are MoE, and so have all the other major closed labs. But what DeepSeek did, that maybe only the leading labs have only just recently started doing, is have such a high sparsity factor, right?
It's not one fourth of the model, right? It's not two out of eight experts activating every time you go through the model; it's eight out of 256.
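To make that arithmetic concrete, here's a quick Python sketch comparing the two sparsity ratios; the 2-of-8 and 8-of-256 figures are the ones mentioned above, and everything else is just arithmetic.

```python
# Sparsity comparison: what fraction of the expert pool a single token touches.

def active_fraction(active_experts: int, total_experts: int) -> float:
    return active_experts / total_experts

mixtral_style = active_fraction(2, 8)     # 2 of 8 experts   -> 0.25    (1/4)
deepseek_style = active_fraction(8, 256)  # 8 of 256 experts -> 0.03125 (1/32)

print(f"Mixtral-style:  {mixtral_style:.4f} active (sparsity factor {1 / mixtral_style:.0f}x)")
print(f"DeepSeek-style: {deepseek_style:.4f} active (sparsity factor {1 / deepseek_style:.0f}x)")
```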
Going back to the efficiency and complexity point, right? It's 32 versus four, right, for Mixtral and the other MoE models that have been publicly released. So this ratio is extremely high. And what Nathan was getting at there was, when you have such a different level of sparsity, you can't just have every GPU hold the entire model, right? The model's too big.
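As a rough illustration of why one GPU can't just hold the whole thing, here's a back-of-the-envelope sketch. The 671B total / 37B active parameter counts are DeepSeek-V3's published figures; the 80 GB is an assumption for an H100/A100-class accelerator.

```python
# Back-of-the-envelope: why the expert weights have to be sharded across GPUs.
TOTAL_PARAMS = 671e9     # all experts live in memory, even the ones sitting idle
ACTIVE_PARAMS = 37e9     # parameters actually used per token
BYTES_PER_PARAM = 2      # bf16/fp16 weights only, ignoring optimizer state
GPU_MEMORY_GB = 80       # assumed H100/A100-class accelerator

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
min_gpus = weights_gb / GPU_MEMORY_GB

print(f"Weights alone: ~{weights_gb:.0f} GB -> at least {min_gpus:.0f} GPUs "
      f"before activations, KV cache, or optimizer state.")
print(f"Per token, only {ACTIVE_PARAMS / TOTAL_PARAMS:.1%} of those weights do any work.")
```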
There's too much complexity there. So you have to split up the model with different types of parallelism, right? And so you might have different experts on different GPU nodes. But now what happens when the batch of data you get, hey, all of it looks one way and all of it should route to one part of the model, right?
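Here's a minimal single-process sketch of that dispatch step: a stand-in router picks experts for each token, and the assignments are grouped by the rank that (hypothetically) owns each expert. Real systems do this grouping with all-to-all collectives; the 256-expert / 32-rank layout is just an illustrative assumption.

```python
import numpy as np

# Minimal sketch of expert-parallel dispatch (single process, no real GPUs).
NUM_EXPERTS = 256
NUM_RANKS = 32                       # hypothetical expert-parallel group size
EXPERTS_PER_RANK = NUM_EXPERTS // NUM_RANKS
TOP_K = 8
NUM_TOKENS = 4096

rng = np.random.default_rng(0)
# Stand-in for the learned router: top-k expert ids per token.
expert_ids = rng.integers(0, NUM_EXPERTS, size=(NUM_TOKENS, TOP_K))

# Group assignments by the rank that owns the chosen expert.
owner_rank = expert_ids // EXPERTS_PER_RANK
tokens_per_rank = np.bincount(owner_rank.ravel(), minlength=NUM_RANKS)

print("token assignments per rank:", tokens_per_rank)
# In a real system this grouping feeds an all-to-all, so each rank receives
# exactly the tokens destined for the experts it holds.
```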
When all of it routes to one part of the model, you can overload a certain set of the GPUs, and then the rest of the training cluster sits idle because all of the tokens are just routing there. This is one of the biggest complexities with running a very sparse mixture-of-experts model.
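A quick way to see the failure mode is to compare uniform routing against a skewed batch where most tokens pick a few hot experts; the max-over-mean ratio below is one standard way to quantify the imbalance. It continues the hypothetical 32-rank setup from the previous sketch.

```python
import numpy as np

# How skewed routing starves most of the cluster (continuing the 32-rank setup).
NUM_RANKS = 32
NUM_ASSIGNMENTS = 4096 * 8  # tokens * top-k
rng = np.random.default_rng(0)

uniform = rng.integers(0, NUM_RANKS, size=NUM_ASSIGNMENTS)
# A skewed batch: three "hot" ranks soak up 80% of the traffic.
skewed = rng.choice(NUM_RANKS, size=NUM_ASSIGNMENTS,
                    p=[0.5, 0.2, 0.1] + [0.2 / 29] * 29)

for name, ranks in [("uniform", uniform), ("skewed", skewed)]:
    load = np.bincount(ranks, minlength=NUM_RANKS)
    # The busiest rank sets the step time; everyone else waits on it.
    print(f"{name:8s} max/mean load = {load.max() / load.mean():.1f}x")
```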
You know, the thing with this 32 ratio versus this four ratio is that you end up with so many of the experts just sitting there idle. So how do I load balance between them? How do I schedule the communications between them? This is a lot of the extremely low-level, detailed work that they figured out and published first, and were potentially second or third in the world to do, maybe even first in some cases.
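One common answer to the load-balancing question is an auxiliary loss on the router, in the style of Switch Transformer, sketched below. Note that DeepSeek-V3's paper actually describes an auxiliary-loss-free, bias-based balancing scheme, so treat this purely as an illustration of the general idea, not their exact method.

```python
import torch

# Illustration only: a Switch-Transformer-style auxiliary load-balancing loss,
# one common way to nudge the router toward spreading tokens over all experts.

def load_balancing_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """router_logits: [num_tokens, num_experts]; returns a scalar penalty."""
    num_experts = router_logits.shape[-1]
    probs = torch.softmax(router_logits, dim=-1)
    chosen = probs.topk(top_k, dim=-1).indices
    # f_i: fraction of routing slots actually dispatched to expert i.
    dispatch = torch.zeros_like(probs).scatter(-1, chosen, 1.0)
    f = dispatch.mean(dim=0) / top_k
    # p_i: mean router probability assigned to expert i.
    p = probs.mean(dim=0)
    # Minimized (~1.0) when traffic and probability are uniform across experts.
    return num_experts * torch.sum(f * p)

loss = load_balancing_loss(torch.randn(4096, 256), top_k=8)
print(f"auxiliary balance loss: {loss.item():.3f}")
```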
I think there is one aspect to note, though, right? Which is whether that ability transfers across different types of runs. You may write really, really high-quality code for one specific model architecture at one size, and then it's not transferable: hey, when I make this architecture tweak, everything's broken again, right?