One of the innovations in DeepSeek's architecture is that they changed the routing mechanism in mixture-of-experts models. There's something called an auxiliary loss, which effectively means that during training you want to make sure all of these experts are used across the tasks the model sees. Where mixture of experts can fail is this:
When you're doing this training, the one objective is token prediction accuracy, and if you just let training go with a mixture-of-experts model on its own, the model can learn to use only a subset of the experts. In the MoE literature, there's something called the auxiliary loss, which helps balance them.
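As a concrete illustration, here is a minimal sketch of the standard auxiliary load-balancing loss from the MoE literature (a Switch-Transformer-style formulation, not DeepSeek's code); it is smallest when tokens are spread evenly across experts:

```python
# Minimal sketch of an auxiliary load-balancing loss for MoE routing
# (Switch-Transformer-style; the exact formulation varies by paper).
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """router_logits: [num_tokens, num_experts] raw affinity scores from the router."""
    probs = torch.softmax(router_logits, dim=-1)   # soft routing probabilities
    top1 = probs.argmax(dim=-1)                    # expert actually chosen per token
    # f_i: fraction of tokens dispatched to expert i
    dispatch_frac = F.one_hot(top1, num_experts).float().mean(dim=0)
    # P_i: mean routing probability assigned to expert i
    prob_frac = probs.mean(dim=0)
    # Reaches its minimum when both are uniform, i.e. experts are used evenly
    return num_experts * torch.sum(dispatch_frac * prob_frac)
```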
But if you think about the loss functions of deep learning, this even connects to the idea that you want to have the minimum inductive bias in your model, to let the model learn maximally. And this auxiliary loss, this balancing across experts, can be seen as being in tension with the prediction accuracy of the tokens.
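To make that tension concrete, a common formulation (not necessarily the one any particular lab uses) simply adds the balancing term to the language-modeling loss with a small coefficient, so part of the training signal goes to keeping experts balanced rather than purely to predicting the next token:

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \alpha \, \mathcal{L}_{\text{balance}}, \qquad \alpha \ll 1
$$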
So we don't know the exact extent of the impact of the DeepSeek MoE change, which is that instead of using an auxiliary loss, they have an extra parameter in their routing that they update after each batch to make sure the following batches all have similar usage of experts. This type of change can be big or it can be small, but these things add up over time.
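A rough sketch of what that could look like, based only on the description above (the sign-based update rule, the step size, and the gating details are assumptions here, not DeepSeek's exact implementation):

```python
import torch

class BiasBalancedRouter:
    """Auxiliary-loss-free balancing sketch: a per-expert bias shifts the scores
    used for expert *selection*, and is nudged after each batch so under-used
    experts are more likely to be picked in the next batches."""

    def __init__(self, num_experts: int, top_k: int, gamma: float = 1e-3):
        self.num_experts = num_experts
        self.top_k = top_k
        self.gamma = gamma                       # bias update step size (assumed value)
        self.bias = torch.zeros(num_experts)

    def route(self, scores: torch.Tensor):
        """scores: [num_tokens, num_experts] affinities from the router."""
        # The bias only influences which experts get selected...
        topk_idx = (scores + self.bias).topk(self.top_k, dim=-1).indices
        # ...while the gating weights still come from the unbiased scores.
        gate = torch.gather(scores, -1, topk_idx).softmax(dim=-1)

        # After the batch: lower the bias of overloaded experts and raise it for
        # under-loaded ones, so subsequent batches use the experts more evenly.
        load = torch.bincount(topk_idx.flatten(), minlength=self.num_experts).float()
        self.bias -= self.gamma * torch.sign(load - load.mean())
        return topk_idx, gate
```

The intent of this kind of design is that the bias only changes which experts get picked, not how their outputs are weighted, so balancing no longer competes directly with the token-prediction loss.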
And this is the sort of thing that just points to them innovating. I'm sure all the labs that are training big MoEs are looking at this sort of thing, which is getting away from the auxiliary loss; some of them might already use it, but you just keep accumulating gains. And we'll talk about... the philosophy of training and how you organize these organizations.
And a lot of it is just compounding small improvements over time in your data, in your architecture, in your post-training, and in how they integrate with each other. DeepSeek does the same thing, and some of those improvements are shared. We have to take it at face value that they share their most important details; I mean, the architecture and the weights are out there, so we're seeing what they're doing.
And it adds up.