One of the innovations in DeepSeek's architecture is that they changed the routing mechanism in mixture-of-experts models. There's something called an auxiliary loss, which effectively means that during training you want to make sure all of these experts are used across the tasks the model sees. Where mixture of experts can fail is this:
When you're doing this training, the one objective is token prediction accuracy, and if you just let training go with a mixture-of-experts model on its own, the model can learn to use only a subset of the experts. In the MoE literature, there's something called the auxiliary loss, which helps balance them.
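As a concrete illustration, here is a minimal sketch of the standard auxiliary load-balancing loss from the MoE literature (a Switch-Transformer-style formulation, not DeepSeek's code); it is smallest when tokens are spread evenly across experts:

```python
# Minimal sketch of an auxiliary load-balancing loss for MoE routing
# (Switch-Transformer-style; the exact formulation varies by paper).
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """router_logits: [num_tokens, num_experts] raw affinity scores from the router."""
    probs = torch.softmax(router_logits, dim=-1)   # soft routing probabilities
    top1 = probs.argmax(dim=-1)                    # expert actually chosen per token
    # f_i: fraction of tokens dispatched to expert i
    dispatch_frac = F.one_hot(top1, num_experts).float().mean(dim=0)
    # P_i: mean routing probability assigned to expert i
    prob_frac = probs.mean(dim=0)
    # Reaches its minimum when both are uniform, i.e. experts are used evenly
    return num_experts * torch.sum(dispatch_frac * prob_frac)
```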
But if you think about the loss functions of deep learning, this even connects to the idea that you want to have the minimum inductive bias in your model, to let the model learn maximally. And this auxiliary loss, this balancing across experts, can be seen as being in tension with the prediction accuracy of the tokens.
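To make that tension concrete, a common formulation (not necessarily the one any particular lab uses) simply adds the balancing term to the language-modeling loss with a small coefficient, so part of the training signal goes to keeping experts balanced rather than purely to predicting the next token:

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \alpha \, \mathcal{L}_{\text{balance}}, \qquad \alpha \ll 1
$$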
So we don't know the exact extent of the impact of the DeepSeek MoE change, which is that instead of using an auxiliary loss, they have an extra parameter in their routing that they update after each batch to make sure the following batches all have similar usage of experts. This type of change can be big or it can be small, but these things add up over time.
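A rough sketch of what that could look like, based only on the description above (the sign-based update rule, the step size, and the gating details are assumptions here, not DeepSeek's exact implementation):

```python
import torch

class BiasBalancedRouter:
    """Auxiliary-loss-free balancing sketch: a per-expert bias shifts the scores
    used for expert *selection*, and is nudged after each batch so under-used
    experts are more likely to be picked in the next batches."""

    def __init__(self, num_experts: int, top_k: int, gamma: float = 1e-3):
        self.num_experts = num_experts
        self.top_k = top_k
        self.gamma = gamma                       # bias update step size (assumed value)
        self.bias = torch.zeros(num_experts)

    def route(self, scores: torch.Tensor):
        """scores: [num_tokens, num_experts] affinities from the router."""
        # The bias only influences which experts get selected...
        topk_idx = (scores + self.bias).topk(self.top_k, dim=-1).indices
        # ...while the gating weights still come from the unbiased scores.
        gate = torch.gather(scores, -1, topk_idx).softmax(dim=-1)

        # After the batch: lower the bias of overloaded experts and raise it for
        # under-loaded ones, so subsequent batches use the experts more evenly.
        load = torch.bincount(topk_idx.flatten(), minlength=self.num_experts).float()
        self.bias -= self.gamma * torch.sign(load - load.mean())
        return topk_idx, gate
```

The intent of this kind of design is that the bias only changes which experts get picked, not how their outputs are weighted, so balancing no longer competes directly with the token-prediction loss.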
And this is the sort of thing that just points to them innovating. I'm sure all the labs that are training big MoEs are looking at this sort of thing, which is getting away from the auxiliary loss; some of them might already use it, but you just keep accumulating gains. And we'll talk about... the philosophy of training and how you organize these organizations.
And a lot of it is just compounding small improvements over time in your data, in your architecture, in your post-training, and in how they integrate with each other. DeepSeek does the same thing, and some of those improvements are shared. We have to take it at face value that they share their most important details; I mean, the architecture and the weights are out there, so we're seeing what they're doing.
And it adds up.