Dylan Patel
It's going to have pages and pages of this. It's almost too much to actually read, but it's nice to skim as it's coming.
This is a potential digression, but a lot of people have found that these reasoning models can sometimes produce much more eloquent text. That is at least an interesting example, I think. Depending on how open-minded you are, you find language models interesting or not, and there's a spectrum there. Yeah.
Should we break down where it actually applies and go into the transformer? Is that useful? Let's go. Let's go into the transformer. So the transformer is a thing that is talked about a lot, and we will not cover every detail.
Essentially, the transformer is built on repeated blocks of this attention mechanism and then a traditional, dense, fully connected multilayer perceptron, whatever word you want to use for your normal neural network, and you alternate these blocks. There are other details.
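To make that concrete, here is a minimal sketch of one such block in PyTorch. The dimensions, the pre-norm layout, and the layer choices are illustrative assumptions, not any particular model's recipe:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One attention block followed by one MLP block: the pair that repeats."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # The dense, fully connected MLP: this is where most parameters live.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                 # residual connection around attention
        x = x + self.mlp(self.norm2(x))  # residual connection around the MLP
        return x

# The full model just alternates these blocks over and over.
model = nn.Sequential(*[TransformerBlock(512, 8, 2048) for _ in range(6)])
x = torch.randn(2, 16, 512)  # (batch, sequence, embedding)
print(model(x).shape)        # torch.Size([2, 16, 512])
```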
And where mixture of experts is applied is that dense model. The dense layers hold most of the weights if you count them in a transformer model, so you can get really big gains from mixture of experts in parameter efficiency at training and inference, because you get this efficiency by not activating all of these parameters.
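A minimal sketch of how that might look, assuming a standard top-k softmax router over the experts (real implementations add load-balancing losses, capacity limits, and fused kernels; everything here is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Drop-in replacement for the dense MLP: route each token to top-k experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # Flatten (batch, seq, d_model) into a list of tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        scores = F.softmax(self.router(tokens), dim=-1)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # (n_tokens, top_k)
        out = torch.zeros_like(tokens)
        # Only the chosen experts run for each token; the rest stay idle,
        # which is where the parameter-vs-compute saving comes from.
        for e, expert in enumerate(self.experts):
            mask = (chosen == e)
            token_idx = mask.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_idx.numel() == 0:
                continue
            w = (weights * mask)[token_idx].sum(dim=-1, keepdim=True)
            out[token_idx] += w * expert(tokens[token_idx])
        return out.reshape(x.shape)

layer = MoELayer(d_model=512, d_ff=2048, n_experts=8, top_k=2)
x = torch.randn(2, 16, 512)
print(layer(x).shape)  # torch.Size([2, 16, 512])
```

The total parameter count scales with the number of experts, but each token only pays the compute for its top-k experts, which is exactly the decoupling being described.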
Every different type of model has a different scaling law, which effectively says that for how much compute you put in, the architecture will get to a different level of performance on test tasks. And mixture of experts is one of the ones that helps at training time, even if you don't consider the inference benefits, which are also big.
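These laws are usually written as a power law in compute. A simplified, commonly used form, with the constants fit empirically per architecture, looks like:

```latex
% L_\infty is the irreducible loss; a and \alpha are architecture-dependent fits.
L(C) \approx L_\infty + a \, C^{-\alpha}
```

Under this framing, an architecture change like mixture of experts shows up as a shift in those constants, so the same loss is reached at a smaller C.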
At training time, your efficiency with your GPUs is dramatically improved by using this architecture if it is well implemented. So you can get a model with effectively the same performance and evaluation scores with numbers like 30% less compute. I think there's going to be a wide variation depending on your implementation details and stuff.
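As a back-of-the-envelope check on what a figure like 30% means under the power law above (the constants here are invented purely for illustration, not fits to real models):

```python
# Assumed power law: loss = L_inf + a * C**(-alpha); L_inf cancels when
# matching losses. All constants below are invented for illustration only.
alpha = 0.05
a_dense, a_moe = 6.00, 5.89  # hypothetical fits: MoE has a slightly better coefficient

# Matching losses: a_dense * C_dense**(-alpha) == a_moe * C_moe**(-alpha)
# => C_moe / C_dense = (a_moe / a_dense) ** (1 / alpha)
ratio = (a_moe / a_dense) ** (1 / alpha)
print(f"MoE reaches the same loss with {ratio:.0%} of the compute")  # ~69%, i.e. ~30% less
```

Because the exponent is small, even a modest improvement in the coefficient translates into a large compute saving, which is why implementation quality matters so much here.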
But it is just important to realize that this type of technical innovation is something that gives huge gains. And I expect most companies that are serving their models to move to this mixture-of-experts implementation. Historically, the reason why not everyone does it is the implementation complexity, especially when doing these big models.