But it is just important to realize that this type of technical innovation is something that gives huge efficiency gains. And I expect most companies that are serving their models to move to this mixture of experts implementation. Historically, the reason why not everyone might do it is the implementation complexity, especially when doing these big models.
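To put a rough number on those gains, here is a back-of-the-envelope sketch. It assumes the standard approximation that forward-pass compute per token is about two FLOPs per active parameter, and uses the commonly cited DeepSeek-V3 figures of roughly 671B total parameters with about 37B activated per token; treat those numbers as illustrative assumptions rather than anything stated in this conversation.

```latex
% Rough sketch: per-token forward-pass FLOPs scale with *active* parameters.
% 671B total / 37B active are the commonly cited DeepSeek-V3 figures, used
% here only as an illustrative assumption.
\[
\text{FLOPs}_{\text{dense}} \approx 2 \cdot 671\text{B}, \qquad
\text{FLOPs}_{\text{MoE}} \approx 2 \cdot 37\text{B}, \qquad
\frac{\text{FLOPs}_{\text{dense}}}{\text{FLOPs}_{\text{MoE}}}
  \approx \frac{671}{37} \approx 18\times .
\]
```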
So this is one of the things that DeepSeek gets credit for: they do this extremely well. They do mixture of experts extremely well. This architecture, what is called DeepSeekMoE, MoE being the shortened version of mixture of experts, is multiple papers old. This part of their training infrastructure is not new to these models alone.
And the same goes for what Dylan mentioned with multi-head latent attention. It's all about reducing memory usage during inference, and the same during training, by using some fancy low-rank approximation math.
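To make the low-rank idea more concrete, here is a minimal PyTorch sketch of compressing the key/value cache through a small shared latent, in the spirit of multi-head latent attention. The module name and dimensions are made up for the example; this is not DeepSeek's code.

```python
# Minimal sketch (not DeepSeek's implementation) of the low-rank idea behind
# multi-head latent attention: instead of caching full per-head keys/values,
# cache one small latent vector per token and expand it back when needed.
# Shapes and names here are illustrative assumptions.
import torch
import torch.nn as nn

class LowRankKVCache(nn.Module):
    def __init__(self, d_model=4096, n_heads=32, d_head=128, d_latent=512):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        # Down-projection: hidden state -> small shared latent (this is what gets cached).
        self.to_latent = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: latent -> per-head keys and values (recomputed, not cached).
        self.latent_to_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.latent_to_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def forward(self, hidden):                      # hidden: [batch, seq, d_model]
        latent = self.to_latent(hidden)             # [batch, seq, d_latent]  <- cached
        b, s, _ = latent.shape
        k = self.latent_to_k(latent).view(b, s, self.n_heads, self.d_head)
        v = self.latent_to_v(latent).view(b, s, self.n_heads, self.d_head)
        return latent, k, v

# Cache cost: d_latent floats per token vs. 2 * n_heads * d_head for plain
# multi-head attention (512 vs. 8192 with the numbers above, roughly 16x smaller).
```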
If you get into the details with this latent attention, it's one of those things I look at and say, okay, they're doing really complex implementations, because there are other parts of language models, such as the positional embeddings that are used to extend the context length. The common one that DeepSeek uses is rotary positional embeddings, which is called RoPE.
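For reference, the standard rotary embedding rotates each two-dimensional pair of query and key features by an angle proportional to the token's position; this is the rotation being described in the next paragraph. The formula below is the usual textbook form, not anything DeepSeek-specific.

```latex
% Standard RoPE: each pair of dimensions (x_{2i}, x_{2i+1}) of a query or key
% at position m is rotated by the angle m * theta_i.
\[
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix}
\cos(m\theta_i) & -\sin(m\theta_i) \\
\sin(m\theta_i) & \phantom{-}\cos(m\theta_i)
\end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad \theta_i = 10000^{-2i/d}.
\]
```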
And if you want to use RoPE with a normal MoE, it's kind of a sequential thing. You take two of the attention matrices and you rotate them by a complex-valued rotation, which is a matrix multiplication. With DeepSeek's MLA, with this new attention architecture, they need to do some clever things, because they're not set up the same, and it just makes the implementation complexity much higher.
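Here is a sketch of that complex-valued rotation step applied to the query and key tensors of ordinary multi-head attention. It is a generic RoPE implementation, not DeepSeek's code; with MLA the rotary part has to be handled separately from the compressed latent, which is part of where the extra implementation complexity comes from.

```python
# Generic RoPE applied as a complex rotation (equivalent to the 2x2 matrix
# multiplication above). Illustrative sketch; shapes are assumptions.
import torch

def apply_rope(x, positions, base=10000.0):
    """x: [batch, seq, n_heads, d_head] with d_head even; positions: [seq]."""
    d = x.shape[-1]
    # Per-pair rotation frequencies theta_i = base^(-2i/d).
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = positions[:, None].float() * theta[None, :]          # [seq, d/2]
    rot = torch.polar(torch.ones_like(angles), angles)            # e^{i * m * theta_i}
    # View consecutive feature pairs as complex numbers and rotate them.
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], d // 2, 2))
    x_rot = x_c * rot[None, :, None, :]                           # broadcast over batch/heads
    return torch.view_as_real(x_rot).reshape_as(x).type_as(x)

# Usage: rotate queries and keys (the "two attention matrices") before the dot product.
q = torch.randn(1, 16, 8, 64)
k = torch.randn(1, 16, 8, 64)
pos = torch.arange(16)
q, k = apply_rope(q, pos), apply_rope(k, pos)
```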
So they're managing all of these things. And these are probably the sort of things that OpenAI and these closed labs are doing. We don't know if they're doing the exact same techniques, but DeepSeek actually shared them with the world, which is really nice, to feel like this is the cutting edge of efficient language model training.
And there's different implementations for mixture of experts where you can have some of these experts that are always activated, which just looks like a small neural network, and then all the tokens go through that. And then they also go through some that are selected by this routing mechanism. And one of the
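The shared-plus-routed structure described above can be sketched in a few lines. This is a toy illustration under assumed sizes and an assumed top-k routing choice, not DeepSeek's actual implementation, which adds load balancing and other details.

```python
# Toy MoE layer: one shared expert that every token passes through, plus a
# router that picks top-k of the routed experts per token. Sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(d_model, d_ff):
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

class SharedPlusRoutedMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_routed=8, top_k=2):
        super().__init__()
        self.shared = mlp(d_model, d_ff)                        # always activated
        self.experts = nn.ModuleList(mlp(d_model, d_ff) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)              # routing mechanism
        self.top_k = top_k

    def forward(self, x):                                       # x: [tokens, d_model]
        scores = F.softmax(self.router(x), dim=-1)              # [tokens, n_routed]
        weights, idx = scores.topk(self.top_k, dim=-1)          # top-k experts per token
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e).any(dim=-1)                       # tokens that selected expert e
            if mask.any():
                w = (weights * (idx == e)).sum(dim=-1)[mask]    # gate weight for those tokens
                routed[mask] += w[:, None] * expert(x[mask])
        return self.shared(x) + routed                          # shared expert always applied

# Usage: y = SharedPlusRoutedMoE()(torch.randn(4, 512))
```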