
Dylan Patel

👤 Speaker
3551 total appearances

Podcast Appearances

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

But it is just important to realize that this type of technical innovation is something that gives huge gains. And I expect most companies that are serving their models to move to this mixture of experts implementation. Historically, the reason why not everyone might do it is because of the implementation complexity, especially when doing these big models.
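
To make the "huge gains" concrete with some hypothetical numbers (illustrative only, not any real model's figures): in a mixture of experts, only the experts the router selects actually run for each token, so compute per token scales with the active parameters rather than the total.

```python
# Illustrative parameter counts, not any real model's configuration.
n_experts = 64            # routed experts per MoE layer
top_k = 4                 # experts actually run per token
params_per_expert = 50e6  # parameters in one expert MLP
shared_params = 2e9       # attention, embeddings, etc. (always active)

total  = shared_params + n_experts * params_per_expert
active = shared_params + top_k * params_per_expert

print(f"total parameters : {total / 1e9:.1f}B")   # 5.2B stored
print(f"active per token : {active / 1e9:.1f}B")  # 2.2B actually computed
```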

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

So this is one of the things that DeepSeek gets credit for: they do this extremely well. They do mixture of experts extremely well. This architecture, what is called DeepSeek MoE (MoE is the shortened version of mixture of experts), is multiple papers old. This part of their training infrastructure is not new to these models alone.

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

And the same goes for what Dylan mentioned with multi-head latent attention. It's all about reducing memory usage during inference, and the same during training, by using some fancy low-rank approximation math.
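
A rough sketch of the low-rank idea being described here (the generic compression trick, with made-up sizes; not DeepSeek's exact MLA equations): instead of caching full per-head keys and values, you cache a much smaller latent vector per token and reconstruct keys and values from it, which shrinks the KV cache at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_heads, d_head = 256, 32, 8, 32   # hypothetical sizes

# Down-project the hidden state to a small latent, then up-project per head.
W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent)

def compress(h):
    """h: (seq, d_model) -> small latent that is all we cache per token."""
    return h @ W_down

def expand(latent):
    """Reconstruct per-head keys and values from the cached latent."""
    k = (latent @ W_up_k).reshape(-1, n_heads, d_head)
    v = (latent @ W_up_v).reshape(-1, n_heads, d_head)
    return k, v

h = rng.normal(size=(10, d_model))
cache = compress(h)          # kept around during generation
k, v = expand(cache)
# 32 numbers cached per token instead of 2 * 8 * 32 = 512 for full K and V.
print(cache.shape, k.shape, v.shape)
```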

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

If you get into the details with this latent attention, it's one of those things I look at and say, okay, they're doing really complex implementations, because there are other parts of language models, such as embeddings, that are used to extend the context length. The common one that DeepSeek uses is rotary positional embeddings, which is called RoPE.
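
For reference, this is standard textbook RoPE in minimal form (nothing DeepSeek-specific): each consecutive pair of dimensions in a query or key vector is rotated by an angle that grows with the token's position, so dot products between queries and keys end up depending on relative position.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional embeddings. x: (seq_len, d) with d even."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]              # (seq, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)      # one frequency per dim pair
    angles = pos * freqs                           # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = x[:, 0::2], x[:, 1::2]                # split dims into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(0).normal(size=(6, 8))
print(rope(q).shape)                               # (6, 8)
```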

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

And if you want to use RoPE with a normal MoE, it's kind of a sequential thing. You take two of the attention matrices and you rotate them by a complex-valued rotation, which is a matrix multiplication. With DeepSeek's MLA, with this new attention architecture, they need to do some clever things because they're not set up the same, and it just makes the implementation complexity much higher.
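
The "complex-valued rotation" can be written literally: treat each (even, odd) pair of dimensions as one complex number and multiply it by e^(i*angle). This is the same textbook RoPE as above, just in complex form; the MLA-specific workaround DeepSeek uses is not shown here.

```python
import numpy as np

def rope_complex(x, base=10000.0):
    """Same rotation as standard RoPE, expressed as complex multiplication."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]
    freqs = base ** (-np.arange(0, d, 2) / d)
    rot = np.exp(1j * pos * freqs)             # e^{i*theta}, shape (seq, d/2)

    z = x[:, 0::2] + 1j * x[:, 1::2]           # pack each dim pair as a complex number
    z = z * rot                                # rotate every pair by its angle
    out = np.empty_like(x)
    out[:, 0::2], out[:, 1::2] = z.real, z.imag
    return out

q = np.random.default_rng(1).normal(size=(6, 8))
# A rotation preserves vector length, so norms are unchanged.
print(np.allclose(np.linalg.norm(q, axis=1),
                  np.linalg.norm(rope_complex(q), axis=1)))   # True
```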

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

So they're managing all of these things. And these are probably the sort of things that OpenAI and these closed labs are doing. We don't know if they're doing the exact same techniques, but DeepSeek actually shared them with the world, which is really nice; it makes it feel like this is the cutting edge of efficient language model training.

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

And there are different implementations for mixture of experts where you can have... some of these experts that are always activated, which just looks like a small neural network, and then all the tokens go through that. And then they also go through some that are selected by this routing mechanism. And one of the
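
A sketch of the "always activated" plus routed structure being described here (the shapes and single shared expert are illustrative assumptions, not DeepSeek's actual layout): every token passes through a shared expert, and additionally through the top-k experts a router picks for it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, n_routed, top_k = 16, 32, 8, 2       # illustrative sizes

def make_expert():
    """A tiny two-layer MLP expert."""
    return rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))

shared = make_expert()                         # always active for every token
routed = [make_expert() for _ in range(n_routed)]
W_router = rng.normal(size=(d, n_routed))

def expert_forward(x, expert):
    W1, W2 = expert
    return np.maximum(x @ W1, 0.0) @ W2        # ReLU MLP

def moe_forward(x):
    """x: (tokens, d). Shared expert plus top-k routed experts per token."""
    y = expert_forward(x, shared)              # shared path: every token
    logits = x @ W_router
    top = np.argsort(logits, axis=-1)[:, -top_k:]
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)
    for t in range(x.shape[0]):                # loop for clarity, not speed
        for j, e in enumerate(top[t]):
            y[t] += gates[t, j] * expert_forward(x[t], routed[e])
    return y

print(moe_forward(rng.normal(size=(4, d))).shape)   # (4, 16)
```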
