Dylan Patel
It's going to have pages and pages of this. It's almost too much to actually read, but it's nice to skim as it's coming.
This is a potential digression, but a lot of people have found that these reasoning models can sometimes produce much more eloquent text. That is at least an interesting example, I think. Depending on how open-minded you are, you find language models interesting or not, and there's a spectrum there. Yeah.
Should we break down where it actually applies and go into the transformer? Is that useful? Let's go. Let's go into the transformer. So the transformer is a thing that is talked about a lot, and we will not cover every detail.
Essentially, the transformer is built on repeated blocks of this attention mechanism and then a traditional, dense, fully connected multilayer perceptron, whatever word you want to use for your normal neural network, and you alternate these blocks. There are other details.
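To make that concrete, here is a minimal sketch of one such block in PyTorch. The dimensions, the pre-norm layout, and the layer choices are illustrative assumptions, not any particular model's recipe:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One attention block followed by one MLP block: the pair that repeats."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # The dense, fully connected MLP: this is where most parameters live.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                 # residual connection around attention
        x = x + self.mlp(self.norm2(x))  # residual connection around the MLP
        return x

# The full model just alternates these blocks over and over.
model = nn.Sequential(*[TransformerBlock(512, 8, 2048) for _ in range(6)])
x = torch.randn(2, 16, 512)  # (batch, sequence, embedding)
print(model(x).shape)        # torch.Size([2, 16, 512])
```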
And where mixture of experts is applied is that dense model. The dense layers hold most of the weights if you count them in a transformer model, so you can get really big gains from mixture of experts in parameter efficiency at training and inference, because you get this efficiency by not activating all of these parameters.
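A minimal sketch of how that might look, assuming a standard top-k softmax router over the experts (real implementations add load-balancing losses, capacity limits, and fused kernels; everything here is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Drop-in replacement for the dense MLP: route each token to top-k experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # Flatten (batch, seq, d_model) into a list of tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        scores = F.softmax(self.router(tokens), dim=-1)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # (n_tokens, top_k)
        out = torch.zeros_like(tokens)
        # Only the chosen experts run for each token; the rest stay idle,
        # which is where the parameter-vs-compute saving comes from.
        for e, expert in enumerate(self.experts):
            mask = (chosen == e)
            token_idx = mask.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_idx.numel() == 0:
                continue
            w = (weights * mask)[token_idx].sum(dim=-1, keepdim=True)
            out[token_idx] += w * expert(tokens[token_idx])
        return out.reshape(x.shape)

layer = MoELayer(d_model=512, d_ff=2048, n_experts=8, top_k=2)
x = torch.randn(2, 16, 512)
print(layer(x).shape)  # torch.Size([2, 16, 512])
```

The total parameter count scales with the number of experts, but each token only pays the compute for its top-k experts, which is exactly the decoupling being described.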
Every different type of model has a different scaling law, which effectively says that for how much compute you put in, the architecture will get to a different level of performance on test tasks. And mixture of experts is one of the ones that helps at training time, even if you don't consider the inference benefits, which are also big.
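These laws are usually written as a power law in compute. A simplified, commonly used form, with the constants fit empirically per architecture, looks like:

```latex
% L_\infty is the irreducible loss; a and \alpha are architecture-dependent fits.
L(C) \approx L_\infty + a \, C^{-\alpha}
```

Under this framing, an architecture change like mixture of experts shows up as a shift in those constants, so the same loss is reached at a smaller C.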
At training time, your efficiency with your GPUs is dramatically improved by using this architecture if it is well implemented. So you can get a model with effectively the same performance and evaluation scores with numbers like 30% less compute. I think there's going to be a wide variation depending on your implementation details and stuff.
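As a back-of-the-envelope check on what a figure like 30% means under the power law above (the constants here are invented purely for illustration, not fits to real models):

```python
# Assumed power law: loss = L_inf + a * C**(-alpha); L_inf cancels when
# matching losses. All constants below are invented for illustration only.
alpha = 0.05
a_dense, a_moe = 6.00, 5.89  # hypothetical fits: MoE has a slightly better coefficient

# Matching losses: a_dense * C_dense**(-alpha) == a_moe * C_moe**(-alpha)
# => C_moe / C_dense = (a_moe / a_dense) ** (1 / alpha)
ratio = (a_moe / a_dense) ** (1 / alpha)
print(f"MoE reaches the same loss with {ratio:.0%} of the compute")  # ~69%, i.e. ~30% less
```

Because the exponent is small, even a modest improvement in the coefficient translates into a large compute saving, which is why implementation quality matters so much here.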
But it is just important to realize that this type of technical innovation is something that gives huge gains. And I expect most companies that are serving their models to move to this mixture-of-experts implementation. Historically, the reason why not everyone does it is the implementation complexity, especially when doing these big models.