
Nathan Lambert

👤 Speaker
1665 total appearances

Appearances Over Time

Podcast Appearances

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

I can explain that. So today, if you use a model, like you look at an API, OpenAI charges a certain price per million tokens, right? And that price for input and output tokens is different, right? And the reason is that when you're inputting a query into the model, right? Let's say you have a book, right? That book, you must now calculate the entire KV cache for it, right? This key-value cache.
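
A minimal numpy sketch of what that key-value cache holds: during prefill, every input token's key and value projections are computed once and stored. All shapes, names, and values below are illustrative assumptions, not any particular inference library's API.

```python
import numpy as np

# Toy single-layer, single-head dimensions -- illustrative only.
d_model = 64
rng = np.random.default_rng(0)
W_k = rng.standard_normal((d_model, d_model)) * 0.02  # key projection
W_v = rng.standard_normal((d_model, d_model)) * 0.02  # value projection

def prefill_kv_cache(input_embeddings):
    """Compute keys/values for every input token once and keep them.

    input_embeddings: (seq_len, d_model), e.g. an entire book's tokens.
    Returns (K, V), each (seq_len, d_model): the KV cache.
    """
    K = input_embeddings @ W_k  # one matmul covers all input tokens
    V = input_embeddings @ W_v
    return K, V

book = rng.standard_normal((20_000, d_model))  # stand-in for a long prompt
K_cache, V_cache = prefill_kv_cache(book)
print(K_cache.shape, V_cache.shape)  # (20000, 64) (20000, 64)
```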

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

And so when you do that, that is a parallel operation. All of the tokens can be processed at one time. And therefore, you can dramatically reduce how much you're spending, right? The FLOP requirements for generating a token and an input token are identical, right? If I input one token or if I generate one token, it's completely identical. I have to go through the model.
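
The identical-per-token-FLOPs, very-different-parallelism point can be seen in a toy numpy comparison; the single matrix stands in for "the model", and all shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d))          # stand-in for "the model"
tokens = rng.standard_normal((1000, d))  # 1000 prompt tokens

# Prefill: one batched matmul processes all tokens at once.
out_batched = tokens @ W

# Decode-style: exactly the same math, one token at a time.
out_serial = np.stack([tokens[i] @ W for i in range(len(tokens))])

# Same results, same total FLOPs (~2*N*d*d either way). The batched
# version reads W once for all 1000 tokens; the serial loop conceptually
# re-reads the weights for every single token.
assert np.allclose(out_batched, out_serial)
```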

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

But the difference is that I can do that input, i.e. the pre-fill, i.e. the prompt, simultaneously in a batch nature. And therefore, it is all FLOPs.
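
A back-of-envelope sketch of why batched prefill ends up compute-bound (all FLOPs) while a single-token pass does not. The model size and byte counts are assumed round numbers, not measurements:

```python
# Assumed: a 70B-parameter dense model with 1-byte (int8) weights.
params = 70e9
weight_bytes = params * 1
flops_per_token = 2 * params  # ~2 FLOPs per parameter per token

for name, n_tokens in [("prefill (20k-token prompt)", 20_000),
                       ("decode (1 token)", 1)]:
    flops = flops_per_token * n_tokens
    # Weights are read once per forward pass regardless of batch size.
    intensity = flops / weight_bytes  # FLOPs per byte of weights moved
    print(f"{name}: {intensity:,.0f} FLOPs/byte")

# prefill (20k-token prompt): 40,000 FLOPs/byte -> compute-bound
# decode (1 token): 2 FLOPs/byte -> memory-bandwidth-bound
```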

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

Correct. But then output tokens, the reason why it's so expensive is because I can't do it in parallel, right? It's autoregressive. Every time I generate a token, I must not only read the whole entire model into memory and activate it, calculate it to generate the next token, I also have to read the entire KV cache.
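
Rough numbers for what "read the entire model and the entire KV cache for every output token" costs in memory traffic. Every figure below is an illustrative assumption (model size, precision, cache layout):

```python
# Assumed: 70B int8 weights; KV cache in fp16 with 80 layers,
# 8 KV heads of dimension 128 (K and V each) -- illustrative shapes.
weight_bytes = 70e9
kv_bytes_per_token = 80 * 2 * 8 * 128 * 2  # ~328 KB per token of context

for context in (1_000, 100_000):
    total = weight_bytes + context * kv_bytes_per_token
    print(f"{context:>7,} tokens of context -> "
          f"~{total / 1e9:.0f} GB read per generated token")

#   1,000 tokens of context -> ~70 GB read per generated token
# 100,000 tokens of context -> ~103 GB read per generated token
```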

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

and I generate a token, and I append that KV, that one token I generated, and its KV cache, and then I do it again, right? And so therefore, this is a non-parallel operation. And this is one where you have to, you know, in the case of pre-fill or prompt, you pull the whole model in and you calculate 20,000 tokens at once, right?
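
Continuing the toy numpy sketch from above: a single-head decode loop in which each generated token's key/value pair is appended to the cache, and the next step attends over the entire, growing cache. Strictly serial, and everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W_k = rng.standard_normal((d, d)) * 0.02
W_v = rng.standard_normal((d, d)) * 0.02

def decode_step(x, K_cache, V_cache):
    """One autoregressive step: extend the cache by one token, attend over all of it."""
    k, v = x @ W_k, x @ W_v
    K_cache = np.vstack([K_cache, k])    # append this token's key...
    V_cache = np.vstack([V_cache, v])    # ...and its value
    scores = (K_cache @ x) / np.sqrt(d)  # reads the ENTIRE cache
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ V_cache              # weighted sum over all cached values
    return out, K_cache, V_cache

K, V = np.zeros((0, d)), np.zeros((0, d))
x = rng.standard_normal(d)
for _ in range(5):  # cannot be parallelized: step t needs step t-1's output
    x, K, V = decode_step(x, K, V)
print(K.shape)  # (5, 64): the cache grew by one row per generated token
```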

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

i.e. how many tokens are being generated/prompted, right? So if I put in a book, that's a million tokens, right? But, you know, if I put in, you know, the sky is blue, then that's like six tokens or whatever.
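
For a concrete sense of those counts, the open-source tiktoken library can tokenize the examples; exact numbers depend on the tokenizer, so the quote's "six tokens or whatever" is the right level of precision:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# A handful of tokens; the exact count depends on the encoding.
print(len(enc.encode("the sky is blue")))

# A long book at roughly 0.75 words per token works out to hundreds of
# thousands of tokens -- the "put in a book" end of the spectrum.
words_in_long_book = 500_000
print(int(words_in_long_book / 0.75))  # ~666,666 tokens (rough estimate)
```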

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

It's mostly output tokens. So before, you know, three months ago, when o1 launched, all of the use cases for long context length were like, let me put a ton of documents in and then get an answer out, right? And it's a single, you know, pre-fill, compute a lot in parallel, and then output a little bit. Now, with reasoning and agents, this is a very different idea, right?
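
The shift matters for cost because output tokens are priced several times above input tokens. A toy comparison with assumed prices, not any provider's actual rates:

```python
PRICE_IN = 2.50 / 1e6    # $ per input token -- assumed rate
PRICE_OUT = 10.00 / 1e6  # $ per output token -- assumed rate

def request_cost(tokens_in, tokens_out):
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

# Old long-context pattern: giant prompt, short answer (prefill-heavy).
print(f"${request_cost(200_000, 500):.2f}")   # ~$0.51, mostly input cost

# Reasoning/agent pattern: short prompt, long chain of thought (decode-heavy).
print(f"${request_cost(1_000, 50_000):.2f}")  # ~$0.50, mostly output cost
```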

Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

Now, instead, I might only have like, hey, do this task, or I might have all these documents. But at the end of the day, the model is not just like producing a little bit, right? It's producing tons, tons of information; this chain of thought just continues to go and go and go and go.