Nathan Lambert
I can explain that. So today, if you use a model, like you look at an API, OpenAI charges a certain price per million tokens, right? And that price for input and output tokens is different, right? And the reason is that when you're inputting a query into the model, right? Let's say you have a book, right? That book, you must now calculate the entire KV cache for it, right? This key value cache.
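As a minimal sketch of what that KV cache is, here is a toy single attention layer in Python; the dimensions, weights, and names are made up purely for illustration:

```python
import numpy as np

# Toy illustration of building a KV cache during prefill for ONE attention layer.
# All dimensions and weights here are arbitrary; a real model has many layers and heads.
d_model = 8           # hidden size
prompt_len = 5        # number of prompt tokens (think: the book)
rng = np.random.default_rng(0)

W_k = rng.standard_normal((d_model, d_model))   # key projection weights
W_v = rng.standard_normal((d_model, d_model))   # value projection weights
X = rng.standard_normal((prompt_len, d_model))  # hidden states for every prompt token

# Prefill: project all prompt tokens to keys and values at once, and keep them
# around so later decoding steps can attend to them without recomputing.
kv_cache = {"K": X @ W_k, "V": X @ W_v}          # each is (prompt_len, d_model)
print(kv_cache["K"].shape, kv_cache["V"].shape)  # (5, 8) (5, 8)
```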
And so when you do that, that is a parallel operation. All of the tokens can be processed at one time, and therefore you can dramatically reduce how much you're spending, right? The FLOP requirements for generating a token and for an input token are identical, right? If I input one token or if I generate one token, it's completely identical. I have to go through the model.
But the difference is that I can do that input, i.e. the pre-fill, i.e. the prompt, simultaneously, in a batched way. And therefore, it is all FLOPs; it's compute-bound.
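As a rough back-of-the-envelope, using the common ~2 x parameter-count FLOPs-per-token rule of thumb; the model size and prompt length below are hypothetical:

```python
# Back-of-the-envelope FLOP counts (hypothetical model size and prompt length).
# The ~2 * parameters per token rule of thumb applies to input and output tokens alike.
n_params = 70e9                       # e.g. a 70B-parameter dense model (assumed)
flops_per_token = 2 * n_params        # roughly the same whether the token is prompt or generated

prompt_tokens = 20_000
prefill_flops = prompt_tokens * flops_per_token   # done in one big parallel pass
decode_flops = flops_per_token                    # paid once per sequential decode step

print(f"prefill: {prefill_flops:.1e} FLOPs, all at once")
print(f"decode:  {decode_flops:.1e} FLOPs per token, one step at a time")
```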
Correct. But then output tokens, the reason why they're so expensive is because I can't do it in parallel, right? It's autoregressive. Every time I generate a token, I must not only read the entire model into memory and run it to generate the next token, I also have to read the entire KV cache,
and I generate a token, and I append that one token I generated and its KV cache, and then I do it again, right? And so, therefore, this is a non-parallel operation, whereas in the case of pre-fill, or prompt, you pull the whole model in once and you calculate 20,000 tokens at once, right?
i.e. however many tokens are being generated or are in the prompt, right? So if I put in a book, that's a million tokens, right? But, you know, if I put in "the sky is blue", then that's like six tokens or whatever.
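Continuing the toy prefill sketch from above (same assumed names: np, rng, d_model, W_k, W_v, kv_cache), the decode side might look roughly like this; attention and the rest of the forward pass are elided so the focus stays on the cache being read and grown one token at a time:

```python
# Toy decode loop, continuing the prefill sketch above. Each step reads the whole
# (growing) cache, produces one token, and appends that token's key/value to it.
def decode_step(x_new, kv_cache, W_k, W_v):
    """x_new: (1, d_model) hidden state for the single token being generated."""
    K, V = kv_cache["K"], kv_cache["V"]   # the ENTIRE cache is read every step
    # ... attention over K, V plus the rest of the forward pass would go here ...
    kv_cache["K"] = np.concatenate([K, x_new @ W_k], axis=0)  # append this token's key
    kv_cache["V"] = np.concatenate([V, x_new @ W_v], axis=0)  # append this token's value
    return kv_cache

for _ in range(3):                             # generate 3 tokens, strictly one at a time
    x_new = rng.standard_normal((1, d_model))  # stand-in for the new token's hidden state
    kv_cache = decode_step(x_new, kv_cache, W_k, W_v)

print(kv_cache["K"].shape)                     # (8, 8): 5 prompt tokens + 3 generated
```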
It's mostly output tokens. So before, you know, three months ago, when o1 launched, all of the use cases for long context length were like: let me put a ton of documents in and then get an answer out, right? And it's a single, you know, pre-fill: compute a lot in parallel and then output a little bit. Now, with reasoning and agents, this is a very different idea, right?
Now, instead, I might only have like, hey, do this task, or I might have all these documents. But at the end of the day, the model is not just producing a little bit, right? It's producing tons, tons of information. This chain of thought just continues to go and go and go and go.
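To make that concrete, a toy cost comparison; the per-token prices and token counts below are placeholders, not any provider's actual rates:

```python
# Toy request-cost comparison with placeholder prices (not real provider rates).
price_in = 3.00 / 1_000_000       # $ per input token (hypothetical)
price_out = 15.00 / 1_000_000     # $ per output token (hypothetical)

def request_cost(input_tokens, output_tokens):
    return input_tokens * price_in + output_tokens * price_out

# Old long-context pattern: huge prefill, short answer out.
doc_qa = request_cost(input_tokens=200_000, output_tokens=500)
# Reasoning / agent pattern: modest prompt, very long chain of thought out.
agent = request_cost(input_tokens=5_000, output_tokens=50_000)

print(f"document Q&A:    ${doc_qa:.2f}")   # dominated by the cheaper input side
print(f"reasoning agent: ${agent:.2f}")    # dominated by the pricier output side
```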