Dylan Patel
But the math is very nice so that when you're doing this repeatedly with this KV cache, this key-value matrix operation, you can keep appending the new values to it. So you keep track of the previous values you're inferring over in this autoregressive chain, and you keep it in memory the whole time. And this is a really crucial thing to manage when you're serving inference at scale.
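To make that concrete, here is a minimal single-head sketch in numpy of one decode step with a KV cache. The shapes and function names are illustrative assumptions, not any particular framework's API:

```python
import numpy as np

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """One autoregressive step with a KV cache (single-head, illustrative).

    q_new, k_new, v_new: (d,) projections of the newest token.
    k_cache, v_cache:    (t, d) keys/values for all previous tokens.
    """
    # Append the new key/value instead of recomputing them for the whole prefix.
    k_cache = np.vstack([k_cache, k_new])           # (t + 1, d)
    v_cache = np.vstack([v_cache, v_new])           # (t + 1, d)

    # Attend the new query over everything cached so far.
    scores = k_cache @ q_new / np.sqrt(len(q_new))  # (t + 1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over all cached positions
    return weights @ v_cache, k_cache, v_cache      # attention output is (d,)
```

The cache grows by one row per generated token, which is exactly the state you have to keep in memory for the whole generation.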
There are far bigger experts in this, and there are so many levels of detail that you can go into. Essentially, one of the key quote-unquote drawbacks of the attention operator and the transformer is that the memory cost grows quadratically with the context length.
So as you put in longer questions, the memory used to make that computation goes up quadratically. You'll hear about a lot of other language model architectures that are sub-quadratic or linear attention forms, like state-space models. We don't need to go down all of these now.
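For a sense of where the quadratic comes from, here is a toy numpy illustration with made-up sizes; optimized kernels avoid materializing this full matrix, but the naive formulation shows the scaling:

```python
import numpy as np

n, d = 4096, 128                        # context length, per-head dimension (made up)
Q = np.random.randn(n, d).astype(np.float32)
K = np.random.randn(n, d).astype(np.float32)

scores = Q @ K.T                        # naive attention scores: an (n, n) matrix
print(scores.shape)                     # (4096, 4096)
print(f"{scores.nbytes / 1e6:.0f} MB")  # ~67 MB for one head at fp32; doubling n quadruples it
```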
And then there are innovations on attention to improve this memory usage and to make attending over long contexts much more accurate and performant.
They help with memory constraints and performance. So if you put a book into... I think Gemini is the model with the longest context length that people are using. Gemini is known for a 1 million and now 2 million token context length. You put a whole book into Gemini and... sometimes it'll draw facts out of it. It's not perfect. They're getting better. So there are two things.
One is to be able to serve this at the memory level. Google has magic with their TPU stack where they can serve really long contexts. And then there are also many decisions along the way to actually make long-context performance work. This involves the data. There are subtle changes to these attention computations. And it changes the architecture.
But serving long context is extremely memory-constrained, especially when you're making a lot of predictions. I actually don't know the full details of why output tokens are more expensive than input tokens, but I think essentially with output tokens you have to do more computation because you have to sample from the model.
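A rough sketch of that asymmetry, with hypothetical `prefill`, `decode_step`, and `sample` names standing in for whatever a real serving stack calls them:

```python
def generate(model, prompt_tokens, max_new_tokens, sample):
    """Why output tokens cost more to serve than input tokens (sketch)."""
    # Prefill: every prompt (input) token goes through the model in one
    # parallel forward pass, producing the KV cache.
    kv_cache, logits = model.prefill(prompt_tokens)

    outputs = []
    for _ in range(max_new_tokens):
        # Decode: each output token needs its own sampling step and its own
        # forward pass over the growing cache, so generation is sequential.
        token = sample(logits)
        outputs.append(token)
        logits, kv_cache = model.decode_step(token, kv_cache)
    return outputs
```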
So these are features that APIs are shipping, like prompt caching and pre-filling, because you can drive prices down and make APIs much faster. If you run a business and you know you're going to keep passing the same initial content to Claude's API, you can load that into the Anthropic API and always keep it there.
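A sketch of what that looks like with the Anthropic Python SDK's prompt caching; the `cache_control` field follows Anthropic's published docs, but the model name and file path here are illustrative, so check the current documentation:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder for the large block of content you reuse on every request.
LONG_REFERENCE_DOCUMENT = open("company_handbook.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-latest",   # illustrative model alias
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_REFERENCE_DOCUMENT,
            # Mark this prefix as cacheable so repeated calls can reuse the
            # server-side cache instead of re-processing the whole document.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize chapter 3 for a new hire."}],
)
print(response.content[0].text)
```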