Memory, right?
We need to go through a lot of the specific technical details of transformers to make this easy for people.
Yeah, so the attention operator has three core things: queries, keys, and values. QKV is what goes into it. If you look at the equation, you see that these matrices are multiplied together.
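A minimal NumPy sketch of that equation, softmax(QKᵀ/√d)·V, with toy shapes made up for illustration (single head, no masking):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # [n_q, n_k]: similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V  # reweighted sum of the values

# Toy example: 4 tokens, 8-dim vectors (sizes chosen only for illustration).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)  # shape [4, 8]: one output vector per token
```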
These words, query, key, and value, come from information retrieval, where the query is the thing you're trying to get values for: you match it against the keys, and the values get reweighted accordingly. My background's not in information retrieval or things like that. It's just fun to have those backlinks.
What effectively happens is that when you're doing these matrix multiplications, you're working with matrices sized by the context length, the number of tokens you put into the model. The KV cache is effectively a compressed representation of all the previous tokens the model has seen. This is where autoregressive models come in.
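Because the cache holds one key vector and one value vector per token, per layer, its size grows linearly with context length. A back-of-the-envelope sketch, with model dimensions invented purely for illustration:

```python
# Hypothetical model dimensions, chosen only for illustration.
n_layers, n_heads, d_head = 32, 32, 128
seq_len = 4096          # tokens currently in context
bytes_per_value = 2     # fp16

# Per layer the cache holds one K and one V matrix of shape
# [seq_len, n_heads * d_head] -- linear in the context length.
kv_bytes = n_layers * 2 * seq_len * n_heads * d_head * bytes_per_value
print(f"KV cache: {kv_bytes / 2**30:.2f} GiB")  # 2.00 GiB for these numbers
```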
You predict one token at a time. You start with whatever your prompt was. You ask a question like, who was the president in 1825? The model then generates its first token. For each of these tokens, you're doing the same attention operator, multiplying these query, key, and value matrices.
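The generation loop itself is simple. Here is a runnable sketch with a stand-in "model" (just an embedding lookup and an output projection, no real transformer), which shows the structure: each step reruns the model over the whole sequence so far and appends one token.

```python
import numpy as np

VOCAB, D = 100, 8
rng = np.random.default_rng(0)
E = rng.standard_normal((VOCAB, D))      # toy embedding table
W_out = rng.standard_normal((D, VOCAB))  # toy output projection

def toy_model(ids):
    """Stand-in for a transformer: embed tokens, then project to logits.
    (The attention layers would sit between these two steps.)"""
    h = E[ids]          # [seq_len, D]
    return h @ W_out    # [seq_len, VOCAB]: logits for every position

def generate(prompt_ids, n_new):
    """Greedy autoregressive decoding: one token per step, full recompute."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = toy_model(ids)                 # rerun over the whole sequence
        ids.append(int(np.argmax(logits[-1])))  # take the most likely next token
    return ids

print(generate([1, 2, 3], 5))
```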
But the math works out very nicely, so that when you're doing this repeatedly, you can keep appending the new keys and values to the KV cache. You keep track of the previous values you're inferring over in this autoregressive chain, and you keep them in memory the whole time. This is a really crucial thing to manage when serving inference at scale.
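A sketch of that incremental step, with hypothetical projection matrices Wq, Wk, Wv standing in for a single attention head: each new token computes its own q, k, v, the k and v get appended to the cache, and attention runs against everything cached so far, with no recomputation for old tokens.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache = np.zeros((0, d))  # grows by one row per generated token
V_cache = np.zeros((0, d))

def decode_step(x, K_cache, V_cache):
    """Attention for ONE new token: compute its q/k/v, append k and v to the
    cache, then attend over everything cached so far."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    scores = q @ K_cache.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache, K_cache, V_cache

for _ in range(4):                  # four decode steps
    x = rng.standard_normal(d)      # stand-in for the new token's hidden state
    out, K_cache, V_cache = decode_step(x, K_cache, V_cache)

print(K_cache.shape)  # (4, 8): one cached key per token seen so far
```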