Arvid Lundmark
And then you can give a reward to the things that humans would like more and sort of punish the things that humans won't like, and then train the model to output the suggestions that humans would like more. You have these RL loops that are very useful, which exploit these pass@k curves. Aman maybe can go into even more detail.
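What's being described here is a preference-reward RL loop. A minimal REINFORCE-style sketch of that idea might look like the following; the linear "model" and the toy reward function are illustrative stand-ins, not Cursor's actual training setup:

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 32
model = torch.nn.Linear(hidden, vocab_size)   # stand-in for a code model's output head
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def reward(tokens):
    # Hypothetical stand-in for a learned human-preference reward model.
    return float(tokens.sum().item() % 2 == 0)  # toy signal, not a real metric

state = torch.randn(1, hidden)                 # stand-in for a hidden state
probs = F.softmax(model(state), dim=-1)
sample = torch.multinomial(probs, num_samples=1)

# Policy-gradient step: subtracting a 0.5 baseline means liked samples
# (reward 1) get pushed up and disliked ones (reward 0) get pushed down.
advantage = reward(sample) - 0.5
loss = -advantage * torch.log(probs.gather(1, sample)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```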
But MLA is from this company called DeepSeek. It's quite an interesting algorithm. Maybe the key idea is that in both MQA and other approaches, what you're doing is reducing the number of KV heads. The advantage you get from that is that there are fewer of them.
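As a back-of-the-envelope illustration of what cutting KV heads buys, here's a sketch with made-up dimensions (not any particular model's config):

```python
# Back-of-the-envelope sketch of KV cache size as KV heads shrink.
n_layers, n_heads, head_dim, bytes_fp16 = 32, 32, 128, 2

def kv_cache_bytes_per_token(n_kv_heads):
    # One K and one V vector per KV head, per layer, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_fp16

print("MHA (all heads):", kv_cache_bytes_per_token(32))  # 524288 bytes
print("GQA (8 groups) :", kv_cache_bytes_per_token(8))   # 131072 bytes
print("MQA (1 head)   :", kv_cache_bytes_per_token(1))   #  16384 bytes
```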
But maybe the theory is that you actually want each of the keys and values to actually be different. So one way to reduce the size is to keep one big shared vector for all the keys and values, and then have smaller vectors for every single token, so that you can store only the smaller thing. There's some sort of low-rank reduction.
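Here's a minimal sketch of that low-rank idea, MLA-flavored but with illustrative dimensions; the weight names W_down, W_up_k, W_up_v are my own labels, not DeepSeek's:

```python
import torch

# Dimensions are illustrative, not DeepSeek's actual config.
d_model, d_latent, n_heads, head_dim = 4096, 512, 32, 128

# Shared weights (stored once for the whole model, not per token):
W_down = torch.randn(d_model, d_latent) / d_model ** 0.5
W_up_k = torch.randn(d_latent, n_heads * head_dim) / d_latent ** 0.5
W_up_v = torch.randn(d_latent, n_heads * head_dim) / d_latent ** 0.5

h = torch.randn(1, d_model)   # hidden state for one new token
latent = h @ W_down           # the only thing cached: 512 floats per token

# Re-expand on demand into full per-head keys and values:
k = (latent @ W_up_k).view(n_heads, head_dim)
v = (latent @ W_up_v).view(n_heads, head_dim)

# Cache cost per token: 512 floats vs 2 * 32 * 128 = 8192 uncompressed.
```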
And with the low-rank reduction, at the end, when you eventually want to compute the final thing, remember that you're memory-bound, which means that you still have some compute left that you can use for these things. So you can expand the latent vector back out when you need it.
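Rough arithmetic for why that re-expansion is nearly free, under the assumption that decoding is bandwidth-bound (toy dimensions carried over from the sketch above):

```python
# Trading a few extra FLOPs for far fewer bytes read from the KV cache
# is a net win when the bottleneck is memory bandwidth, not compute.
d_latent, n_heads, head_dim = 512, 32, 128

floats_read_full = 2 * n_heads * head_dim            # K + V per token
floats_read_latent = d_latent                        # compressed cache
extra_flops = 2 * 2 * d_latent * n_heads * head_dim  # two up-projections

print(f"cache reads shrink ~{floats_read_full // floats_read_latent}x "
      f"per token for ~{extra_flops / 1e6:.1f} MFLOPs of extra compute")
```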
And somehow this is far more efficient, because you're reducing the size of the vector that you're keeping by, for example, maybe 32x or something. Yeah, there's perhaps some richness in having a separate set of keys, values, and queries that kind of pairwise match up, versus compressing that all into one, and that interaction at least...
But it also allows you to make your prompt bigger for certain.
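A bit of budget math makes that concrete; the 16 GiB cache budget and the dimensions here are made up for illustration:

```python
# With a fixed amount of GPU memory set aside for the KV cache, a smaller
# per-token cache translates directly into a longer maximum prompt.
budget = 16 * 1024 ** 3                    # bytes reserved for the KV cache

per_token_full = 2 * 32 * 32 * 128 * 2     # K+V, layers * heads * dim, fp16
per_token_latent = 32 * 512 * 2            # one 512-dim latent per layer

print("max tokens, full KV:", budget // per_token_full)     # ~32K
print("max tokens, latent :", budget // per_token_latent)   # ~512K
```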
Let's opine on bug finding.
To be clear, I think they sort of understand code really well. While they're being pre-trained, in the representation that's being built up, almost certainly somewhere in the stream, the model knows that maybe there's something sketchy going on. It sort of has some sense of the sketchiness, but actually eliciting the sketchiness to...