Reiner Pope

Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So you said around two kilobytes.

So let's just do a sanity check for what this could be.

There are two mechanisms people use to do attention with a small number of bytes per token.

One is dense attention with a lot of reuse across layers.

So Character AI has a blog post talking about that, alternating long and short context.

And in the Character AI kind of model, which also showed up in the Gemma models, the global context, which is really what we're talking about here, was shared across all the layers. And so to get this two kilobytes, you could get that, for example, like this: a d-head of 128 is typical, and then the number of bytes is typically number of attention layers times two times d-head times number of Q heads.

So this is the number of unique contexts per layer.

Do you share the context across many layers or do you use it only once?

So in Character AI-like models, this number is one.

We said this is 128.

And this is a choice which typically ranges from one, sorry, this is KV heads, I meant.

The difference between a Q head and a KV head is that the KV heads are the heads that are stored in memory; they store the contents of the previous tokens. The Q heads are the retrieval heads. They're only used temporarily, and they're used by the attending token. So in this autoregressive context, I've got KV heads associated with all of the context, and then Q heads associated with this new token here.
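
To make the stored-versus-temporary distinction concrete, here is a minimal sketch of one autoregressive decoding step with a single head. The names `KVCache`, `attend`, and `decode_step` are hypothetical and chosen only for illustration; they are not from the conversation or from any particular library. The point is only that the key and value vectors are appended to memory for every token, while the query vector is used once and then dropped.

```python
import math


def attend(q, keys, values):
    """Single-head attention of one query vector over all cached keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


class KVCache:
    """Hypothetical cache: one key and one value vector kept per past token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step(cache, q, k, v):
    """Process one new token: k and v persist in the cache, q does not."""
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)


cache = KVCache()
first = decode_step(cache, q=[1.0, 0.0], k=[1.0, 0.0], v=[1.0, 2.0])
second = decode_step(cache, q=[0.0, 1.0], k=[0.0, 1.0], v=[3.0, 4.0])
print(len(cache.keys), len(cache.values))  # one cached entry per token seen
```

After two tokens the cache holds two keys and two values, and neither query is retained anywhere, which is why only the KV heads count toward bytes per token.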

But this d-head, the 128?

Oh, this is... This number is actually the same for... Oh, sorry.

This d-head is the dimension of the vector.

And number of kv-heads is typically in the range of 1 to 8.

So, like, it is totally plausible to get this by, for example, having 8 kv-heads and a d-head of 128.
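
The sanity check above can be sketched as a few lines of arithmetic. The function name `kv_bytes_per_token` and the `bytes_per_element` parameter are assumptions added for illustration; the formula as spoken has no element-size term, so the sketch assumes one byte per stored element, which is what makes 8 KV heads with a d-head of 128 and a single shared context come out to two kilobytes. The factor of two covers keys and values.

```python
def kv_bytes_per_token(n_unique_contexts, n_kv_heads, d_head, bytes_per_element=1):
    """KV-cache bytes stored per token.

    n_unique_contexts: attention layers that keep their own context
                       (1 when the global context is shared across layers).
    bytes_per_element: assumption, 1 byte per stored value; not stated in the talk.
    """
    keys_and_values = 2
    return n_unique_contexts * keys_and_values * n_kv_heads * d_head * bytes_per_element


# Shared global context, 8 KV heads, d-head of 128.
shared = kv_bytes_per_token(n_unique_contexts=1, n_kv_heads=8, d_head=128)
print(shared)  # 2048 bytes, i.e. two kilobytes

# Hypothetical contrast: the same heads with 32 layers each storing their own context.
unshared = kv_bytes_per_token(n_unique_contexts=32, n_kv_heads=8, d_head=128)
print(unshared)  # 65536 bytes
```

Under a two-byte element format the shared case would be four kilobytes instead, so the two-kilobyte figure depends on the element-size assumption as well as on the head counts.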