Reiner Pope

the global context, which is really what we're talking about here. Global context was shared across all the layers. And so to get this two kilobytes, you could get that, for example, as: a d_head of 128 is typical, and then the number of bytes is typically the number of attention layers times

Dwarkesh Podcast

Reiner Pope – The math behind how LLMs are trained and served

two times d_head times the number of Q heads.

So this is the number of unique contexts per layer.

Do you share the context across many layers or do you use it only once?

So in Character.AI-like models, this number is one.

We said this is 128.

And this is a choice which typically ranges from one, sorry, this is KV heads, I meant.
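As a rough sketch of the arithmetic being described here, the per-token KV cache size is the product of the factors just named. The function name, the parameter names, and the `bytes_per_element` factor are assumptions for illustration. The conversation does not state the number of KV heads or the element width, so the values below are just one hypothetical combination that lands on two kilobytes.

```python
def kv_cache_bytes_per_token(unique_contexts, d_head, n_kv_heads, bytes_per_element=1):
    """Bytes of KV cache stored per token.

    unique_contexts: how many distinct KV contexts exist across the layers
                     (1 if a single context is shared by every layer).
    The factor of 2 covers storing both a key vector and a value vector.
    bytes_per_element is an assumption; it is not given in the conversation.
    """
    return unique_contexts * 2 * d_head * n_kv_heads * bytes_per_element


# d_head = 128 and one shared context come from the discussion.
# n_kv_heads = 8 and 1 byte per element are hypothetical values.
example = kv_cache_bytes_per_token(unique_contexts=1, d_head=128, n_kv_heads=8)
print(example)  # 2048
```

Other combinations give the same total, for instance 4 KV heads at 2 bytes per element.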

The difference between a Q head and a KV head is that the KV heads are the heads that are stored in memory; they store the contents of the previous tokens. The Q heads are the retrieval heads. They're only used temporarily, and they're used by the attending token. So in this autoregressive context, I've got KV heads associated with all of the context, and then Q heads associated with this new token here.
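To make the distinction concrete, here is a minimal pure-Python sketch of one autoregressive decode step for a single layer. It is a toy under stated assumptions, not any particular model's implementation: the function `decode_step`, the random vectors standing in for learned projections, and the grouping of several Q heads onto one KV head are all hypothetical choices made for illustration.

```python
import math
import random


def decode_step(kv_cache, new_k, new_v, new_q, n_q_heads, n_kv_heads):
    """Append the new token's K/V to the cache, then attend with its Q heads.

    kv_cache: one list per KV head, holding a (key, value) pair per context token.
    new_k, new_v: one vector per KV head for the new token (these are stored).
    new_q: one vector per Q head for the new token (used here, then discarded).
    """
    group = n_q_heads // n_kv_heads  # assumed: Q heads share KV heads in equal groups
    for h in range(n_kv_heads):
        kv_cache[h].append((new_k[h], new_v[h]))

    outputs = []
    for qh in range(n_q_heads):
        entries = kv_cache[qh // group]
        q = new_q[qh]
        d = len(q)
        # Scaled dot-product scores of the new token against every cached key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k, _ in entries]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append(
            [sum(w * v[i] for w, (_, v) in zip(weights, entries)) / total for i in range(d)]
        )
    return outputs


random.seed(0)
d_head, n_q_heads, n_kv_heads = 8, 4, 2  # hypothetical sizes


def rand_vecs(n):
    return [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(n)]


kv_cache = [[] for _ in range(n_kv_heads)]
for _ in range(3):  # decode three tokens
    outputs = decode_step(
        kv_cache, rand_vecs(n_kv_heads), rand_vecs(n_kv_heads),
        rand_vecs(n_q_heads), n_q_heads, n_kv_heads,
    )

print(len(kv_cache[0]), len(outputs))  # cached tokens per KV head, Q-head outputs
```

After three steps each KV head holds three cached entries, while the Q vectors from earlier steps are gone: only keys and values persist in memory.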

But this d_head, the 128?

Oh, this is... This number is actually the same for... Oh, sorry.

This d_head is the dimension of the vector.