
Reiner Pope
Speaker

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

And so there is some time to fetch all of the total number of parameters, not just the active parameters. So there's weight fetch time. And then in addition, there's a KV cache fetch time. This actually depends on batch size: for every element of the batch, we have to fetch an entire context length's worth of tokens, and then there's a size per token, so, like, bytes for one token. And so there's a model parameter.

Yeah.
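
A rough back-of-the-envelope version of this accounting, as a minimal Python sketch. Every concrete number here (parameter count, bytes per parameter, per-token KV-cache size, batch size, context length, memory bandwidth) is an illustrative assumption, not a figure from the episode.

```python
# Rough per-decode-step memory-traffic estimate; all numbers are illustrative
# assumptions, not values from the conversation.
bytes_per_param = 2          # e.g. bf16/fp16 weights
total_params = 70e9          # total parameters fetched each step, not just the active ones

batch_size = 32
context_len = 4096           # tokens of context per batch element
kv_bytes_per_token = 1e5     # assumed KV-cache size per token; this is a property of the
                             # model (roughly 2 * layers * kv_heads * head_dim * bytes/element)

hbm_bandwidth = 3e12         # assumed accelerator memory bandwidth, bytes/s

weight_fetch_bytes = total_params * bytes_per_param
kvcache_fetch_bytes = batch_size * context_len * kv_bytes_per_token

step_time_s = (weight_fetch_bytes + kvcache_fetch_bytes) / hbm_bandwidth
print(f"weights: {weight_fetch_bytes / 1e9:.0f} GB, "
      f"kv cache: {kvcache_fetch_bytes / 1e9:.1f} GB, "
      f"~{step_time_s * 1e3:.0f} ms per decode step")
```

In this framing, the weight-fetch term is fixed per decode step, while the KV-cache term grows with batch size times context length, which is the batch-size dependence described above.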

So when I do a forward pass, let me actually draw how the autoregressive inference works. This is during decode. Say I have a bunch of tokens of text. I'm drawing a tensor, because ultimately the tokens are represented as, like, a tensor in some embedding dimension. And then in this direction, I have the sequence length.

The work of running a decode is that I have to run each token through a whole bunch of matrix multiplies over a bunch of different layers. And in general, I'm going to have to do that work over all of these tokens. But then one step of decode is actually to produce just this one additional token out here. And so what I'm going to do there is run a full forward pass, multiplying by all of the weight matrices in the entire model. But then I've got this attention mechanism, where this token is sort of looking at all of the past tokens.

And what is it looking at specifically?

It is looking at some internal representation that the model has produced of the tokens.
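
To make that picture concrete, here is a minimal single-layer, single-head NumPy sketch of one decode step, with made-up dimensions and random weights: the new token's embedding is multiplied through the layer's weight matrices, while attention reads the cached keys and values, i.e. the stored internal representations of all the past tokens.

```python
import numpy as np

# One autoregressive decode step for a single layer with a single attention head.
# Dimensions and weights are made up purely for illustration.
d_model = 16
rng = np.random.default_rng(0)

# Per-layer weight matrices (reading these is the "weight fetch" cost).
W_q = rng.standard_normal((d_model, d_model)) * 0.02
W_k = rng.standard_normal((d_model, d_model)) * 0.02
W_v = rng.standard_normal((d_model, d_model)) * 0.02
W_o = rng.standard_normal((d_model, d_model)) * 0.02

def decode_step(x_new, k_cache, v_cache):
    """Run one new token through the layer, attending over all cached tokens.

    x_new:   (d_model,) embedding of the newest token
    k_cache: (t, d_model) keys of the t past tokens (the KV cache)
    v_cache: (t, d_model) values of the t past tokens
    """
    q = x_new @ W_q                               # query is computed for the new token only
    k_cache = np.vstack([k_cache, x_new @ W_k])   # append this token's key
    v_cache = np.vstack([v_cache, x_new @ W_v])   # append this token's value

    # Attention: the new token looks at the stored representations of every
    # past token (plus itself); this read is the "KV-cache fetch" cost.
    scores = k_cache @ q / np.sqrt(d_model)       # (t+1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    attn_out = weights @ v_cache                  # (d_model,)

    return attn_out @ W_o, k_cache, v_cache

# Pretend five tokens were already decoded, so their keys/values sit in the cache.
k_cache = rng.standard_normal((5, d_model))
v_cache = rng.standard_normal((5, d_model))
x_new = rng.standard_normal(d_model)

y, k_cache, v_cache = decode_step(x_new, k_cache, v_cache)
print(y.shape, k_cache.shape)   # (16,) (6, 16)
```

In a full model this repeats across every layer and head, which is why each decode step still has to fetch all of the weights even though it only computes over one new token.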