Reiner Pope
And so there is some time to fetch all of the parameters, the total number of parameters, not just the active parameters.
So there's weight fetch time.
And then in addition, there's a KV cache fetch time.
So this actually depends on batch size.
So for every element of the batch, we have to fetch an entire context length's worth of tokens, and then there's a size per token.
So like, the bytes for one token.
And so that's a model parameter.
Yeah.
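(To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The specific byte counts and bandwidth figure are illustrative assumptions, not numbers from the conversation.)

```python
# Back-of-the-envelope memory traffic for one decode step, per the discussion above.
# Weights are fetched once per step regardless of batch size; the KV cache is fetched
# once per batch element. All concrete numbers below are illustrative assumptions.

def decode_step_fetch_time(
    total_param_bytes,      # bytes to fetch for ALL weights (total, not just active, params)
    kv_bytes_per_token,     # KV cache bytes per token (a property of the model)
    context_len,            # tokens of context held in the KV cache
    batch_size,             # KV cache is fetched for every element of the batch
    hbm_bandwidth_bytes_s,  # accelerator memory bandwidth
):
    weight_fetch = total_param_bytes
    kv_fetch = batch_size * context_len * kv_bytes_per_token
    return (weight_fetch + kv_fetch) / hbm_bandwidth_bytes_s

# Illustrative example: 70B total params in bf16, 160 KiB of KV per token,
# 4k context, batch 8, ~3 TB/s of memory bandwidth.
t = decode_step_fetch_time(
    total_param_bytes=70e9 * 2,
    kv_bytes_per_token=160 * 1024,
    context_len=4096,
    batch_size=8,
    hbm_bandwidth_bytes_s=3e12,
)
print(f"~{t * 1e3:.1f} ms of memory fetch per decode step")
```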
So when I do a forward pass, let me actually draw how autoregressive inference works.
So this is during decode.
So let's say I have a bunch of tokens of text. I'm drawing a tensor because ultimately the tokens are represented as some tensor in some embedding dimension.
And then in this direction, I have the sequence length.
The work of running a decode is that I have to run each token through a whole bunch of matrix multiplies over a bunch of different layers.
And in general, I'm going to have to do that work over all of these tokens.
But then one step of decode is actually to produce just this one additional token out here.
And so what I'm going to do there is I'm going to run a full forwards pass of multiplying by all of the weight matrices in the entire model.
But then I've got this attention mechanism where this token is sort of looking at all of the past tokens in this way.
And what is it looking at specifically?
It is looking at some internal representation that the model has produced of the tokens.
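(As a rough illustration of that decode step, here is a minimal single-head attention sketch in Python/NumPy, where the "internal representation" of past tokens is the cached keys and values. The shapes, names, and numbers are assumptions for illustration, not the exact architecture being discussed.)

```python
import numpy as np

# Minimal sketch of one autoregressive decode step for a single attention head.
# The new token runs a full forward pass through the weight matrices (w_q, w_k, w_v here),
# but attention only needs the keys and values cached from past tokens, not the raw tokens.

d_model = 64  # illustrative embedding dimension

rng = np.random.default_rng(0)
w_q = rng.standard_normal((d_model, d_model))
w_k = rng.standard_normal((d_model, d_model))
w_v = rng.standard_normal((d_model, d_model))

# KV cache: the internal representations the model already produced for past tokens.
k_cache = np.empty((0, d_model))
v_cache = np.empty((0, d_model))

def decode_step(x_new, k_cache, v_cache):
    """x_new: (d_model,) embedding of the single new token being decoded."""
    q = x_new @ w_q                    # query for the new token only
    k = x_new @ w_k
    v = x_new @ w_v
    k_cache = np.vstack([k_cache, k])  # append this token's K and V to the cache
    v_cache = np.vstack([v_cache, v])
    scores = k_cache @ q / np.sqrt(d_model)   # attend over all past (cached) tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the context
    out = weights @ v_cache                   # weighted mix of cached value vectors
    return out, k_cache, v_cache

# Decode a few tokens one at a time; each step reads the whole cache built so far.
for _ in range(3):
    x_new = rng.standard_normal(d_model)
    out, k_cache, v_cache = decode_step(x_new, k_cache, v_cache)
print("cached tokens:", k_cache.shape[0])  # grows by one per decode step
```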