Reiner Pope

And so this paper, reversible networks, is applied to some layer, a transformer layer for example. We've got this function F, which is our transformer layer.

7857.218 View full episode →

Dwarkesh Podcast

Reiner Pope – The math behind how LLMs are trained and served

Now, normally we would have just an input, and then a residual connection coming out that gets added like this, over here.

But now the variation on this is that we've got two inputs, x and y.

x goes through the function, gets added to y. And then this becomes the new x, the output x.

And then this x becomes the output y. So really what this is doing, if you think two layers back, is actually the thing you mentioned before.

It's actually doing the residual connection from two layers back.

Like this y came from the previous layer and was the residual connection there.
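
The "two layers back" reading can be written out in symbols. This is only a sketch: the layer index n and the per-layer function F_n are notation assumed here, not something used in the conversation.

```latex
\begin{aligned}
x_{n+1} &= y_n + F_n(x_n), \qquad y_{n+1} = x_n \\
y_n = x_{n-1} \;&\Longrightarrow\; x_{n+1} = x_{n-1} + F_n(x_n)
\end{aligned}
```

So the term added to F_n(x_n) is the x from two layers earlier, rather than the current x as in an ordinary residual connection.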

But because of this construction, the whole thing is invertible.
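
A minimal sketch of why the construction is invertible, in plain Python. The function `f` below is a hypothetical stand-in for the layer F, and the names `forward` and `inverse` are illustrative assumptions rather than anything from the paper. The point it tries to show is that F itself never has to be inverted: the old x is passed through unchanged as the output y, so F(x) can be recomputed and subtracted.

```python
import math

def f(x):
    # Hypothetical stand-in for the layer function F.
    # Any deterministic function works; it need not be invertible itself.
    return [math.tanh(2.0 * v) + 0.5 for v in x]

def forward(x, y):
    # x goes through the function and gets added to y: that is the output x.
    # The input x is passed through unchanged as the output y.
    fx = f(x)
    x_out = [yi + fi for yi, fi in zip(y, fx)]
    y_out = list(x)
    return x_out, y_out

def inverse(x_out, y_out):
    # The output y is exactly the old x, so F(x) can be recomputed
    # and subtracted from the output x to recover the old y.
    x = list(y_out)
    fx = f(x)
    y = [xo - fi for xo, fi in zip(x_out, fx)]
    return x, y

x0, y0 = [0.1, -0.7, 1.3], [0.4, 0.2, -0.9]
x1, y1 = forward(x0, y0)
x_rec, y_rec = inverse(x1, y1)
```

Here `x_rec` matches `x0` exactly, while `y_rec` matches `y0` only up to floating-point rounding, since it comes from adding and then subtracting the same value.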