Reiner Pope

Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

If I think of a forward pass in training, let's say I have four layers.

I run them in the 0, 1, 2, 3 order.

I have to write all of the activations to HBM.

And so I get an HBM footprint here that is kind of like linear in number of layers.
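The linear growth can be made concrete with a back-of-the-envelope estimate. This is a minimal sketch, assuming each layer saves exactly one activation tensor of shape (tokens, d_model); real models typically save several tensors per layer. The function name and all parameters are hypothetical, chosen only for illustration.

```python
def activation_bytes(num_layers, tokens_per_batch, d_model, bytes_per_value=2):
    """Rough HBM footprint of saved activations for one training step.

    Assumption: one saved tensor of shape (tokens_per_batch, d_model) per
    layer, stored at bytes_per_value bytes per element (2 for bf16/fp16).
    """
    per_layer = tokens_per_batch * d_model * bytes_per_value
    # Every layer's activations are kept until the backward pass,
    # so the total grows linearly with the number of layers.
    return num_layers * per_layer


# Doubling the layer count doubles the footprint.
four_layers = activation_bytes(num_layers=4, tokens_per_batch=1024, d_model=512)
eight_layers = activation_bytes(num_layers=8, tokens_per_batch=1024, d_model=512)
```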

Yep.

So this actually can be the largest memory footprint during training.

And so this is normal training.

And then I run the backwards pass and I read it kind of in reverse.

Like, the forward pass goes forward, and the backward pass goes backwards.

And I have to read them back out.
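The store-then-read-in-reverse pattern described here can be sketched on a toy model. This is only an illustration, assuming a chain of four scalar layers of the form y = tanh(w * x); the weights and function names are hypothetical, and the Python list stands in for activations written to HBM.

```python
import math

# Hypothetical per-layer weights for a four-layer scalar "network".
WEIGHTS = [0.5, -1.2, 0.8, 1.5]


def forward(x):
    """Run layers 0, 1, 2, 3 in order, saving each layer's input."""
    saved = []
    for w in WEIGHTS:
        saved.append(x)          # stand-in for writing activations to HBM
        x = math.tanh(w * x)
    return x, saved


def backward(saved, grad_out=1.0):
    """Walk the layers in reverse, reading the saved inputs back out."""
    grad = grad_out
    for w, x in zip(reversed(WEIGHTS), reversed(saved)):
        y = math.tanh(w * x)
        # d/dx tanh(w * x) = (1 - tanh(w * x)^2) * w
        grad = grad * (1.0 - y * y) * w
    return grad


output, saved = forward(0.3)
grad_input = backward(saved)
```

The saved list holds one entry per layer, which is the linear-in-layers footprint mentioned earlier.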

The idea of this RevNets paper is that because it's invertible, I don't need to store this at all.

I can completely rematerialize it when I'm running my backwards pass.

So I run my forwards pass, and then when I'm running my backwards pass, I'm simultaneously in lockstep undoing all of the forwards pass steps that I did in order to have the activations that I need here.
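A minimal sketch of the undo step, assuming the additive coupling form commonly used for reversible layers (y1 = x1 + F(x2), y2 = x2 + G(y1)). The residual functions F and G below are arbitrary stand-ins, and every name is hypothetical. The sketch only shows recovering activations from the final output; the gradient computation that would run alongside it is omitted.

```python
import math


def make_layer(k):
    """Return hypothetical residual functions (F, G) for layer k."""
    def f(x):
        return [math.tanh(k * v) for v in x]

    def g(x):
        return [0.5 * math.sin(v + k) for v in x]

    return f, g


def rev_forward(x1, x2, layer):
    f, g = layer
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2, layer):
    """Exactly undo rev_forward, up to floating-point rounding."""
    f, g = layer
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


LAYERS = [make_layer(k) for k in range(1, 5)]  # four layers


def run_forward(x1, x2):
    """Forward pass that keeps only the final output, no per-layer storage."""
    for layer in LAYERS:
        x1, x2 = rev_forward(x1, x2, layer)
    return x1, x2


def recover_inputs(y1, y2):
    """Walk the layers in reverse, rematerializing each layer's input."""
    recovered = []
    for layer in reversed(LAYERS):
        y1, y2 = rev_inverse(y1, y2, layer)
        recovered.append((y1, y2))
    return recovered  # inputs of layers 3, 2, 1, 0, in that order


x1, x2 = [0.1, -0.2], [0.3, 0.4]
out1, out2 = run_forward(x1, x2)
recovered = recover_inputs(out1, out2)
```

The last entry of `recovered` matches the original input up to rounding, so no intermediate activations had to be kept between the two passes.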

So this ends up being a memory saving, which is a nice idea.

Yeah.

Spending more compute to save memory is generally profitable given where hardware is at.

Yeah, interesting.

← Previous Page 58 of 58 Next β†’