Reiner Pope
If I think of a forward pass of training, let's say I have four layers.
I run them in the 0, 1, 2, 3 order.
I have to write all of the activations to HBM.
And so I get an HBM footprint here that is linear in the number of layers.
Yep.
So this actually can be the largest memory footprint during training.
And so this is normal training.
And then I run the backwards pass and I read it kind of in reverse.
The forward pass goes forward, the backward pass goes backwards.
And I have to read them back out.
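To make the bookkeeping concrete, here is a minimal sketch of that write-then-read-in-reverse pattern. The function name `forward_then_backward` and the use of a plain list as a stand-in for HBM are assumptions for illustration only; real frameworks track tensors, not layer indices.

```python
def forward_then_backward(num_layers=4):
    """Toy model of activation storage in ordinary training."""
    stored = []  # stands in for activations written to HBM
    peak = 0

    # Forward pass: layers run in order 0, 1, 2, 3, each saving its activation.
    for layer in range(num_layers):
        stored.append(layer)
        peak = max(peak, len(stored))

    # Backward pass: activations are read back in reverse order.
    read_order = []
    while stored:
        read_order.append(stored.pop())

    return peak, read_order


print(forward_then_backward(4))  # (4, [3, 2, 1, 0])
```

The peak number of stored activations equals the number of layers, which is the linear footprint described above.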
The idea of this RevNets paper is that because it's invertible, I don't need to store this at all.
I can completely rematerialize it when I'm running my backwards pass.
So I run my forwards pass, and then when I'm running my backwards pass, I'm simultaneously in lockstep undoing all of the forwards pass steps that I did in order to have the activations that I need here.
So this ends up being a memory saving, which is a nice idea.
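As a rough sketch of what "undoing the forward pass" can look like, the block below uses the additive coupling commonly associated with reversible residual networks: the input is split into two halves, and each half is updated using a function of the other. The residual functions `f` and `g` here are arbitrary placeholders chosen for illustration, not anything from the conversation, and gradient computation is left out entirely; only the recovery of activations is shown.

```python
import math


def f(v):
    # Placeholder residual function; any deterministic function works.
    return [math.tanh(2.0 * a) for a in v]


def g(v):
    # Second placeholder residual function.
    return [0.5 * a * a for a in v]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def rev_forward(x1, x2):
    # One reversible layer: y1 = x1 + F(x2), y2 = x2 + G(y1).
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2


def rev_inverse(y1, y2):
    # Undo the layer using only its outputs.
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2


def run_forward(x1, x2, num_layers=4):
    # Forward pass keeps only the final output, no per-layer activations.
    for _ in range(num_layers):
        x1, x2 = rev_forward(x1, x2)
    return x1, x2


def recover_inputs(y1, y2, num_layers=4):
    # Backward direction: rebuild each layer's input in reverse order.
    recovered = []
    for _ in range(num_layers):
        y1, y2 = rev_inverse(y1, y2)
        recovered.append((y1, y2))
    return recovered
```

Because the inverse is computed in floating point, the recovered activations match the originals only up to rounding error, and each layer's functions are evaluated a second time during recovery.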
Yeah.
Spending more compute to save memory is generally profitable given where hardware is at.
Yeah, interesting.