Andrew Ilyas
And the nice thing is, once you've sort of done this linearization, what you have, this f hat, is now a linear function in your parameter vector theta.
And when you have a linear function in your parameter vector theta, the influence function comes back into play.
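As a rough sketch of what that linearization buys you, here is a toy numpy example (the model f and every name in it are illustrative stand-ins, not anything from the TRAK codebase): the first-order Taylor expansion around the trained parameters theta_star is affine in theta, which is exactly the setting where the linear-model influence function applies.

```python
import numpy as np

# Toy nonlinear model: f(x; theta) = tanh(theta . x), a stand-in for a neural network.
def f(x, theta):
    return np.tanh(x @ theta)

def grad_theta_f(x, theta):
    # d/dtheta tanh(x . theta) = (1 - tanh(x . theta)^2) * x
    return (1.0 - np.tanh(x @ theta) ** 2) * x

rng = np.random.default_rng(0)
p = 5
x = rng.normal(size=p)
theta_star = rng.normal(size=p)   # final trained parameters

# First-order Taylor expansion around theta_star:
# f_hat(x; theta) = f(x; theta_star) + grad_f(x; theta_star) . (theta - theta_star)
def f_hat(x, theta):
    g = grad_theta_f(x, theta_star)
    return f(x, theta_star) + g @ (theta - theta_star)

# f_hat is affine in theta, so the linear-model influence-function machinery applies.
theta_a, theta_b = rng.normal(size=p), rng.normal(size=p)
lhs = f_hat(x, 0.5 * theta_a + 0.5 * theta_b)
rhs = 0.5 * f_hat(x, theta_a) + 0.5 * f_hat(x, theta_b)
print(np.isclose(lhs, rhs))  # True: affine combinations are preserved
```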
Now, there are a couple of tricks that we need to do here.
Because first of all, like I was saying earlier, we're trying to study the learning algorithm, not necessarily a specific model.
But the thing I just described is only a single model.
And so what we actually need to do is apply this whole process over and over again: train a model, get the final parameters, take the Taylor approximation, compute the influence function, and then ensemble the results across all of those runs.
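A minimal sketch of that ensembling loop, assuming each model's per-example gradient features have already been computed; the random matrices below are stand-ins for real training runs, and attribution_scores is a hypothetical helper implementing a linear-model, influence-style score.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, k, K = 100, 3, 16, 5   # K independently trained models (illustrative sizes)

def attribution_scores(Phi_train, Phi_test, lam=1e-3):
    # Influence-style scores for a linear(ized) model:
    # score(test, train_i) = phi_test^T (Phi^T Phi + lam I)^{-1} phi_train_i
    H = Phi_train.T @ Phi_train + lam * np.eye(Phi_train.shape[1])
    return Phi_test @ np.linalg.solve(H, Phi_train.T)

scores = np.zeros((n_test, n_train))
for _ in range(K):
    # Stand-in for: train a model, linearize at its final parameters,
    # and collect per-example (projected) gradient features.
    Phi_train = rng.normal(size=(n_train, k))
    Phi_test = rng.normal(size=(n_test, k))
    scores += attribution_scores(Phi_train, Phi_test)
scores /= K   # ensemble the per-model estimates
print(scores.shape)  # (n_test, n_train)
```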
The second thing we need to deal with is that the gradient vectors we use to compute the influence functions are huge. Their dimension is the number of parameters in the neural network, because the linear approximation we've made is linear in the network's parameter space.
And so to get around that, what we originally proposed was randomly projecting those gradient vectors down to some manageable dimension before applying the influence function.
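A minimal sketch of that projection step, with made-up dimensions (a real network has millions or billions of parameters, and in practice the projection is applied in a more memory-friendly way than materializing one dense matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50_000   # parameter count: gradients live in R^p (far larger in practice)
k = 64       # projection dimension (illustrative)

# A random Gaussian projection P maps a gradient g in R^p to P^T g in R^k.
P = rng.normal(size=(p, k)) / np.sqrt(k)

g = rng.normal(size=p)   # stand-in for one per-example gradient
g_proj = P.T @ g         # manageable k-dimensional feature for the influence function
print(g_proj.shape)      # (64,)
```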
And so those two extra tricks, ensembling over several trained models and randomly projecting to a smaller dimension, turn TRAK into an actually tractable estimator of datamodels rather than just a theoretical construct.
And what's interesting is that after releasing TRAK and doing a bunch of experiments, we've actually found that even though we originally did these random projections as a time- and space-saving, cost-cutting measure, it turns out that the random projections themselves are doing something very non-trivial and quite important for TRAK.
And so understanding that exactly would be very nice.
Can you expand on that?
Yeah, so generally when you're doing random projections of different vectors, you think of it as a lossy compression scheme, basically.
What we wanted was this influence function computed on the gradient vectors.
We can't do that because the gradient vectors are too big.
So let's instead randomly project these gradient vectors down to a manageable dimension, and then let's do the influence function on those.
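To make the "lossy compression" intuition concrete, here is a small numpy check (an assumed setup, not an experiment from the paper) that random projection approximately preserves the gradient inner products that the influence function is built from:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, n = 20_000, 256, 10        # original dim, projected dim, number of gradients (illustrative)

G = rng.normal(size=(n, p))      # stand-in per-example gradient vectors
P = rng.normal(size=(p, k)) / np.sqrt(k)
G_proj = G @ P                   # randomly projected gradients

# Johnson-Lindenstrauss-style check: inner products, normalized by the gradient
# norms, are approximately preserved under the projection.
orig = G @ G.T
proj = G_proj @ G_proj.T
norms = np.linalg.norm(G, axis=1)
err = np.abs(proj - orig) / np.outer(norms, norms)
print(err.max())                 # small; shrinks roughly like 1/sqrt(k)
```

The normalized error shrinks roughly like one over the square root of the projection dimension, which is why a fairly modest projection dimension can already be workable.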