Andrew Ilyas
We basically just did completely oblivious random sampling.
So, you know, we have however many GPUs, and we just sample a bunch of random datasets of fixed size and train in parallel across all of the GPUs.
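The sampling step described here can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline; the function and parameter names are made up for the example, and each mask would then be handed to one parallel training run.

```python
import numpy as np

def sample_subsets(n_train, subset_frac, n_subsets, seed=0):
    """Obliviously sample fixed-size random subsets of the training set.

    Returns a boolean matrix: one row per subset (i.e., per parallel
    training run), one column per training example. All names here are
    illustrative, not from any specific codebase.
    """
    rng = np.random.default_rng(seed)
    k = int(subset_frac * n_train)
    masks = np.zeros((n_subsets, n_train), dtype=bool)
    for i in range(n_subsets):
        # Choose k distinct training examples uniformly at random.
        idx = rng.choice(n_train, size=k, replace=False)
        masks[i, idx] = True
    return masks

# e.g. 8 parallel runs, each trained on a random half of 50k examples
masks = sample_subsets(n_train=50_000, subset_frac=0.5, n_subsets=8)
```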
Absolutely.
There's a little bit of, I would say, manual science there in the sense that we sort of had this intuition that, OK, if we're doing a linear regression, then probably we want something that ranges from minus infinity to infinity.
We thought about doing a logistic regression, but then we thought that's kind of throwing away some amount of information because 0 or 1 is a much noisier signal than your exact loss or something like that.
And so we ended up settling on what we call this correct class margin, which is just the logit of the correct class minus the next highest logit.
And I still wouldn't say it's clear that that's the optimal thing to track.
But we generally have these sort of principles that, OK, it should be something that ranges the whole real number line.
It should be something pretty dense with information, so not throwing away bits from the actual logits as much as possible.
And so using all of that stuff, we kind of settled on this correct class margin.
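The correct-class margin described above is straightforward to compute from a model's logits; here is a small sketch (the function name is mine, not from the paper):

```python
import numpy as np

def correct_class_margin(logits, label):
    """Correct-class margin: the logit of the true class minus the
    highest logit among all other classes.

    Positive when the model classifies the example correctly, negative
    otherwise, and it ranges over the whole real line -- which is why
    it makes a reasonable linear-regression target, unlike a 0/1
    correctness signal.
    """
    correct = logits[label]
    others = np.delete(logits, label)
    return correct - others.max()

correct_class_margin(np.array([2.0, 0.5, -1.0]), label=0)  # 1.5
```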
Yeah, it's a great question.
I think the core thing to realize here is that when you do this data set level analysis, the thing you're actually studying is not a specific model, but it's actually a learning algorithm.
So you're not studying a specific ResNet.
You're really studying the process of training a model with a ResNet with SGD, which produces a whole class of models, basically.
And whatever features you extract or whatever analysis you do is really in aggregate over that whole class of models.
And I think the distinction becomes important for certain questions: it makes it harder to apply this kind of analysis when you care about one specific model.
Yeah, so I think in our original paper, we had a couple of what I now think are slightly more toy applications.
It works, for example, as a pretty good similarity metric between examples.
We used it to find train test leakage and things like that.
We used it actually to do a deep learning analog of, there's this really nice paper out of MIT a few years ago where they studied, it was a statistics paper, but they studied a bunch of