Andrew Ilyas
Yeah, absolutely.
I think, you know, machine teaching, I've heard data set distillation as well.
Oh, yeah.
And, you know, coreset finding, I think, is another name.
But I think there are a variety of these applications where what we really care about is like narrowing down the data or like preserving the information that's present in the data set while cutting down on the number of data points.
And so I think it's a really interesting complementary goal to data modeling, or almost one that could benefit from data models being used.
And I'm sure we'll talk about this later, but I think that where I'm most excited about things like data models being used is in tasks where you can write down an optimization function in terms of your training data.
So if you can really write down like, you know, I want to minimize over my training set some function of model predictions, anything that looks like that, I think, you know, just plugging in what something like data models predicts will happen can be a really powerful primitive.
And so I'd be excited to see what happens for machine teaching, for example.
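To make that concrete, here is a minimal sketch of the kind of primitive being described, assuming a linear datamodel has already been fit (the fitting procedure comes up below). Everything here is hypothetical: `theta` stands in for fitted datamodel weights, and maximizing the predicted output over size-k subsets is just one simple instance of an objective written in terms of the training data.

```python
import numpy as np

# Hypothetical fitted linear datamodel for a single target example:
# theta[i] estimates how much including training point i changes the
# model's output on that target; theta0 is the intercept.
rng = np.random.default_rng(0)
n_train = 1_000
theta = rng.normal(size=n_train)  # stand-in for real fitted weights
theta0 = 0.5

def predicted_output(mask: np.ndarray) -> float:
    """Datamodel prediction for training on the subset encoded by a 0/1 mask."""
    return float(theta @ mask + theta0)

# With a linear surrogate, "choose k training points to maximize the
# predicted output on the target" is solved exactly by taking the
# top-k weights -- no retraining required.
k = 100
chosen = np.argsort(theta)[-k:]
mask = np.zeros(n_train)
mask[chosen] = 1.0
print(f"predicted output on target: {predicted_output(mask):.3f}")
```

For something like machine teaching or coreset selection, you would plug in whatever function of the predictions you care about; the point is that the datamodel makes that function cheap to query without retraining.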
Yeah, I would say it works surprisingly well.
It's obviously not perfect, but across a bunch of data sets we've tried, what you can look at to evaluate data models is basically the correlation between what your data model says is going to happen and reality, which you can access by just training the model yourself on that data set.
And so there are two sort of interesting ways of evaluating these.
One is that we fit the data models by sampling a bunch of random data sets, training a model on each one, and then fitting basically a linear regression from the 0/1 encoding of the data set we trained on to the model's output on a specific target.
It doesn't have to be a linear regression, but in our paper we did a linear regression. And the nice thing is that this is just a linear regression problem: you can sample a bunch of new subsets, see what your data model predicts, measure reality, and compute the correlation. So, as I was saying, there are two interesting ways of evaluating this. One is an on-distribution evaluation, where you sample data sets from the same distribution of data sets that you used to fit the parameters.
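As a rough end-to-end sketch of what's being described (sample random subsets, get the model's output on a fixed target for each, regress those outputs on the 0/1 subset encodings, then check correlation on fresh subsets): the expensive per-subset training step is replaced here by a synthetic `train_and_eval` stand-in so the sketch runs, and plain Pearson correlation is used where the conversation just says "correlation".

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_train = 1_000   # size of the full training set
m = 5_000         # number of random subsets used to fit the datamodel
alpha = 0.5       # inclusion probability for each training point

# Hypothetical stand-in for the expensive step: train a model on the
# subset encoded by `mask` and return its output on one fixed target
# example. A synthetic linear-plus-noise ground truth fakes it here so
# the sketch runs end to end.
true_effect = rng.normal(size=n_train) / np.sqrt(n_train)

def train_and_eval(mask: np.ndarray) -> float:
    return float(mask @ true_effect + rng.normal(scale=0.1))

# Fit: a linear regression from 0/1 subset encodings to model outputs.
masks = (rng.random((m, n_train)) < alpha).astype(float)
outputs = np.array([train_and_eval(mask) for mask in masks])
datamodel = LinearRegression().fit(masks, outputs)

# On-distribution evaluation: sample fresh subsets from the same
# distribution and compare datamodel predictions against "reality".
test_masks = (rng.random((200, n_train)) < alpha).astype(float)
predictions = datamodel.predict(test_masks)
reality = np.array([train_and_eval(mask) for mask in test_masks])
print("held-out correlation:", np.corrcoef(predictions, reality)[0, 1])
```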
And there, data models do amazingly well, way better than we ever would have thought.
On CIFAR-10, for example, the correlation between data model predictions and reality is 0.9 or something like that, and the relationship is strikingly linear, I think.
Then you can go to non-random data sets, because I think the more interesting question is: does this work once you leave the distribution of data sets that you sampled from?
The answer is yes, but slightly worse.
So the correlation degrades a little bit.
But trend-wise it holds up; we tried a bunch of different ways of breaking the data model.