Andrew Ilyas
And so I think that was really cool, both because it's this very distilled model of bias in the data collection process that doesn't have any of the complexity of deep learning, and because it allows you to actually prove stuff.
And how did it work?
The algorithm itself is quite simple.
You can actually just write down a loss function that captures this self-selection mechanism.
What we showed is that if you minimize this alternative loss function instead of the standard one, you'll recover the actual ground-truth parameters of what matters for hunting and fishing, in my example.
And then the technical challenges come in proving that you can actually take gradients of this loss function, that gradient descent will actually converge, and things like that.
We assume we're doing linear regression and we have this like very well-defined sort of structural model that we're sampling from.
So we have very strong assumptions on what the generative process of the data is.
We assume there's no model misspecification or anything like that.
We're really just trying to see whether we can efficiently recover the parameters underlying this generative process that we've assumed.
And so, obviously, there's tons of work to be done along that axis: robustness to model misspecification, less restrictive settings, more complex models.
But I think it's a really interesting sort of first stab at this question of whether we can efficiently recover the truth when the data we're collecting from people is biased, for whatever reason.
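As a rough illustration of the kind of self-selection-aware loss being described here, a minimal sketch, assuming Gaussian noise, that each person's chosen activity (hunting vs. fishing) is observed, and that we fit by generic numerical optimization. The setup, variable names, and use of scipy are hypothetical, not the actual method from the paper:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def self_selection_loss(flat_w, X, y, idx, k, sigma=1.0):
    """Negative log-likelihood under self-selection (illustrative sketch).

    Person i picks activity idx[i] because it gives them the largest outcome,
    and we only observe that outcome y[i]. With Gaussian noise, the likelihood
    of seeing y[i] from activity j while every other activity would have
    scored no higher is
        phi((y - w_j.x)/sigma) * prod_{m != j} Phi((y - w_m.x)/sigma).
    """
    W = flat_w.reshape(k, -1)                    # one weight vector per activity
    z = (y[:, None] - X @ W.T) / sigma           # standardized residual under each activity
    rows = np.arange(len(y))
    ll = norm.logpdf(z[rows, idx])               # density of the chosen activity's outcome
    logcdf = norm.logcdf(z)
    logcdf[rows, idx] = 0.0                      # exclude the chosen activity from the product
    ll += logcdf.sum(axis=1)                     # all other activities scored at most y
    return -ll.mean()

# Toy data: two "activities" with different true weight vectors.
rng = np.random.default_rng(0)
n, d, k = 5000, 3, 2
W_true = rng.normal(size=(k, d))
X = rng.normal(size=(n, d))
scores = X @ W_true.T + rng.normal(size=(n, k))  # noisy outcome of each activity
idx = scores.argmax(axis=1)                      # each person self-selects their best activity
y = scores.max(axis=1)                           # and we only observe that outcome

res = minimize(self_selection_loss, rng.normal(size=k * d),
               args=(X, y, idx, k), method="L-BFGS-B")
print(res.x.reshape(k, d))                       # should approach W_true
```

Fitting ordinary least squares separately within each group would be biased here, since each group's observations are selected on having the largest outcome; the corrected objective accounts for that selection.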
Yeah, especially outside of the language model setting, in more classical machine learning settings where you're, say, a bank or something and you're collecting people's data.
Yeah, so I would say the biggest limitations right now are exactly this: you don't really know what's going to happen when the model is misspecified, or when your model of how people are self-selecting is misspecified.
So if you're a bank and you assume that someone's going to strategize in this way or that way, and you try to account for it in your machine learning algorithm, then if your conception of how they report their data is wrong, it's unclear what parameters you're actually going to end up with.
Yeah, absolutely.
I think in general, this is exactly why we had to study this very restrictive sort of data generating process, where really your only goal is to recover the true parameters of this data generating process.
I think when you get to sort of this machine learning regime, we don't really have a data generating process anymore, or not one that we can write down at least.
And so it becomes a much trickier question of like, okay, we're going to do some accounting for self-selection or something like that, but we also want to account for