Andrew Ilyas
I think there's a data collection step, which is really very modality dependent.
But you can think of it as where are you even getting data from?
So we have to choose, for example, to get ImageNet from Flickr.
Or for language data, we have to choose what kind of crawler are we going to build?
Are we going to respect robots.txt?
Are we going to try to generate synthetic data?
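As an illustration of the crawler decision mentioned here, the following is a minimal sketch of how a crawler might check robots.txt before fetching a page. It uses Python's standard `urllib.robotparser`; the robots.txt content, the `example-crawler` user agent, the example.com URLs, and the helper name `allowed_to_fetch` are all hypothetical and not from the conversation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_to_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/page",
                "https://example.com/private/page"):
        print(url, allowed_to_fetch(ROBOTS_TXT, "example-crawler", url))
```

A real crawler would download each site's robots.txt rather than use a hard-coded string; whether to honor it at all is exactly the kind of collection-time choice being described.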
And so I call that like the data collection process almost.
And then once you've done that, you sort of have to decide how you're going to package this into a data set.
And so for image classification, that's a question of like, well, which data are you going to include and how are you going to label it?
For language modeling, you also have to decide which data you're going to include, what sort of data cleaning you're going to do and stuff like that.
And then after you've done that, you have a data set, and that data set gets fed into a learning algorithm.
And you can think of a learning algorithm as anything that takes in a data set and outputs a machine learning model.
And so that encapsulates everything, like hyperparameters, hyperparameter tuning, architecture, architecture search, whatever you're doing to get your machine learning model is included in that learning algorithm.
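One way to picture this definition is as a function type: a learning algorithm maps a data set to a model, and anything like hyperparameter search happens inside that function. The sketch below is a toy illustration under that reading; the type aliases and the `threshold_learner` function are hypothetical and not something described in the conversation.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, int]              # (feature, binary label)
Dataset = Sequence[Example]
Model = Callable[[float], int]           # maps a feature to a predicted label
LearningAlgorithm = Callable[[Dataset], Model]


def threshold_learner(dataset: Dataset) -> Model:
    """Toy learning algorithm: data set in, model out.

    The search over candidate thresholds (a stand-in for hyperparameter
    tuning) is part of the learning algorithm itself.
    Assumes a non-empty data set.
    """
    candidates = sorted({x for x, _ in dataset})

    def num_correct(threshold: float) -> int:
        # Count training examples the rule "predict 1 if x >= threshold" gets right.
        return sum(int(x >= threshold) == y for x, y in dataset)

    best = max(candidates, key=num_correct)
    return lambda x: int(x >= best)


if __name__ == "__main__":
    train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
    model = threshold_learner(train)
    print(model(0.2), model(0.7))
```

Under this view, changing the architecture, the tuning procedure, or the optimizer all amount to swapping in a different function of the same type.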
And then even after you've done that, there's this step that I think, you know, maybe not so much anymore, but often used to get overlooked, which is how you actually deploy this machine learning system, the machine learning model.
I think, you know, increasingly machine learning models are deployed into contexts where they're dealing with humans.
And so you inevitably run into questions like, you know, what are adversaries going to be able to do with this model, but also what are normal behaving users going to do with this model?
And I think there are tons of interesting and funny failure modes of not thinking about the last step in this pipeline over the last couple of years.
Yeah, absolutely.
And I mean, also the system itself is changing and getting better and you're constantly upgrading or making your model bigger.