Alex Reisner
Podcast Appearances
If you go back and read those early OpenAI papers, everyone was training on Common Crawl.
And at first the models were terrible, because if you train a model on the whole internet, you get all the junk that people say on the internet along with the intelligent things.
Statistically, yeah, mostly junk.
And I think the early large language models were proof of that.
But yeah, I think Common Crawl is a nonprofit, so they would argue it's not a big business.
They do get a lot of money from AI companies and AI investors.
But yeah, I think the topic of training data selection, the challenge of selecting the right data for a model, is still really hard.
The AI companies, I would say, still have a very primitive understanding of what data will make their model better.
It's an area of research that I think even they at this stage are not very good at.
They do it mainly by trial and error, as far as I can tell.
Again, you're asking why these datasets exist. I think it's people trying to share what they've learned from curating datasets in different ways and training models with them.
Yeah, I think I agree with most of that.
But I'm not sure how much these datasets were really collected for other purposes.
Common Crawl likes to talk about, I think they're, they're a one case, um,