Alex Reisner
Podcast Appearances
They probably have the strongest argument that their data could be used for other purposes.
But when you go back, they've been cited by over 10,000 papers.
I didn't read all 10,000, but I read a lot of them.
A lot of it is stuff that people wouldn't mind as much as with generative AI.
Common Crawl, I think without Common Crawl,
you know, AI translation tools might not be as good as they are.
I think it was really a huge help because they scraped web pages, the same page in multiple languages, and people were able to train translation models based on that.
But, you know, there is still what I would call a data laundering network that the AI companies are relying on.
They'll do a collaboration with a university, and they'll have the university download, you know, millions of images to train a model, or download millions of articles to train a model, and the AI company can say, like, well, we didn't do it, this was an academic thing. The same goes for nonprofits: Common Crawl is not the only one that's doing a lot of this scraping for the AI industry. One of the data sets I reported on in the article about music
training data is from this organization based in Europe called LAION.
They have a data set of 12 million songs from YouTube.
So anyway, this is like, is it academic?
Technically, yeah, there's universities and nonprofits, but they're all receiving money from the AI industry.
YouTube is an extremely common source.
I think one reason is there are tools for downloading from YouTube that work really well.