Alex Reisner

👤 Speaker
153 total appearances
Voice ID

Voice Profile Active

This person's voice can be automatically recognized across podcast episodes using AI voice matching.

Voice samples: 1
Confidence: Medium

Podcast Appearances

The Vergecast
How to train your data

They probably have the strongest argument that their data could be used for other purposes.

But when you go back, they've been cited by over 10,000 papers.

I didn't read all 10,000, but I read a lot of them.

And they are mostly AI.

And it's early.

A lot of it is stuff that people wouldn't mind as much as with generative AI.

Common Crawl, I think without Common Crawl, you know, AI translation tools might not be as good as they are.

I think it was really a huge help because they scraped web pages, the same page in multiple languages, and people were able to train translation models based on that.

So that was helpful.

But the thing is, you know, there is still what I would call a data laundering network that the AI companies are still relying on.

They'll do a collaboration with a university, and they'll have the university download, you know, millions of images to train a model, or download millions of articles to train a model, and the AI company can say, like, "Well, we didn't do it. This was an academic thing."

And, you know, Common Crawl is not the only nonprofit that's doing a lot of this scraping for the AI industry. One of the data sets I reported on in the article about music training data is from this organization based in Europe called LAION.

They have a data set of 12 million songs from YouTube.

So anyway, this is like, is it academic?

Not really.

Technically, yeah, there's universities and nonprofits, but they're all receiving money from the AI industry.

Yeah, that's accurate.

YouTube is an extremely common source.

I think one reason is there are tools for downloading from YouTube that work really well.