
Alex Reisner

Speaker
153 total appearances
Podcast Appearances

The Vergecast
How to train your data

If you go back and read those early OpenAI papers, um, everyone was training on Common Crawl.

And at first, the models were terrible, because if you train a model on the whole internet, you get all the junk that people say on the internet along with the intelligent things.

Yeah.

Um, mostly junk.

Statistically, mostly junk.

Statistically, yeah, mostly junk.

And I think the early large language models were proof of that.

Um,

But yeah, I think Common Crawl is a nonprofit, so they would argue it's not a big business.

They do get a lot of money from AI companies and AI investors.

But yeah, I think the topic of training data selection, the challenge of selecting the right data for a model is still really hard.

The AI companies, I would say, still have a very primitive understanding of what data will make their model better.

It's an area of research that I think even they at this stage are not very good at.

They do it mainly by trial and error, as far as I can tell.

Interesting.

Again, so the reason you're asking why do these datasets exist, I think it's people trying to share what they've learned from curating datasets in different ways and training models with them.

Yeah, I think I agree with most of that.

I do think that I'm not sure how much these datasets were really collected for other purposes.

Common Crawl likes to talk about, I think they're, they're a one case, um,