Alex Reisner

👤 Speaker

153 total appearances


Podcast Appearances

The Vergecast

How to train your data

If you go back and read those early OpenAI papers, everyone was training on Common Crawl.

700.489

And at first the models were terrible, because if you train a model on the whole internet, it says all the junk that people say on the internet along with the intelligent things.

705.596

Yeah.

717.333

Um, mostly junk.

718.034

Statistically, mostly junk.

720.237

Statistically, yeah, mostly junk.

721.419

And I think the early large language models were proof of that.

723.482

Um,

726.867

But yeah, I think Common Crawl is a nonprofit, so they would argue it's not a big business.

728.052

They do get a lot of money from AI companies and AI investors.

735.462

But yeah, I think the topic of training data selection, the challenge of selecting the right data for a model is still really hard.

741.07

The AI companies, I would say, still have a very primitive understanding of

751.144

what data will make their model better.

757.786

It's an area of research that I think even they at this stage are not very good at.

762.914

They do it mainly by trial and error, as far as I can tell.

768.542

Interesting.

773.71

Again, you're asking why these datasets exist. I think it's people trying to share what they've learned from curating datasets in different ways and training models with them.

774.812

Yeah, I think I agree with most of that.

888.837

I'm not sure how much these datasets were really collected for other purposes.

891.782

Common Crawl likes to talk about, I think they're, they're a one case, um,