Alex Reisner

It's an extremely labor-intensive process. In one sense, you just go and download all of Library Genesis or all of Anna's Archive.

The Vergecast

How to train your data

That's one sort of naive thing you can do, but the companies realized pretty early on that they need to filter the stuff pretty carefully.

So the organization you just mentioned, Common Crawl, they've been crawling the web since the late 2000s, maybe 2009 or something like that.

And they just make the whole thing available.

Every month they've scraped a few hundred million more web pages.

And it's available to anyone who wants to do any kind of research with it.

In fact, it's mostly AI researchers who are using it.

But all the early large language models were trained on Common Crawl.
