Alex Reisner

👀 Speaker
153 total appearances

Podcast Appearances

The Vergecast
How to train your data

But then even at the companies, you know, AI, the AI world is a little bit like academia in that it is good for your career to publish papers.

And the companies don't like that, but they also acknowledge that they have to let the employees publish something.

And so the lawyers will go over it and tell them what they can and can't say.

And over time, they've clamped down more.

And so the companies are revealing less through the research papers.

But yeah, a lot of my research is just from reading a Google paper.

For example, for this last article, they said we trained on tens of millions of songs.

Yeah, it's not 2021 anymore.

The research papers read very differently now than they did a few years ago.

Mainly, well, I mean, for AI training, why else would it exist?

It's a huge, you know, it's an extremely labor-intensive process to find, you know, in one sense, you just go and download all of Library Genesis or all of Anna's Archive.

That's one sort of naive thing you can do, but the companies realized pretty early on that they need to filter the stuff pretty carefully.

So the organization you just mentioned, Common Crawl, yeah, they've been crawling the web since the late 2000s, maybe 2009 or something like that.

And they just make the whole thing available.

Every month there's a new, they've scraped a few more hundred million web pages.

And it's available to anyone who wants to do any kind of research with it.

In fact, it's mostly AI researchers who are using it.

But all the early large language models were trained on Common Crawl.