Alex Reisner
Podcast Appearances
But then even at the companies, you know, the AI world is a little bit like academia in that it is good for your career to publish papers.
And the companies don't like that, but they also acknowledge that they have to let the employees publish something.
And so the lawyers will go over it and tell them what they can and can't say.
And over time, they've clamped down more.
And so the companies are revealing less through the research papers.
But yeah, a lot of my research is just from reading a Google paper.
For example, for this last article, they said we trained on tens of millions of songs.
The research papers read very differently now than they did a few years ago.
Mainly, well, I mean, for AI training, why else would it exist?
It's an extremely labor-intensive process. In one sense, you just go and download all of Library Genesis or all of Anna's Archive.
That's one sort of naive thing you can do, but the companies realized pretty early on that they need to filter the stuff pretty carefully.
So the organization you just mentioned, Common Crawl, yeah, they've been crawling the web since the late 2000s, maybe 2009 or something like that.
And they just make the whole thing available.
Every month they've scraped a few hundred million more web pages.
And it's available to anyone who wants to do any kind of research with it.
In fact, it's mostly AI researchers who are using it.
Large language models were trained on Common Crawl.