Mark Graham
Thousands of local news organizations have shut down in the United States over the last 10 or 15 years, for example. News organizations, media organizations are shut down by governments when they go out of favor. When the failed coup happened in Turkey a few years ago, Wikipedia documented that about 150 media organizations were shut down.
We have a collection of four websites, four news sites from Hong Kong, for example, that were shut down for political reasons. Apple Daily was one. In all of those cases, we have really good archives of that material. We have, for example, a full-text searchable index of about a million pages from Gawker.
And for those four news organizations from Hong Kong that I mentioned, we have built a full-text index of the articles from those news sites. But there are many, many other reasons why a given site may go away. Maybe the hard drives that the website was running on crashed. Or maybe there was just a change in the content management system.
And when the upgrade was done, the people doing the engineering behind it didn't put in the redirects. And so all those old parts of the site are no longer available. I used to work for NBC News. And I mean, we had more than 100 websites that we were running at one point. And when we were doing upgrades, the last thing we'd be thinking about is the old stuff.
It'd all be like, well, how do we meet the deadline to get the new stuff out?
Many of those conditions are still with us. They're not fundamentally changing. For those reasons, stuff is still going to atrophy. Also, as the web gets older, the older stuff gets older too. People die. The legacy of an individual's efforts then often falls on the heirs or their friends. I can't tell you.
Literally every day here at the Internet Archive, we get communications, principally by email or DMs or things like that, from people saying, "Hey, my husband, or this organization I worked with, the person has passed away and we're going to shut down the website. We want to make sure that it's preserved." Often we will have already done that. Here's a recent case.
MTV News was shut down and people said, oh, you know, what did you do? Did you have to jump into action and archive it? It's like, no, no. Our work was done. We had been archiving. I mean, if that was what we had to do, then we would have failed because it's too late, right? Our work had been done over the decades.
We call it the Wayback Machine as if it's like a computer that's sitting on somebody's desk. It's actually a whole network of literally hundreds of nodes, part of the overall Internet Archive infrastructure of thousands of nodes: more than 100 petabytes of material, growing at the rate of more than 60 terabytes a day.
It's a combination of applications that do what's referred to as crawling, which is a process of looking at a URL, looking at a webpage, and then looking at all of the other links, all of the other URLs on that page, and then going to them and then looking at them and then going on and on and on, crawling the web like a spider, metaphorically.
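The crawling loop described here can be sketched in a few lines. This is a minimal illustration, not the Internet Archive's crawler: the `crawl` function, the `LinkExtractor` class, and the three-page `TOY_WEB` dictionary are all hypothetical, and an in-memory dictionary stands in for real HTTP fetches so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    seen = {seed}           # URLs already queued, so each is visited once
    queue = deque([seed])   # frontier of URLs still to visit
    archived = {}           # url -> page content we "archived"
    while queue and len(archived) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:    # page is gone or unreachable
            continue
        archived[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return archived


# Hypothetical three-page site standing in for the live web.
TOY_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">gone</a>',
}

pages = crawl("http://example.test/", TOY_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the spider from walking in circles when pages link back to each other, and the link to `/missing` shows that a dead URL is simply skipped.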
So it's a combination of this crawling and archiving process, as well as the aggregation of all of those archived resources with indexes that makes those discoverable. And then they can be recompiled into web pages. And then patrons, millions of patrons a day come to our sites and they request resources that we have.
Maybe it's a digitized version of a book from archive.org, or maybe it's an archived web page from the Wayback Machine. And then we will present that to them in their browser.
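One way to picture the index that makes archived resources discoverable is a lookup from a URL and a requested time to the nearest stored capture. The sketch below is an assumption for illustration only: the `CaptureIndex` class, its method names, the integer timestamps, and the storage location strings are all hypothetical and do not describe the Wayback Machine's actual index format.

```python
from bisect import bisect_left, insort


class CaptureIndex:
    """Toy index: url -> sorted list of (timestamp, storage_location)."""

    def __init__(self):
        self._by_url = {}

    def add(self, url, timestamp, location):
        # Keep each URL's captures sorted by timestamp.
        insort(self._by_url.setdefault(url, []), (timestamp, location))

    def closest(self, url, timestamp):
        """Return the capture nearest in time to `timestamp`, or None."""
        captures = self._by_url.get(url)
        if not captures:
            return None
        i = bisect_left(captures, (timestamp, ""))
        # Only the neighbors on either side of position i can be nearest.
        candidates = captures[max(i - 1, 0): i + 1]
        return min(candidates, key=lambda c: abs(c[0] - timestamp))


index = CaptureIndex()
index.add("http://example.test/", 100, "node-a/file-1")
index.add("http://example.test/", 400, "node-b/file-7")
index.add("http://example.test/", 200, "node-a/file-3")

print(index.closest("http://example.test/", 290))
print(index.closest("http://example.test/unknown", 290))
```

Keeping the captures sorted means each lookup is a binary search rather than a scan, which matters when a single popular URL has been captured many times.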
More than that, yeah. Actually, it's something like more than a billion URLs every single day, and that can get pretty quick. It could be like 20,000 URLs a second coming into our servers. So think of a database that you're writing to 20,000 times a second and you're reading from 5,000 times a second. That's one view into what the Wayback Machine is.
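To relate the daily and per-second figures quoted here, a quick back-of-the-envelope calculation helps. It treats "a billion URLs a day" and "20,000 a second" as round numbers; the variable names are only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

urls_per_day = 1_000_000_000    # "more than a billion URLs every single day"
burst_per_second = 20_000       # "20,000 URLs a second"

# Average rate if a billion URLs were spread evenly over a day.
average_per_second = urls_per_day / SECONDS_PER_DAY

# Daily total if the burst rate were sustained around the clock.
sustained_per_day = burst_per_second * SECONDS_PER_DAY

print(f"average: {average_per_second:,.0f} URLs/second")  # average: 11,574 URLs/second
print(f"sustained burst: {sustained_per_day:,} URLs/day")  # sustained burst: 1,728,000,000 URLs/day
```

So a billion a day averages out to roughly 11,600 URLs a second, and 20,000 a second reads naturally as a burst rate above that average.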
Yes, the heading purchase is always with Seagate and others. We buy a lot of hard drives.
The primary storage medium is spinning disk. I think today we're using 20 terabyte drives. When we started, they were much smaller, of course. Actually, the very, very, very first version of the Wayback Machine, going back almost like 24, 25 years ago, I think we used a tape machine for a little while. But very quickly, our founder, Brewster Kahle, decided that he really wanted the material that we have to be as accessible as possible to people, so that when people wanted something, it wasn't like, oh, we have to go back to the stacks and then find it and then get it. He wanted things to be as immediately available as possible. So spinning disk has been the primary format.
And of course, yes, we use a lot of SSDs and a lot of NVMe drives and other kinds of memory devices, primarily for indexes and caches and things like that.
So first of all, we own and operate our own data centers. They are physically distributed. So when we write something, we're actually writing it to more than one location for physical reliability. It's north of six hard drives a day.
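The "six hard drives a day" figure lines up with numbers mentioned elsewhere in the conversation, under one assumption. The growth rate (more than 60 terabytes a day) and the drive size (20 terabytes) are quoted; the copy count of two is an assumption read from "more than one location", and filesystem overhead, spares, and failed-drive replacements are ignored. The function name is hypothetical.

```python
import math


def drives_needed_per_day(daily_growth_tb, drive_capacity_tb, copies):
    """Whole drives needed per day to hold `copies` copies of new data."""
    drives_per_copy = math.ceil(daily_growth_tb / drive_capacity_tb)
    return drives_per_copy * copies


# 60 TB/day and 20 TB drives are quoted figures; copies=2 is an assumption.
print(drives_needed_per_day(60, 20, copies=2))  # 6
```

Since both the growth rate and the copy count are stated as lower bounds, this gives a floor, which fits "north of six".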
I doubt that. Caves have their own challenges. We're looking at some interesting things. Some of it is in an abandoned coal mine in Norway. We participated with GitHub a few years ago in something called the Arctic GitHub Repository. And we are looking at some more exotic recording formats for some special-purpose applications.
But frankly, we think that hard drives are going to be the primary medium that we use for some time into the future. We're constantly evaluating options, but it's a kind of a tried and true and reliable format and process. We know how to handle them. We put them into machines that we rack ourselves and they've been serving us well.