The Internet Archive records its 1 trillionth website

Feb 20, 2026

The Internet Archive, one of cyberspace’s most essential library projects, has achieved a feat that’s hard to even conceptualize. After nearly 30 years of painstaking work, the nonprofit has preserved its trillionth webpage. The milestone marks a major moment in the history of digital conservation efforts, especially at a time when the internet is both integral to everyday life and increasingly unreliable and difficult to navigate.

The internet has a lot of things going for it, but permanency has never been one of them. Digital content is inherently ephemeral, and typically lasts only as long as there is someone willing to maintain it. Case in point: In 2019, MySpace (once one of the internet’s most popular early social media websites) announced that a server migration error had accidentally erased everything users uploaded to the social and music website between 2003 and 2015. Overnight, an estimated 50 million songs from 14 million artists vanished.

It’s moments like those that the Internet Archive tries to avoid. The organization has sought to create a “permanent record of the internet’s evolution” since 1996, primarily through the use of web crawlers that preserve as many publicly accessible websites as they can find. Volunteers also contribute their own uploads, including print releases, hard-to-find music and audio, and other media formats. After almost three decades, the Internet Archive has secured more than 866 billion webpages, 41 million texts, and millions of other forms of digital content. All told, around 500 million new websites are added every day, and the collection totals an estimated 100,000 terabytes of information so far. That’s the same storage as maxing out 50,000 of the highest-tier iPhones currently on the market.

Although the Internet Archive remains indispensable to archivists, journalists, academic researchers, and simply curious visitors, it faces increasing pressures from a rapidly changing world wide web.
Tech companies racing to train their large language model AI systems are trawling the online landscape for new datasets to consume, often under extremely nebulous legal circumstances. As a result, many major media companies, including The New York Times, The Guardian, and USA Today/Gannett, are keeping their newer content away from the Archive in a bid to protect it from generative AI. That is understandable when there is no concrete framework in place to properly compensate these companies and their writers for their work, but it also makes it much harder to preserve what is arguably the most delicate information ecosystem in human history. Hopefully, all parties will come to an understanding so that the Archive will exist long enough to surpass its two trillionth preservation.

The post The Internet Archive records its 1 trillionth website appeared first on Popular Science.

https://www.popsci.com/technology/internet-archive-1-trillionth-website/
