I think a bigger problem than 38% of webpages being dead is that many entities, groups, and businesses now use Facebook pages almost exclusively and have no other web presence outside of Facebook. In other words, a Facebook account becomes a requirement to interact with them.
The same happened with forums. They're all subreddits, Facebook groups or Discord chats now. A lot of valuable information is kept hidden in those groups now, and it makes me really sad.
I love forums. I've kept the DIY Book Scanner forum online since... 2009? Recently (the last two years) these damn AI scrapers have killed phpBB over and over again. They got me kicked off my shared web hosting plan by abusing search and other forum features.
I upgraded to a VPS for $500. The other admin spent 15-20 hours fixing, troubleshooting, and transferring. And you know what? At the end of all this, I paid to give my data to these jerks, to keep it online for them to harvest. The forums are dead quiet.
Now I figure Discord is fine: they'll just sell the data to AI companies directly, so the burden won't fall on me.
Reddit at least shows up in searches. I also think it's important not to look at the past with rose colored glasses. I think some random forum is much more likely to disappear than a subreddit.
I don’t think it’s rose colored glasses. Google saw the value of forums as a source of information when it bought and indexed Deja News’ Usenet archive. A lot of pop and early Internet culture resided there. This was then turned into Google Groups, underfunded, targeted at businesses, and more or less buried.
Independent forums (phpBB and the like) often came up in searches before these communities moved to Facebook Groups, where they’re mostly set to private due to spammers.
Similarly, there was a time when Google indexed tweets more or less live, so you could find information on very recent events. I think Twitter asked for money, and that was the end of that.
Now I think Reddit, and maybe Stack Overflow, are the only things helping Google be anything more than an extremely hostile version of the yellow pages. I fear Reddit might at some point withdraw their content from Google and that’ll be the end of it.
Unfortunately, keeping things up to date, secure, and free of spam is a lot of effort. It's very compelling to take your content to where the eyeballs already are, especially when you can let someone else take care of the hard parts for “free”.
Car forums are still alive, but yeah, the shift from thread discussions to comment and/or video discussions really kills a lot of knowledge. It’s great to find old forum posts showing you how to work on your car. It’s tiresome to skip through videos to find what you need, or even to search Reddit.
The big thing about Discord is that you can chat with people now, but the knowledge is not in a good format to come back to later.
Sports forums too. I find that user stickiness there is pretty good because the Reddit sports subs are simply too large, and have an Eternal September/zero-friction problem, as joining it is a click away.
However, if you managed to find the forum on a search engine and took the trouble to sign up for an account, you are more likely to abide by the general vibe of the place, rather than Redditize it with shallow, meme-y comments that reliably get a lot of upvotes.
I think the challenge is finding these rare, valuable places in a sea of noise. Gone are the days that you'd stumble on them, with Google et al keeping you blasting down freeways with no chance of turning into a quiet cul-de-sac where you might see the perfect home for sale, or at least rent it for a while. I still find value and joy on the internet, but it's much rarer and typically the hold-overs from a previous era, or tenuous things like following a handful of YTers while ad blocking still works.
I'm probably in the minority of people who appreciate that trend. Valuable information being hidden means that community comes before information. If you want to gain access to other people's knowledge you have to opt-in and interact with and understand the people who made it, and that creates an incentive to contribute back and use knowledge in an appropriate way.
The open internet seems increasingly predatory: a place where some gigantic ML company just vacuums up your stuff or resells your content for ad revenue. Parasitic.
I don't mind that, and I honestly think it's a natural reaction that people guard their information. It's sort of like a medieval monastery version of the internet, where people recognize that information is cultivated rather than just some commodity you scrape off the web.
The reason why platforms such as Discord, Facebook, etc don't give open access to your content isn't because they don't want a predatory ML company to vacuum up your content, rather it's because they want to sell access to your content to the predatory ML companies. Meta already trains its own AI on Facebook and Instagram user content. Discord's business model (host an unlimited amount of data for free in perpetuity) is inherently unsustainable, so it's likely they'll start selling user data to ML companies within a few years (assuming the AI boom lasts that long).
That's a really human-centered view of what we have in forums.
I think many of us (wrongly) have a tech-centered view of online communities -- witness the multitude of "Show HN" posts that are "look! I made an online place for humans to congregate and discuss X".
The tech stack matters little (if at all), but bootstrapping the community's trust and culture (and maintaining both) are most of the heavy lifting, and the differentiator for success.
I recently (re)discovered an HN post by one of the core community moderators of deviantArt, and its success was made possible by its culture:
AI will make all that even worse.
Data staying hidden behind nice UX is VERY bad news.
Maybe all that will lead to an equivalent of open source, but for data.
Not only are a lot of communities hidden because of Discord (at least with Reddit they were more discoverable); the worst part is that they are unsearchable or behind a paywall.
Like, the "join my Discord if you pay at least $3/mo!" thing is pretty innocent, but you are gatekeeping a community that used to be public.
If we're talking about something like a content creator focused on a hobby, or PC problems, you can see how Google will become even more useless.
Reddit was the least bad choice between it and Discord, but it has failed at its "I want to be a social network" ambitions.
Even the idea of payment to access a community is just absurd. If I'm an integral part of a tight-knit community I can see myself participating in common expenses, but I would never pay for access. At that point you're just a consumer buying a service.
>Even the idea of payment to access a community is just absurd.
Is it?
If we put aside the common notion that "everything on the internet is and should be free-as-in-beer and fuck you if you disagree", is it really that absurd?
Communities more often than not prefer setting up some kind of filtering to weed out certain people, and a paywall is one of those filters.
I only use Facebook to stay in touch with widely dispersed family members. Nothing else. One peek a day to see what's up. Assuming you have an account, I find this makes the task much easier:
And Meta keeps things endlessly? Not just a hyper-compressed picture and a set of references to local files? That part of the siloed web vanishes too, just less dangly and obvious.
Are there any businesses of any notable size that are using Facebook alone? Local businesses near me have plenty of info on Google Maps. Their website, if they have one, is usually out of date, but calling them directly answers my questions.
Also, that was 38% of a web filled with diversity, no hidden agendas, and amateurs (in the best sense of the word). That number is probably now .00001% of a much bigger, far more homogeneous web. A web 1.0 site > today's walled-garden "group page".
I've been to restaurants where they only have the menu in digital and uploaded to FB. And they looked at me as if I was a weirdo when I told them I don't use FB.
Many times I've recommended that my clients use Facebook instead of their own websites, which were overkill. Often, having your own website is a waste of money.
You used to be able to see a custom feed of a selected friend list, but since they removed that option the site has been completely unusable, unless perhaps you do something like remove 90% of your "friends" and groups, but that would hurt usability in different ways.
Ooooh, so that's what happened. I just recently restarted using my Facebook account after about 6 years of not really using it. I found it odd that I was only scrolling past Android game ads, some DIY videos and some rage-motivated generic posts from accounts I don't know...
I liked the Facebook that showed me the humblebrag posts of my friends/connections more (and I'm not being sarcastic).
It was very usable with custom lists up until recently. Their help pages still reference the ability to browse updates from custom friend lists (at least when I checked a couple of weeks ago), but the actual feature has been dead for a while now. Guess they didn't like that people were able to minimize pointless engagement and doomscrolling.
It really is. I’m following two groups and a handful of people. I never see posts from any of them and it’s difficult for me as a software engineer to navigate the site.
> From a user perspective Facebook's feed is spam.
The topic is FB groups. They aren't spam, at least not the ones I'm a member of. Some groups may be quiet, some are active, but I don't recall coming across spam posts from any of them. One particular group has a rule that members can promote their business once a week, enforced by the group's admin.
Maybe it’s different where you are, but around here that filter would mean I could almost only patronize large chains. Small businesses have Facebook or maybe insta (which is much worse, Facebook business pages grant far more access to a non-logged-in user) and no website. Restaurants might have a barely-updated website (the updates are on Facebook) that links to some third party ordering service, maybe.
But it's clear that continuing to use Facebook in that manner will only strengthen the isolation effect. Voting with your wallet and going against the machine invariably involves some level of personal sacrifice. For me, sacrificing patronage is incredibly easy to do. There is more to life than commercialization.
My girlfriend says she only uses Facebook to interface with small stores, who use it as a sole point of contact or distribution. Let that sink in for a moment. Breaking this cycle will require hard work.
I suspect Instagram or Facebook gets them 10x the eyeballs of having a website, at 1/20th the effort, zero cost, and nearly zero skill or expertise at anything tech-related.
I suspect both can be true at the same time. In the case of Instagram it still seems silly to miss out on potential customers by only posting on a non-public platform though.
Facebook definitely makes more sense to me. It only stops me if I try to go browse back through all their photos or something. I can look at posts and any of their… I dunno how Facebook works, but featured or whatever images, for menus or current sales or what have you, no problem. Insta stops me if I try to scroll past the first screenful of content, and doesn’t have as much info available outside of posts (most of which I can’t see).
I have avoided a place with only Insta, simply because I couldn’t see anything I needed to.
Many small businesses live on a shoe-string though and the cost of developing and maintaining a website is prohibitive.
Their self-administered Facebook page isn't anything to write home about and likely generates zero business, but it is free so long as they resist the temptation to boost posts and have an extra 3 people see them for only $36.
Some of the interactive stuff on old BBC election coverage still almost works to this day.
Hard to imagine that with many sites now, 20 years on. It's not even that it's impossible with the technology; it's probably closer to how writing got worse after the invention of the word processor. Everything is managed and structured now, so the freedom / bubble needed to make things good in a way that can't be easily explained is gone.
Be sure to donate some quid to the Internet Archive (archive.org) to support their efforts to preserve (not just) old content, then do your best to make local copies of anything you find of value, just in case they disappear one day. A good number of mostly technical pages I have in my bookmarks file, that grew steadily and has been moved during installations for over 20 years, now point to their latest complete backup before the said page went silent. The Internet Archive is a huge boon to everyone.
I realized I was overusing bookmarks. I now save webpages (perhaps as PDF) if they contain information I want to refer to later, such as an insightful article, technical information, a humorous bit, or the like.
Bookmarks are good only for links to things for which only the most current version is worth accessing. That’s my banking websites, a shopping site, my employer’s remote desktop system, etc.
There's also https://archivebox.io which can take your bookmarks and archive them in many ways. Unfortunately, back when I tried it, it was a bit buggy. I wish there were a better solution to build a nice archive of the sites I visit most often, just in case.
I save webpages as PDF because that retains the images and fonts of the original page. One issue I run into is that sticky headers/footers used on websites often obscure the top/bottom text of the page when exported to PDF. This can be addressed by using uBlock Origin to remove the sticky DOM elements before saving, but it's a bit of a hassle.
Others have recommended ArchiveBox; I will recommend using any bookmarking tool that fires off a web request to the Wayback Machine to archive a page when you create the bookmark.
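For anyone who wants to wire that up themselves, here's a minimal sketch in Python, assuming all you need is to fire a GET at the Wayback Machine's public Save Page Now endpoint (https://web.archive.org/save/<url>); the function names here are made up for the example:

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_page_now_url(url: str) -> str:
    """Build the Save Page Now capture URL for a bookmarked page."""
    # Percent-encode the target URL but keep the scheme/path separators intact.
    return SAVE_ENDPOINT + quote(url, safe=":/?&=")

def archive_bookmark(url: str, timeout: int = 30) -> int:
    """Ask the Wayback Machine to capture `url`; returns the HTTP status code."""
    req = Request(save_page_now_url(url),
                  headers={"User-Agent": "bookmark-archiver-sketch/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.status
```

The full Save Page Now service also has an authenticated API with more options (outlinks, screenshots), but for casual bookmarking a plain GET like the above is, as far as I know, enough.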
Not a noticeable amount in this age when even expensive SSD storage has multiple gigabytes. Even pages with multiple images just aren’t that big on a typical hard drive.
I wish the Internet Archive would split itself into two entities: one that simply archives web sites, and the other that does everything else (e.g., edgy IP testing of ebooks and video games). That way if the "other" entity gets sued into oblivion, the web sites remain. I think what the former is doing is a critical service for humankind, and I do donate, but I worry about their future.
I have run a news website since 2019. Every hour, I have a crawler look for dead links. I replace about one link a day with a link to archive.org. The funniest ones are the day after an election when all the candidate websites go blank. The saddest are the government websites that go offline from 3am to 5am every week.
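The replace-with-archive.org step can be sketched against the Wayback Machine's public availability API (https://archive.org/wayback/available). This is not the parent's actual crawler, just a hedged outline with invented function names:

```python
import json
from typing import Optional
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def is_dead(url: str, timeout: int = 10) -> bool:
    """Treat network errors and HTTP status >= 400 as a dead link."""
    try:
        req = Request(url, method="HEAD",
                      headers={"User-Agent": "linkcheck-sketch/0.1"})
        with urlopen(req, timeout=timeout) as resp:
            return resp.status >= 400
    except (HTTPError, URLError, OSError):
        return True

def availability_query(url: str) -> str:
    """Build the Wayback availability API query for a given URL."""
    return "https://archive.org/wayback/available?" + urlencode({"url": url})

def wayback_replacement(url: str, timeout: int = 10) -> Optional[str]:
    """Return the closest archived snapshot URL, or None if there is none."""
    with urlopen(availability_query(url), timeout=timeout) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None
```

An hourly job would then walk the site's links, call `is_dead`, and swap in `wayback_replacement` where one exists.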
I'm surprised it's not more. 2013 was long after the days of hobbyist websites of the early net, and into the time when most new sites were business driven. Given how long businesses last I'd expect many more sites to be long gone 11 years later. I guess maybe the death of a lot of community-building spaces (angelfire, Geocities, etc) probably counts for a lot of them going.
What would be particularly interesting would be to graph how long websites last. I suspect quite a lot of the content from the early days is still around, and this period (2008–2018) is the peak of sites vanishing.
I hope not all things last forever. A while back I stumbled upon my first .com, from the 90s, which was hosted on Angelfire and dutifully rehosted by archive.org and it went about how you'd imagine.
Despite being in 4th grade when my little friend and I made the webpage, things on there (while fine for the era) are just not okay by today's standards, even if I understand the context for what led to them being there. It was nothing terrible, just distasteful in the blissfully unaware way a 4th grader in the '90s would be. I realize that stuff will probably never be off my conscience and I just have to deal with it and hope nobody sees it.
I have similar material. If it's reassuring, we all were just kids/teens and learning of the world. I feel a lot for the youth after us that made the Internet more accessible and, at times, more permanent.
Everything on the internet is intrinsically ephemeral. Embrace that instead of fighting against it. If you want to archive stuff, make offline copies. PDF/A (especially the -1 and -2 versions) is a format explicitly designed for archiving and works well for static content.
I think it is a bit of a shame that mirroring is not more readily built into the web stack (= HTTP/HTML); if you could trivially make links that included a local copy (as a fallback?), this linkrot would be a far lesser concern. The way, for example, Wikipedia links everything through archive.org is a bit of a hack imho.
Agree. Sometimes you just experiment with something, put up a tiny website somewhere... forget about it until you decide it's no longer relevant for whatever reason, and you pull the plug on it... it's not a bad thing. But it's great to have stuff like web archives, to keep our collective memory of worthwhile content. I especially hope that accurate accounts of events get preserved, as originally written, somewhere they can't be changed. Rewriting history seems to be a favourite these days, and preserving the original accounts as things were happening can help combat this. Even if an account was not completely accurate, it can help us understand the actions of contemporary actors, i.e. you may be able to understand what they thought was true at the time, even if that was later revealed to be incorrect.
I view this as a serious failing of the internet that we collectively should have done a better job of avoiding. In most cases I believe the content itself is in fact still available somewhere and it’s simply the link that broke. Some kind of two layer system like the DOI system used for libraries would be helpful for cases like that:
This is a feature, not a bug. It would be a terrible life in a world that does not forget or forgive. It’s also good that some preservation effort is necessary for worthy content: the value of it gets more appreciated.
> It would be a terrible life in a world that does not forget or forgive
This is an orthogonal concern, and arguably is mainly about privacy
> It’s also good that some preservation effort is necessary for worthy content: the value of it gets more appreciated.
This same argument seems to imply that virtually everything should be expensive. Cheap storage is bad because we don't appreciate the value of the files we store. Expensive healthcare is good because it really makes us appreciate our organs.
> worthy content
The hard part is looking into the future to determine which content will be considered worthy then. So far no human civilization has managed to figure that out. They mostly seemed to focus on preserving the image of how amazing their kings were.
Simple: store everything. I'll give you an example: the clay tablets written in cuneiform discovered at Ur. They were disposed of by the Sumerians, possibly because they had fulfilled their purpose and were simply thrown away. They deal with important things like commercial transactions, but also unimportant things like a personal letter or a poem. These unimportant things taught us pretty much all we know about the Sumerian language: syntax, vocabulary, regional variations, etc. In archaeological terms, refuse is a huge treasure trove, precisely because no one chose what to throw away. Everything is there. It's entirely up to the archaeologist to comb through it and come up with an interpretation.
>This same argument seems to imply that virtually everything should be expensive
Non-zero doesn’t mean “unaffordable” or “expensive”.
> The hard part is looking into the future…
The hardest part is to understand that the content we want to preserve carries more valuable information about us than about itself.
Scientific knowledge can be discovered again, it’s not something to worry about. The preservation shapes the future views of us, leaving the trace in the history of those who preserve, their life and their experiences. Maybe they just needed to accept their mortality and irreversible flow of time?
I don't know. Scientific knowledge is expensive to rediscover. It requires a lot of false starts and often involves a great deal of luck/randomness. Historically, periods of flourishing are often associated with the rediscovery or importing of large collections of wisdom. For example revivals of ancient works or the introduction of works from afar due to newly developed trade routes. If ideas were easy to rediscover, we shouldn't expect those events to have much impact.
There is certainly a cost of storing data, and cost should enter the equation. But we're losing a lot of data for reasons other than cost and we don't have a reasonable way of assigning a value to the lost data.
My point is, we are engaging in a very unnatural process, trying to preserve something against the second law of thermodynamics. We are going to lose the data and things are going to break no matter what. We cannot change nature, but we can accept it.
> We are going to lose the data and things are going to break no matter what. We cannot change nature, but we can accept it.
If man was content with the nature of things he would never fly, or go to the moon, or any of the other myriad accomplishments humanity has made. If we can preserve clay tablets from thousands of years ago we can find some way to keep the information we produce today for posterity.
I’m not arguing with that. We can and will preserve some information. But we must not be obsessed with saving everything. Imagine humankind a million years from now; if we pick the 10 most important facts about each century, that will be a hundred thousand facts to remember. Maybe we’ll develop the ability to know and use them all, but from a modern human perspective that’s already too much. Now, can you reduce the 21st century to 10 facts? How much of those zettabytes of information would be worth keeping for a million years? For ten thousand years? For a thousand years?
I can't imagine an archaeologist or historian studying e.g. Pierre de Fermat ever saying, "Good thing we have so few documents and artifacts! That way we know what was really important to him."
What's the proof of Fermat's Last Theorem again? Doesn't matter, it was just a footnote anyway, so let's not bother preserving it. It doesn't matter that it took our smartest minds 358 years of trial and error to rediscover the proof. It can always be discovered again.
We don’t know if the proof did exist. The complexity of the one we have is an indication that he did not tell the truth, whether that was a genuine mistake, a joke or something else. For sure it wasn’t the longest time gap between a problem statement and a solution.
Yes, rebuilding civilization from scratch would be a difficult task, taking centuries if not millennia, if no knowledge were preserved. However, we do preserve it and do spend considerable effort on it, which cannot be said about our culture and individual experiences.
Maybe the complexity of the one we have is an indication that we haven't rediscovered it. We discovered a worse one. Maybe it will take our smartest minds another 350 years to rediscover his. If preserving data was easier in the 1600s we could just grep through Fermat's hard drive and we would know for sure!
Let's keep making preservation easier, and preserving as much as we can. Maybe much of it is worthless, but I guarantee there's at least one document we think is worthless today, that historians 500 years from now will be glad we preserved anyway.
Where do you think we have lived for the last 5 thousand years? We have clay tablets written in cuneiform that were excavated from refuse at Ur, and thanks to those we know the little we know about Sumer. The invention of writing made the exercise of forgetting impossible. This has been thoroughly studied by anthropologists like Jack Goody, James Carey, David Olson, Barry Powell and other writers like Walter Ong. We live, in fact, in a terrible world that is mostly trapped in the past, where cultural complexity grows in onion layers. Anyone can go back to the past and yearn for it. We can always go back to the past through our stored knowledge, but that past will mean different things to different people, as they have not experienced it. Since the invention of the printing press we have lived in a constant state of information inflation. Medieval scholars used to complain that with the printing press anyone could read and write books, scholastics were scandalised by the rise of the vernaculars, Michelangelo complained about Flemish painters and their vacuous form of art, and so on.
What is worth mentioning here is the rate at which decay is occurring. The article mentions that 38% of sites that existed in 2013 are no more; that's a decade. How much of that is noise and how much is useful information, or at the very least "interesting" content, we don't know. It's gone. How much of that info has been saved by the large web scrapers, or how much is stored by Google or Twitter, is also unknown to us.
How do you define worthy content? A tweet with a million views even though it's just a semi-naked actress? A tweet with 300 views about a groundbreaking discovery? We celebrated like there was no tomorrow when the internet brought down the gatekeepers (the newspaper, book, magazine, TV and radio editors), just to get swamped in noise, conspiracy theories, memes, TikTok and so on. The problem is that we can barely cope with the huge amounts of information thrown at us, and we are too many, with tastes too different to ever agree on what's worthy and what's not.
The "feature", as you've called it, may be by design, but that doesn't mean it is useful or morally correct.
It is highly subjective. I’m very curious about the past, but I don’t care if in 10,000 years nobody knows the name of Newton or Mandela while some YouTube blogger is somehow a legend.
> How can entropy be morally correct or incorrect?
You said the disappearance of content was a feature, not a bug. If it is a feature it was designed. I understood your comment as implying that somebody created this feature.
You now speak about entropy, but which one? Boltzmann's or Shannon's? This doesn't have anything to do with bit rot or the like. When you write a book, you cannot unwrite it. It's a fait accompli. But if I create a website, load it with a bunch of content, and after a few years don't pay for the domain, server space, etc., it will be deleted; at that point a webcrawler may have copied all the info, or not, we do not know, or somebody may have thought it was worth it and saved a link to it. At a fundamental level, who decides what stays and what is deleted, and who owns it if the webcrawler stored it without your permission? These are all moral questions, not technical ones...
> You said the disappearance of content was a feature, not a bug. If it is a feature it was designed.
It wasn’t designed, it’s just a very common metaphor about the perception of things rather than the way they came to life.
It means that we should embrace it instead of trying to fix it.
> You now speak about entropy, which one?
Boltzmann. A system where information is preserved forever through the arrangement of energy states is highly improbable, so regardless of individual moral choices and the effort it will fall apart by the laws of nature.
The predominant sentiment I'm seeing in comments here is 'what does it matter?', while at the same time, in discussions here about search engine result quality, the most popular comments inevitably express a decline in quality (influenced by SEO, AI, spam, among other things), along with a desire for surfacing interesting, human-made content that gets buried by modern search algorithms.
On HN we see every day interesting, first page content marked with a title representing its year, sometimes dating back a decade or more. Its age isn't apparently a detracting aspect to this audience if the content is still worth sharing.
And from the article, the headline figure doesn't represent irrelevant or undesirable content either, but Wikipedia references, news articles, and government pages, along with more predictably ephemeral things like Twitter posts.
We're lucky there is archive.org but since it's not indexed like a regular search engine the only tether to old pages are still-live links found via regular search engines/sites (including HN). Essentially unless sites continue to exist that contain links to archived content the chances of future discovery becomes slim.
My stance is if you find something interesting that you expect is worthwhile sharing try saving it in the most convenient single file page format available to you (MHTML, SingleFile, PDF), to have your own copy. For MHTML at least it also saves the original page URL in its metadata. Saving to online archives is also great but admittedly higher friction (and can sometimes result in things like IP restrictions on archive.org even when saving just a handful of pages in a row, ime).
This is because the majority of webpages are on commercial (for-profit) sites now, and for-profit companies do not build anything that lasts. Part of this is their use case requirements: CA TLS has to be the only way to access the sites. And since CA TLS is extremely fragile and short-lived, so are the sites. But additionally, any dynamic site also has a short lifetime. HTML files-in-folders sites will last till the heat death of the universe unmaintained. A PHP or nodejs/etc site will last a year or two after it stops being maintained.
The core of the web pages on the internet are still there. It's just that the thick layer of commercial cr'app' websites built up on top are transient.
From the point of actually trying to keep something online, it's basically constantly having to fight against code rot.
You don't update your server or database or runtime/framework/library? You'll get hacked and will drown in CVEs. You do try to do these updates? Have fun rewriting bits of your code, because the old version of a framework/library is no longer supported and there are breaking changes, which mean needing a partial rewrite.
Your best bet around that might be one of the relatively stable databases like SQLite, a micro-framework on the back end for a RESTful API, a simple solution for auth like basicauth/mTLS/... at the web server level in front of your API, and then something without a toolchain on the front end, like jQuery. I mean this unironically; unless you only maintain a very few sites, in which case you probably have a bit more time on your hands.
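To illustrate how small that kind of boring stack can be, here's a hedged sketch of a tiny REST-ish notes API using only the Python standard library (sqlite3, with http.server standing in for the micro-framework); all names, routes, and the schema are invented for the example:

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

# One in-process connection; use a file path instead of ":memory:" to persist.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

def add_note(body):
    """Insert a note and return its new row id."""
    cur = db.execute("INSERT INTO notes (body) VALUES (?)", (body,))
    db.commit()
    return cur.lastrowid

def list_notes():
    """Return all notes as a list of dicts, oldest first."""
    rows = db.execute("SELECT id, body FROM notes ORDER BY id").fetchall()
    return [{"id": r[0], "body": r[1]} for r in rows]

class NotesHandler(BaseHTTPRequestHandler):
    def _send_json(self, status, obj):
        payload = json.dumps(obj).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        if self.path == "/notes":
            self._send_json(200, list_notes())
        else:
            self.send_error(404)

    def do_POST(self):
        if self.path != "/notes":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        self._send_json(201, {"id": add_note(body["body"])})
```

Wire it up with `HTTPServer(("127.0.0.1", 8080), NotesHandler).serve_forever()` and put the basicauth/mTLS in front at the web server level, as suggested above.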
Feels to me like the only content that can have any sort of longevity without constant investment of time is static sites - where updating your web server or moving to a different one is trivial and there are no write operations involved in most of the processes (maybe setup logrotate or just delete the logs occasionally).
It's an issue of servers, not JS. A BackboneJS app written in CoffeeScript back in 2010 would still run just fine today if it was hosted on S3/CloudFront. Replace the framework/language/year with anything, and it'd still hold true.
But if the page needed to fetch data from an unmaintained API server that ran out of disk space, lost its DB network connection, got rebooted by a VPS provider, or any other issue, that site will probably never work again.
Yeah, it's mostly a matter of database/server exploits. My personal WordPress site got white-hat hacked; they left a friendly note telling me to update. I switched to static instead.
HTML benefits from the fact that browsers have always bent over backwards to make sense of whatever freak noise they're fed. Try the same with your average compiler/interpreter and it'll send you right back to your desk with a list of admonishments. Trouble is, those admonishments change, and require updates. The HTML inside a PHP program from 2005 probably works, but the PHP probably doesn't.
Some sites that I (and some coworkers) wrote for were basically frozen fairly recently. I told everyone they should make their own copies of anything they care about, because I'd put money on the sites needing some maintenance work a few years from now to keep them secure etc. At that point I'm pretty sure that the person responsible wouldn't spend the resources to patch things up and would just shut off the sites.
What does it have to do with that software being free? I haven't found proprietary software to be more reliable or easier to maintain integration with.
> Have fun rewriting bits of your code, because the old version of a framework/library is no longer supported and there are breaking changes, which mean needing a partial rewrite.
You don't get to choose if a third-party decides to rewrite an API interface, deprecate an entire library, etc.
You know that proprietary code has a cost because you pay for it. "Free" software is added to projects without much thought of what happens down the road.
There was a really nice music school/venue/coffee shop in the town I used to live in. It shut down, and they took their Facebook and Instagram pages down. The only evidence it ever existed is in posts announcing events on other Facebook pages, in memories of people who went there, and on still surviving business listings that are likely to go away.
>> "Server information sidebar: This site is being served from an Ubuntu box with 2GB of RAM. The server is currently provided by several people"
This is either added since his death, and it's maintained by supporters, or it was there to begin with and one of the several people took over maintenance and funding.
But that still leaves the question of who. Maybe they want to remain anonymous.
While I am aware that nothing is permanent and everything takes maintenance, I sure feel like there ought to be a bit _more_ permanent way to publish things.
I run a few daily word games. They're static sites that could continue forever as long as they have players, but sometimes I think about what would happen if I died or stopped paying attention. Domains would expire. Maybe some people could still play cached versions, and archive site versions would still exist.
I host them on GitHub Pages, so if I exposed that URL then they could probably exist for as long as GitHub does, or until GitHub breaks something that requires the owner to click a button to fix.
I could probably publish a free desktop version on various stores, and that might last as long as the store does, or until OS upgrades break it.
It would be great if there was a way to publish something such that it exists as long as anyone cares about it.
> It should be a standard for the open internet, that any respectable page has an "export complete archives" as a clickable button. (But it's mainly the opposite, today: adtech corporations don't want your written works to be free, they want them incarcerated in their revenue-generating walled gardens—and if ten billion human's worth of written history gets erased at the end, well, too bad for them!) [0]
Can't you export your data from social media platforms? I haven't tried it, but did Google this for Facebook[1] and it looks as though you can?
I remember people saying that once it is on the web it is there forever. But my experience is how fast web sites disappear. Or get redirected to a nonsensical site in the orient. I think maintaining the domain name is the big problem.
However I am proud to say that you can still see my very first published web page from 1995 if you know the rather obscure url... http://admin.benwillies.com/ticker/
I wrote this page as a proof of concept for a friend of mine who was a financial consultant, but unfortunately the humor was a turn-off instead of getting him excited about this newfangled thing called the Web.
If one looks at the graph, 20% of sites were dead in 3 years.
This isn't a facebook thing, most facebook migration happened a decade ago. Instead this is companies closing, campaign websites shuttering, urls being changed, community events being over, and so on.
I've been thinking about this for a couple of years. With our (recent) history becoming more and more digital, we are losing more of our history. There's a lot of creativity of all forms that is digital and online-only, lasting only as long as the creator supports it. No longer are we viewing physical pieces of art, but ephemeral pieces of art. (I'm including websites made for personal use, by hobbyists, etc.)
I know Archive.org archives a lot, but they can't archive everything, especially all the small personal websites.
A natural consequence of federation. You could attempt to archive everything and host it. But ultimately, each host is responsible for what it hosts and keeping itself available.
That is what I like about federation. All the incentives are individualized. I think the scale that was needed for business in the last century has disappeared. Businesses previously needed scale and reach, but today it is very easy for an individual to make a decent living in a niche.
> Nearly one-in-five tweets are no longer publicly visible on the site just months after being posted. In 60% of these cases, the account that originally posted the tweet was made private, suspended or deleted entirely. In the other 40%, the account holder deleted the individual tweet, but the account itself still existed
This seems extraordinarily high. Since they can't do that estimation on all tweets ever sent, it must be a bias of their sample.
I blogged from 2000 onward but hit my stride with Wordpress in 2004. I posted thousands of times. I had a Google PR of 8 at one point, but it tended to be 6/7.
In 2020 I decided to shutter the site.
The database with the content is still there, it's just inaccessible.
I contacted Google, Bing, Archive.org and others and had them remove my content.
Then I removed everything visible.
Very very happy I did, especially with the rise of AI.
Yeah, I've noticed this - I made a funny little art project in 2013 wordclouding parodies of This is Just to Say (and later other famous poems and literary quotes that often get parodied) because wordclouding is the least interesting kind of visualization there is:
and quite a lot of the content I used is no longer found, so all that remains is the word cloud but not the original poem.
It's a bit worse than the article details though because often the web page is there, or if not is available from IA, but much of the content was actually inside of comments and so forth.
(project was originally inspired because back then you couldn't go anywhere on the web without someone hitting you over the head with their This is Just to Say parodies)
Honestly, seeing startups die and people abandoning their blogs (of which I'm also guilty, despite my best intentions; I just happen to pay for my domains 10 years at a time), I would have expected the number to be much higher.
I post my site content Markdown to an open Git repo for this reason. Anyone can pull and build my pages. I think Git should stay for at least another 100 years. https://github.com/hatdropper1977/john.soban.ski
Attention is limited. We cannot see everything on the Internet. We do not have enough time for that.
There is a lot of valuable and interesting data on the Internet, but it is not visible. A high-quality, low-profile blog that stopped updating in 2015 will certainly not rank high in Google.
Media platforms and search engines monetize content. YouTube channels need to churn out new content every week or so to stay relevant and watchable.
Our society produces content, not quality, not products.
SEO can be gamed; it is impossible to create an objective index of valuable content. Bad actors will hack the game, spam results, and destroy quality for profit.
Google's search engine most often connects users with media sites, with news sites, with the middlemen. More often than not it does not connect users with products directly. Type "search engine" into the search box and you may find not only actual search engines but articles like "Best search engines in 2024" or "Best SEO tricks to boost your page".
Google does not have any incentive to fix this. Search engines are dead tech and will be replaced by chatbots in a few years. People will not search for content; content will be generated on demand.
I wanted to find wargames-related pages. It is quite impossible to find anything interesting concerning Warhammer on the normie internet (outside of Facebook).
The second thing is that I cannot find anything Amiga-related.
This solved my initial problem. I have also found out that many interesting pages are gone. I think that Google directing our attention toward "content" broke good-quality pages.
Right now I am using Google less and less, because I use my bookmark manager more and more.
My solutions may not be as complex as Common Crawl, but they are enough for me, for now. I am still working on my program. It has been a fun and interesting experience, and I have learned a lot: about the Open Graph protocol, about schema, about web scraping, etc. Maybe this will inspire people to be more self-sufficient, and to self-host more.
In times of walled gardens we need more standards and more open data to keep what remains of the old wild west of the Internet.
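As an illustration of the kind of self-hosted tooling described above, here is a minimal Open Graph extractor using only the Python standard library; the sample HTML is made up:

```python
# Minimal Open Graph metadata extractor, stdlib only.
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs from a page."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property", "")
        if prop.startswith("og:") and "content" in attrs:
            self.og[prop] = attrs["content"]

def extract_og(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.og

# Made-up example page, as a stand-in for a fetched document.
sample = """<html><head>
<meta property="og:title" content="My Amiga Page">
<meta property="og:type" content="website">
</head><body></body></html>"""
```

A bookmark manager can store these `og:` fields alongside the URL, so the title and description survive even if the page itself later disappears.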
Are we saying we're going to keep making hard drives so we can save everything ever produced? I see the value in many things, but I worry about the burden of expecting everything to be saved forever.
In some cases, a lot of valuable information that doesn't exist anywhere else. A big German immigration forum vanished this winter. There was a lot of valuable information for people navigating tricky bureaucratic processes.
I am not a lawyer, but as far as I know it depends on your jurisdiction and the exact code in question.
Many legal systems don't know "fair use" and by default you have effectively zero rights to do anything with copyrighted materials without explicit permission.
The license will tell you what is allowed (and if it's a standard one, you can assume it is in accordance with the law).
You can take small snippets, but not the whole page. But why do you want to release someone else's work publicly? Just save it on your own device if you find it useful.
Courts are more likely to find that nonprofit educational and noncommercial uses are fair.
Using a creative work is less likely to support a claim of fair use than using a factual work (such as a technical article or news item).
If the use employs only a small amount of copyrighted material, fair use is more likely.
And since code tends to be more idea than expression most of it can be considered to not fall under copyright after the application of the Abstraction, Filtration, Comparison doctrine.
Of course if you piss off an entity with a bunch of money to throw at lawyers it could be a bigger issue, regardless if you’re in the right, because defending yourself can rack up legal fees.
I do copyright/patent/trade secret inspections of source code for a living.
EDIT: Yet again, downvoted for just stating the truth... The irony is that 10 years ago and before the controversies around LLMs this comment would not have garnered negative attention because the forum was all for weaker copyrights... when copyright affected musicians instead of programmers' bottom lines... Sigh...
Fair use is an affirmative defense: you only get to pull it out after they sue you, and your claim is judged on a case by case basis. It’s not the magic spell of protection people seem to think it is.
Yup, I know that, but I encourage people not to be afraid of republishing little bits of code for educational or archival purposes. They should know they have fair use to lean on in the incredibly unlikely situation of ending up in court... unless of course they're painting a very large target on their backs and making quite a bit of money from the IP of a large corporation!
The entire world benefits from sites like the Internet Archive and the commons-friendly approach of fair use. I recommend changing the laws in your country!
EDIT: I did a little bit of investigation and there are similar limits on copyrights in Germany known as "limitations on protected rights" that seem to carve out things like educational use and archiving, but I don't know anything about German law. I would find it surprising if most Western nations didn't have something similar to fair use unless there was an active interest in damaging the public's access to information.
As many have stated, I would have assumed more than 38%. But good-quality content is rare, and dynamic content made the number of page combinations effectively infinite 20 years ago.
Maintaining 25-year-old URLs is a bit cumbersome, and I sometimes wonder if it's worth it. Most of the traffic seems to come from bots, and they do seem to learn some of the 301s. Still, it seems to be good for SEO, etc.
Some users also get redirected to the content at the new URLs. It feels a bit like helping an elderly person across the street to where the shop is.
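For what it's worth, keeping decades-old URLs alive like this is usually just a small set of permanent redirects in the web server config; the nginx directives below are one way to do it, and the paths are made up:

```
# nginx: keep 25-year-old URLs alive with permanent (301) redirects.
# Paths are illustrative.
location = /old/article.html {
    return 301 /articles/article;
}
location = /cgi-bin/page.pl {
    return 301 /pages/page;
}
```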
Anyway, I hope that bots and humans trust our services more.