






Why I link to Wayback Machine instead of original web content (hawaiigentech.com)
537 points by puggo 19 hours ago | 231 comments













We suggest/encourage people to link to original URLs but ALSO (as opposed to instead of) provide Wayback Machine URLs, so that if/when the original URLs go bad (link rot) the archive URL is available, or to give people a way to compare the content associated with a given URL over time (content drift).
BTW, we archive all outlinks from all Wikipedia articles from all Wikipedia sites, in near-real-time... so that we are able to fix them if/when they break. We have rescued more than 10 million so far from more than 30 Wikipedia sites. We are now working to have Wayback Machine URLs added IN ADDITION to Live Web links when any new outlinks are added... so that those references are "born archived" and inherently persistent.
Note, I manage the Wayback Machine team at the Internet Archive. We appreciate all your support, advice, suggestions and requests.





It's interesting to think about how HTML could be modified to fix the issue. Initial thought: along with HREF, provide AREF, a list of archive links. The browser could automatically try a backup if the main one fails. The user should be able to right-click the link to select a specific backup. Another idea is to allow the web-page author to provide a rewrite rule to automatically generate wayback machine (or whatever) links from the original. This seems less error-prone, and browsers could provide a default that authors could override.
Anyway, the fix should work even with plain HTML. I'm sure there are a bunch of corner cases and security issues involved.
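As a rough illustration only (not a spec), the extra archive URLs could live in a made-up data attribute today, with a small script trying them in order when the primary link is unreachable; the attribute name and behaviour here are purely hypothetical:

    // Sketch of the AREF idea as a script-level polyfill. "data-aref" is a
    // hypothetical attribute holding a comma-separated list of archive URLs.
    document.addEventListener('click', async (event) => {
      const link = event.target.closest('a[data-aref]');
      if (!link) return;
      event.preventDefault();
      const candidates = [link.href, ...link.dataset.aref.split(',')];
      for (const url of candidates) {
        try {
          // With no-cors we can only detect network-level failures, not 404s;
          // doing this properly is exactly why browser support would help.
          await fetch(url, { method: 'HEAD', mode: 'no-cors' });
          window.location.href = url;   // first reachable candidate wins
          return;
        } catch (_) {
          // host unreachable: fall through to the next candidate
        }
      }
      alert('All known copies of this link appear to be unreachable.');
    });

The right-click "pick a specific backup" and the author-supplied rewrite rule would still need browser support, as the comment says.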
Well as mentioned by others, there is a browser extension. It's interesting to read the issues people have with it:





So this is a little indirect, but it does avoid the case where the Wayback machine goes down (or is subverted): include a HASHREF which is a hash of the state of the content when linked. Then you could find the resource using the content-addressable system of your choice. (Including, it must be said, the wayback machine itself).
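For illustration, such a HASHREF could be computed at link-creation time with the Web Crypto API (the function name is made up, and note that dynamic pages will hash differently on every load, as discussed further down):

    // Compute a SHA-256 digest of a resource's raw bytes, usable as a
    // hypothetical HASHREF value or as a key in a content-addressed store.
    async function hashref(url) {
      const bytes = await (await fetch(url)).arrayBuffer();
      const digest = await crypto.subtle.digest('SHA-256', bytes);
      return Array.from(new Uint8Array(digest))
        .map(b => b.toString(16).padStart(2, '0'))
        .join('');
    }

    // Example: hashref('https://example.com/paper.pdf').then(console.log);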





> (Including, it must be said, the wayback machine itself).
Citation needed? E.g. something like http://web.archive.org/cdx/search/cdx?url=http://haskell.cs.... produces lines of the form:

edu,yale,cs,haskell)/wp-content/uploads/2011/01/haskell-report-1.2.pdf 20170628055823 http://haskell.cs.yale.edu/wp-content/uploads/2011/01/haskell-report-1.2.pdf warc/revisit - WVI3426JEX42SRMSYNK74V2B7IEIYHAS 563
But there seems to be no documented way to turn WVI3426JEX42SRMSYNK74V2B7IEIYHAS (which I presume to be the hash) into an actual file. (Though http://web.archive.org/web/$DATEim_/$URL works fine, so it hasn't been a problem in practice.)
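For what it's worth, the CDX rows can be fetched as JSON and turned into those im_ retrieval URLs mechanically; a small sketch, assuming the CDX API's output=json mode behaves as documented:

    // Query the Wayback CDX API and build the "$DATEim_/$URL" raw-content URL
    // for each capture. The digest field is the value like WVI3426JEX42...
    async function waybackCaptures(url) {
      const api = 'https://web.archive.org/cdx/search/cdx?output=json&url=' +
                  encodeURIComponent(url);
      const rows = await (await fetch(api)).json();   // first row is the header
      const [header, ...captures] = rows;
      return captures.map(row => {
        const c = Object.fromEntries(header.map((name, i) => [name, row[i]]));
        return {
          digest: c.digest,
          raw: `https://web.archive.org/web/${c.timestamp}im_/${c.original}`,
        };
      });
    }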





> Citation needed
Oh, sorry, I don't think the WM supports this today. I only meant that it could support it "trivially" (I put that in quotes since I don't know how WM is implemented. But in theory it would be easy to hash all their content and add an endpoint that maps from hashes to URLs).
My point was that you could add an addressing system that is independent of the Wayback Machine but which you could still (theoretically) use with it. But you'd have to add the facility to the WM.





Ah, that's disappointing, but oh well.





I've found that web pages have so much dynamic content these days that even something that feels relatively static generates two different hashes almost on every pageload.





Indeed. I don't think you could or should hash the DOM, not least because it is, in general, the structured output of a program. Ideally you could hash the source. This might be a huge problem for single page applications, except you can always pre-render a SPA at any given URL, which solves the problem. (This is done all the time - the most elegant way is to run e.g. React on the server to pre-render, but you can also use another templating system in an arbitrary language, although you end up implementing every feature maybe not twice, but about 1.5x.)





This is literally where my brain was going and I was glad to see someone went in the same direction. Given the <img> tag’s addition of srcset in recent years, there is precedent for doing something more with href.





Yup, I've been using the extension for probably about a year now and get the same issues they do. It really isn't that bad, most of the time backing out of the message once or twice does the trick, but it's funny because most of the time I get that message when going to the IA web uploader.





This is so much better than INSTEAD.
Not for the sole reason that it leaves some control to the content owner while ultimately leaving the choice to the user, but also because things like updates and errata (e.g. retracted papers) can't be found in archives. When you have both, it's the best of both worlds: you have the original version, the updated version, and you can somehow get the diff between them. IMHO, this is especially relevant when the purpose is reference.





Thanks so much for running this site - as a small start-up we often manually request a snapshot of our privacy policy/terms of service/other important announcements whenever we make changes to them (if we don't manually request them, the re-crawl generally doesn't happen, since I guess those pages are very rarely visited even though they're linked from the main site). It's helped us in a thorny situation where someone tried to claim "it wasn't there when I signed up".
It might be an interesting use-case for you to check out, i.e. keep an eye on those rarely visited legal sublinks for smaller companies.





I mostly agree... however, given how many "news" sites are now going back and completely changing articles (headlines, content) without any history, I think it's a mixed bag.
Link rot isn't the only reason why one would want an archive link instead of original. Not that I'd want to overwhelm the internet archive's resources.





I love the feature that lets you easily add a page to the archive: https://web.archive.org/save/https://example.com
Replace https://example.com in the URL above. I try to respect the cost of archiving by not saving the same page too often.





Kudos for doing what you do.





I always wonder about the rise in hosting costs in the wake of people linking to the Wayback Machine on popular sites.
How do you think about it?





I'm not sure I'm a fan of this because it just turns WayBackMachine into another content silo. It's called the world wide web for a reason, and this isn't helping.
I can see it for corporate sites where they change content, remove pages, and break links without a moment's consideration.
But for my personal site, for example, I'd much rather you link to me directly rather than content in WayBackMachine. Apart from anything else linking to WayBackMachine only drives traffic to WayBackMachine, not my site. Similarly, when I link to other content, I want to show its creators the same courtesy by linking directly to their content rather than WayBackMachine.
What I can see, and I don't know if it exists yet (a quick search suggests perhaps not), is some build task that will check all links and replace those that are broken with links to WayBackMachine, or (perhaps better) generate a report of broken links and allow me to update them manually just in case a site or two happen to be down when my build runs.
I think it would probably need to treat redirects like broken links given the prevalence of corporate sites where content is simply removed and redirected to the homepage, or geo-locked and redirected to the homepage in other locales (I'm looking at you and your international warranty, and access to tutorials, Fender. Grr.).
I also probably wouldn't run it on every build because it would take a while, but once a week or once a month would probably do it.
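A minimal sketch of what such a build task could look like (Node, using the global fetch of recent versions); in practice the URL list would be harvested from the generated HTML, and redirects are counted as broken per the above:

    // Check outlinks and report broken ones with a suggested Wayback URL.
    const links = [
      'https://example.com/some-article',   // placeholder: harvest these from your built pages
      'https://example.org/tutorial',
    ];

    async function reportBrokenLinks() {
      for (const url of links) {
        let broken = false;
        try {
          const res = await fetch(url, { method: 'HEAD', redirect: 'manual' });
          broken = res.status >= 300;       // treat redirects as broken, as discussed above
        } catch (_) {
          broken = true;                    // DNS failure, timeout, etc.
        }
        if (broken) {
          console.log(`BROKEN  ${url}`);
          console.log(`  candidate replacement: https://web.archive.org/web/*/${url}`);
        }
      }
    }

    reportBrokenLinks();

Run it weekly or monthly from cron rather than on every build, as suggested.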





> But for my personal site, for example, I'd much rather you link to me directly rather than content in WayBackMachine.
That would make sense if users were archiving your site for your benefit, but they're probably not. If I were to archive your site, it's because I want my own bookmarks/backups/etc to be more reliable than just a link, not because I'm looking out to preserve your website. Otherwise, I'm just gambling that you won't one day change your content, design, etc on a whim.
Hence I'm in a similar boat as the blog author. If there's a webpage I really like, I download and archive it myself. If it's not worth going through that process, I use the wayback machine. If it's not worth that, then I just keep a bookmark.





The issue is that if this becomes widespread then we're going to get into copyright claims against the wayback machine. When I write content it is mine. I don't even let Facebook crawlers index it because I don't want it appearing on their platform. I'm happy to have wayback machine archive it, but that's with the understanding that it is a backup, not an authoritative or primary source.
Ideally, links would be able to handle 404s and fall back, like we can do with images and srcset in HTML. That way if my content goes away we have a backup. I can still write updates to a blog piece or add translations that people send in, and everyone benefits from the dynamic nature of content, while still being able to either fall back or verify content at the time it was published via the wayback machine.





There already have been copyright claims against The Wayback Machine. They've been responding to it by allowing site owners to use robots.txt to remove their content.





But it’s also not guaranteed to be consistent. What if you don’t delete the content but just change it? (I.e. what if your opinions change or you’re pressured to edit information by a third party?).





I addressed this.
> I can still write updates to a blog piece or add translations that people send in, and everyone benefits from the dynamic nature of content, while still being able to either fall back or verify content at the time it was published via the wayback machine.
Updates are usually good. Sometimes you need to verify what was said though, and for that wayback machine works. I agree it would be nice if there was a technical way to support both, but for the average web request it's better to link to the source.





Perhaps the wayback machine can help fix that by telling users to visit the authoritative site and demanding a confirmation clickthrough before showing the archived content.





> Perhaps the wayback machine can help fix that by telling users to visit the authoritative site and demanding a confirmation clickthrough before showing the archived content.
I'm trying to figure out if you're being ironic or serious.
People on here (rightly) spend a lot of time complaining about how user experience on the web is becoming terrible due to ads, pop-ups, pop-unders, endless cookie banners, consent forms, and miscellaneous GDPR nonsense, all of which get in the way of whatever it is you're trying to read or watch, and all of it on top of the more run-of-the-mill UX snafus with which people casually litter their sites.
Your idea boils down to adding another layer of consent clicking to the mess, to implement a semi-manual redirect through the WayBackMachine for every link clicked. That's ridiculous.
I have to believe you're being ironic because nobody could seriously think this is a good idea.





Agreed: cut the clutter, just as it's kept simple on the HN website.





It's a deep problem with the web as we know it.
If I want to make a "scrapbook" to support a research project of some kind, really I want to make a "pyramid": a general overview of at most a few pages at the top, then some more detailed documents, with the original reference material incorporated and linked to what it supports.
In 2020 much of that reference material will come from the web, and you are left either doing the "webby" thing (linking), which is doomed to fall victim to broken links, or archiving the content, which is OK for personal use but will not be OK with the content owners if you make it public. You could say the public web is also becoming a cesspool/crime scene, where even reputable web sites are suspected of pervasive click fraud and where the line between marketing and harassment gets harder to see every day.





Is it a deep problem? You can download content you want to keep. There are many services like evernote and pocket that can help you with it.





It is, because it ultimately comes down to the owner's control over how their content is used.
For example, a modern news site will want the ability to define which text is "authoritative" and to modify it on the fly, including unpublishing it. As a reader, OTOH, I want a permanent, immutable copy of everything said site ever publishes, so that silent edits and unpublishing are not possible. These two perspectives are in conflict, and that conflict repeats itself throughout the entire web.





Some consumers will want the latest and greatest content. To please everyone (other than the owner) you'd need to look at the content across time, versions, alternate world views,... Thus "deep".
My central use case is that I might 'scrape' content from sources such as
and have the process be "repeatable" in the sense that:
1. The system archives the original inputs and the process to create refined data outputs
2. If the inputs change the system should normally be able to download updated versions of the inputs, apply the process and produce good outputs
3. If something goes wrong there are sufficient diagnostics and tests that would show invariants are broken, or that the system can't tell how many fingers you are holding up
4. and in that case you can revert to "known good" inputs
I am thinking of data products here, but even if the 'product' is a paper, presentation, or report that involves human judgements there should be a structured process to propagate changes.





> If it's not worth that, then I just keep a bookmark.
I've made a habit of saving every page I bookmark to the WayBackMachine. To my mind, this is the best of both worlds: you'll see any edits, additions, etc. to the source material, and if something you remember has been changed or gone missing, you have a static reference. I just wish there were a simple way to diff the two.
I keep meaning to write browser extensions to do both of these things on my behalf ...





I can understand posting a link, plus an archival link just in case the original content is lost. But linking to an archival site only is IMO somewhat rude.





> What I can see, and I don't know if it exists yet (a quick search suggests perhaps not), is some build task that will check all links and replace those that are broken with links to WayBackMachine
Addendum: First, that same tool should – at the time of creating your web site / blog post / … – ask WayBackMachine to capture those links in the first place. That would actually be a very neat feature, as it would guarantee that you could always roll back the linked websites to exactly the time you linked to them on your page.
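A tiny sketch of that capture step, using the public /save/ endpoint mentioned elsewhere in this thread (be gentle with it and avoid re-saving the same URL repeatedly):

    // Ask the Wayback Machine to snapshot each outlink at publish time,
    // so a capture dated to the moment of linking exists from day one.
    async function captureAtPublish(outlinks) {
      for (const url of outlinks) {
        await fetch('https://web.archive.org/save/' + url);
      }
    }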





I don't care enough to look into it, but I think Gwern has something like this set up on gwern.net.





Doesn't Wikipedia do something like this? If not, the WBM/Archive.org does something like it on Wikipedia's behalf.





Gwern.net has a pretty sophisticated system for this https://www.gwern.net/Archiving-URLs





It would be nice if there were an automatic way to have a link revert to the Wayback Machine once the original link stops working. I can't think of an easy way to do that, though.





Brave browser has this built in, if you end up at a dead link the address bar offers to take you to wayback machine.





This was first implemented in Firefox, as an experiment, and is now an extension:





I used this extension for a while but had to stop due to frequent false positives. YMMV





There exists a manual extension called Resurrect Pages for Firefox 57+, with Google Cache, archive.is, Wayback Machine, and WebCite.





I just use a bookmarklet

javascript:void(window.open('https://web.archive.org/web/*/'+location.href.replace(/\/$/,%20'')));
(which is only slightly less convenient than what others have already pointed out — the FF extension and Brave built-in feature).





Another nice solution is to create a "search engine" for https://web.archive.org/web/*/%s. You can then just add the keyword before the URL (for example, I type `<Ctrl-l><Left>w <Enter>`). Search engines like this are supported by Chrome and Firefox.





I would love for there to be a site that redirected e.g. better.site/ https://www.youtube.com/watch?v=jzwMjOl8Iyo to https://invidious.site/watch?v=jzwMjOl8Iyo so I could easily open YouTube links with Invidious, and the same for Twitter→Nitter, Instagram→Bibliogram, Google Maps → OSM, etc. without having to manually remove the beginning of the URL. I'd presume someone on HN has the skill to do this, similarly to https://news.ycombinator.com/item?id=24344127





You can make a "search engine" or bookmarklet that is a javascript/data URL that does whatever URL mangling you need. (Other than some minor escaping issues).
Something like the following should work. You can add more logic to support all of the sites with the same script, or make one per site.
javascript:document.location="%s".replace(/^https:\/\/www\.youtube\.com/, "https://invidious.site")





wikipedia just does "$some-link-here (Archived $archived-version-link)", and it works pretty well, imo.





For me that is the real solution: you know the archived link is the one the author actually consulted, and the normal link points at the live content (or its evolution).





IIRC Wikipedia has some logic for this. When you add a reference, it automatically makes sure the page is backed up (triggering a Wayback copy if not); it then scans for dead links in references and, if one is found, replaces the link with the Wayback version.





Agreed, and it shouldn't be too much of a burden to use since the author was quite clear about it being for reference materials. The idea isn't all that different from referring to specific print editions.





Either a browser extension, or an 'active' system where your site checks the health of the pages it links to.










Their browser extension does exactly that...





The International Internet Preservation Consortium is attempting a technological solution that gives you the best of both worlds in a flexible way, and is meant to be extended to support multiple archival preservation content providers.
(although nothing else like the IA Wayback machine exists presently, and I'm not sure what would make someone else try to 'compete' when IA is doing so well, which is a problem, but refusing to use the IA doesn't solve it!)





Or: snapshot a WARC archive of the site locally, then start serving it only in case the original goes down. For extra street cred, seed it to IPFS. (A.k.a. one of too many projects on my To Build One Day list.)





ArchiveBox is built for exactly this use-case :)





I use linkchecker for this on my personal sites:
There's a similar NodeJS program called blcl (broken-link-checker-local) which has the handy attribute that it works on local directories, making it particularly easy to use with static websites before deploying them.





> There's a similar NodeJS program called blcl (broken-link-checker-local) which has the handy attribute that it works on local directories
linkchecker can do this as well, if you provide it a directory path instead of a url.





Ah, thanks! I was not aware of that feature.





I made a browser extension which replaces links in articles and stackoverflow answers with archive.org links on the date of their publication (and date of answers for stackoverflow questions): https://github.com/alexyorke/archiveorg_link_restorer





And when you die, who will be maintaining your personal site? What happens when the domain gets bought by a link scammer?
Maybe your pages should each contain a link to the original, so it's just a single click if someone wants to get to your original site from the wayback backup.





Wayback machine converts all links on a page to wayback links so you can navigate a dead site normally.





Well that's a bummer. Any way to defeat it?





> generate a report of broken links
I actually made a little script that does just this. It’s pretty dinky but works a charm on a couple of sites I run.





> generate a report of broken links and allow me to update them manually just in case a site or two happen to be down when my build runs.
SEO tools like Ahrefs do this already. Although, the price might be a bit too steep if you only want that functionality. But there are probably cheaper alternatives as well.





Yeah, at some point the Wayback Machine needs to be on a WebTorrent/IPFS type of thing where it is immutable.





I was surprised when digital.com got purged
Then further dismayed that the utzoo Usenet archives were purged.
Archive sites are still subject to being censored and deleted.





Is there any active project pursuing this idea?





The largest active project doing this (to my knowledge) is the Inter-Planetary Wayback Machine:
There have been many other attempts though, including internetarchive.bak on IPFS, which ended up failing because it was too much data.





Here's an extension to archive pages on Skynet, which is similar to IPFS but uses financial compensation to ensure availability and reliability.
I don't know if the author intends to continue developing this idea or if it was a one-off for a hackathon.





FileCoin is the incentivization layer for IPFS, both built by Protocol Labs.





I'm hoping someone here on Hacker News will pick it up and apply for the next round at Y Combinator. A non-profit would be better than a for-profit in this case. Blockchain-ish tech would be perfect for this. If in a few years no one does, then I'll do it.










Not to forget that while I might go to an article written ten years ago, the Wayback archive won't show me a related article that you published two years ago updating the article information or correcting a mistake.





I spent hours getting all the stupid redirects working from different hosts, domains and platforms.
People still use RSS to either steal my stuff or discuss it off-site (as if commenting to the author is so scary!), often in a way that leaves me totally unaware it's happening; so many times people ask questions of the author on a site like this, or bring up good points worth following up on, that I would otherwise miss.
It's a shame pingbacks were hijacked, but the siloing sucks too.
Sometimes I forget for months at a time to check other sites, not every post generates 5000+ hits in an hour.





What if your personal site is, like so many others these days, on shared-IP hosting like Cloudflare, AWS, Fastly, Azure, etc.?
In the case of Cloudflare, for example, we as users are not reaching the target site, we are just accessing a CDN. The nice thing about archive.org is that it does not require SNI. (Cloudflare's TLS1.3 and ESNI works quite well AFAICT but they are the only CDN who has it working.)
I think there should be more archive.org's. We need more CDNs for users as opposed to CDNs for website owners.





The "target site" is the URL from the author's domain, and Cloudflare is the domain's designated CDN. The user is reaching the server that the webmaster wants reachable.
That's how the web works.
> The nice thing about archive.org is that it does not require SNI
I fail to see how that's even a thing to consider.





If the user follows an Internet Archive URL (or Google cache URL or Bing cache URL or ...), does she still reach "the server the webmaster wants reachable"?
SNI, more specifically sending domain names in plaintext over the wire when using HTTPS, matters to the IETF because they have gone through the trouble of encrypting server certificate in TLS 1.3 and eventually they will be encrypting SNI. If you truly know "how the web works", then you should be able to figure out why they think domain names in plaintext is an issue.










Came here for this. Have my upvote.





This is a bad idea...
In the worst case one might write a cool article and get two hits: one from someone noticing it exists, and the other from the archive service. After that it might go viral, but the author may have given up by then.
The author is losing out on inbound links so google thinks their site is irrelevant and gives it a bad pagerank.
All you need to do is get archive.org to take a copy at the time, you can always adjust your link to point to that if the original is dead.





There's no reason PageRank couldn't be adapted to take Wayback Machine URLs into account: if a link's URL points at https://web.archive.org/web/*/https://news.ycombinator.com/, Google could easily register that as a link to both resources - one to web.archive.org, the other to the site.
There is also no reason why that has to become a slippery slope, if anyone is going to ask "but where do you stop!!"





After all, they did change their search to accommodate AMP. Changing it to take the Web Archive into account is a) peanuts and b) actually better for the web.





There's a business idea in there somewhere.
Some kind of CDN-edge-archive hybrid.





"CDN-Whether-You-Want-It-Or-Not"





Foreverspin meets Cloudflare





Google shouldn't be the center of the Web. They could also easily determine where the archive link is pointing to and not penalize. But I guess making sure we align with Google's incentives is more important than just using the Web.





> Google shouldn't be the center of the Web.
I agree, but are you suggesting it's going to be better if WayBackMachine is?





That's a strawman because I never said they should be. There's room for better alternatives.
We as a community need to think bigger rather than resigning ourselves to our fate.





It's not a strawman because (a) I agreed with you, (b) context, and (c) I asked a question based on what you seemed to be implying in that context: a question to which you still haven't provided an answer.
Let me put it another way: what specifically are you suggesting as an alternative?





If I had to pick a solution from what's available right now technology wise I'd pick something that links based on content hashes. And then pulls the content from decentralized hosting.
I don't think I like IPFS as an organization, but tech wise it's probably what I'd go with.





Yes. At least Archive.org isn't an evil mega corporation destroying the internet. Yet.





We'll see what their new owners do after the lawsuit.





Every search engine uses the number of backlinks as one of the key factors in influencing search rank; it's a fundamental KPI when it comes to understanding whether a link is credible.
What is true for Google in this regard is also true of Bing, DDG and Yandex.





> But I guess making sure we align with Google's incentives is more important than just using the Web.
It's not about Google's incentives. It's about directing the traffic where it should go. Google is just the means to do so.
Build an alternative, I'm sure nobody wants Google to be the number one way of finding content, it's just that they are, so pretending they're not and doing something that will hurt your ability to have your content found isn't productive.





I totally agree.
I guess the answer is "don't mess with your old site", but that's also impractical.
And I'm sorry, but if it's my site, then it's my site. I reserve the right to mess about with it endlessly. Including taking down a post for whatever reason I like.
I'm sorry if that conflicts with someone else's need for everything to stay the same but it's my site.
Also, if you're linking to my article, and I decide to remove said article, then surely that's my right? It's my article. Your right to not have a dead link doesn't supercede my right to withdraw a previous publication, surely?





You can go down this road, but it looks like you're advocating for each party to simply do whatever he wants. In which case the viewing party will continue to value archiving.





I certainly don't know about legal rights, but I think the ethical thing is to make sure that any writings published as freely accessible should remain so forever. What would people think if an author went into every library in the world to yank out one of their books they no longer want to be seen?
I do think the author is wrong to immediately post links to archived versions of sources. At the least, he could link to both the original and archived.





Publishing on your own website is more akin to putting up a signboard on your front lawn than writing a book for publication.
People are free to view it and take pictures for their own records, but I could still take it down and put something else up.





Why is that the most ethical thing to do?
As a motivating example, I wrote some stuff on my MySpace page as a teenager that I'm very glad is no longer available. They were published as "freely accessible" and indeed, I wanted people to see it. But when I read it back 15 years later, I was more than a little embarrassed about it, and I deleted it - despite it also having comments from my friends at the time, or being referenced in their pages.
No great value was contained in those works.





I'm not sure I agree. I know that journalism (as a discipline) considers this ethical. I kinda get that this is part of the newspaper industry as a public service - that withdrawing publication of something, or changing it without alerting the reader to the change, alters the historical record.
But no-one has a problem with other creative industries withdrawing their publications. Film-makers are forever deciding that movies are no longer available, for purely commercial reasons. Why is writing different? Why is pulling your books from a library unethical but pulling your movie from distribution is OK?
I think we either need to extend this to all creative activity, or reconsider it for writing.





> But no-one has a problem with other creative industries withdrawing their publications
I wouldn't say no one has a problem with this. It does happen, but it certainly doesn't make everyone happy. I for one would like for all released media to be available, or at least not actively removed from access.





This has a very easy answer for me: It's not ethical for film makers to decide that movies are no longer available.
Copyright was created to encourage publication of information, not to squirrel it away. Copyright should be considered the exception to the standard - public domain.





Why not?
Is it unacceptable for an artist to throw her art away after it has finished its museum tour? Should a parent hang on to every drawing their child has ever made?
If you are a software developer - is all of the code you've ever written still accessible online, for free? (To the legal extent that you are able, of course.)
Have you written a blog before, or did you have a MySpace? Have you taken care to make sure your creative work has been preserved in perpetuity, regardless on how you feel about the artistic value of displaying your teen emotions?
Consider why you feel it is unethical for the author or persons responsible for the work to ever stop selling it.





> Is it unacceptable for an artist to throw her art away after it has finished its museum tour? Should a parent hang on to every drawing their child has ever made?
This boils down to the public domain, IMO. We have made a long practice of rescuing art from private caches and trash bins to make them publicly available after the artists' passing (the copyright expiring); regardless of their views on what should happen with those works.
> Consider why you feel it is unethical for the author or persons responsible for the work to ever stop selling it.
Selling something and then pulling it down is fundamentally an attempt to create scarcity for something that would otherwise be freely available. It's a marketing technique that capitalizes on our fear of missing out to make a sale.
Again, the right to even sell writings was enshrined in law as an exception to the norm of it immediately being part of the public domain, in an effort to encourage more writing.





Not sure we need any more encouragement on that front ;)
Sure, after I'm dead, you can do with my stuff whatever you like.
But while I'm alive.... it's my stuff and I can do with it what I like. Including tearing it up because I hate it now and don't want anyone to look at it.





One can also do it the way Wikipedia reference sections do, linking to both the original and the memento in the archive (once the bot notices it's gone).
Additional benefit: some edits are good (addenda, typo corrections, etc.).





archive.org sends the HTTP header

Link: <https://example.com>; rel="original"
This can be used by search engines to adjust their ranking algorithms.
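Assuming the header really is sent as described, it can be read with a HEAD request; a small sketch (server-side, since CORS will usually hide the header from in-browser scripts):

    // Recover the rel="original" URL from a Wayback Machine capture's headers.
    async function originalOf(waybackUrl) {
      const res = await fetch(waybackUrl, { method: 'HEAD' });
      const link = res.headers.get('link') || '';
      const match = link.match(/<([^>]+)>;\s*rel="original"/);
      return match ? match[1] : null;
    }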





Even worse is when you have people using RSS to wholesale copy your site and its updates, and again that traffic, and more importantly the engagement, disappears.
It’s very demotivating





So, this is the problem of persistence: URLs always referencing the original content, regardless of where it is hosted, in an authoritative way.
It's an okay idea to link to WB, because (a) it's de facto assumed to be authoritative by the wider global community and (b) as an archive it provides a promise that its URLs will keep pointing to the archived content come what may.
Though such promises are just that: promises. Over a long period of time, no one can truly guarantee the persistence of a relationship between a URI and the resource it references. That's not something technology itself solves.
The "original" URI still does carry the most authority, as that's the domain on which the content was first published. Moreover, the author can explicitly point to the original URI as the "canonical" URI in the HTML head of the document.
Moreover, when you link to the WB machine, what do you link to? A specific archived version? Or the overview page with many different archived versions? Which of those versions is currently endorsed by the original publisher, and which are deprecated? How do you know this?
Part of ensuring persistence is the responsibility of original publisher. That's where solutions such as URL resolving come into play. In the academic world, DOI or handle.net are trying to solve this problem. Protocols such as ORE or Memento further try to cater to this issue. It's a rabbit hole, really, when you start to think about this.





> Moreover, when you link to the WB machine, what do you link to? A specific archived version? Or the overview page with many different archived versions? Which of those versions is currently endorsed by the original publisher, and which are deprecated? How do you know this?
WB also supports linking to the very latest version. If the archive is updated frequently enough I would say it is reasonable to link to that if you use WB just as a mirror. In some cases I've seen error pages being archived after the original page has been moved or removed though but that is probably just a technical issue caused by some website misconfiguration or bad error handling.





Signed HTTP Exchanges could be a neat solution here.





You can create a bookmark in Firefox to save a link quickly.
Keyword - save
So searching 'save https://news.ycombinator.com/item?id=24406193' archives this post.
You can use any Keyword instead of 'save'.
You can also search with https://web.archive.org/*/%s





Does that `save` keyword work?
The problem is %s gets escaped, so Firefox generates this URL, which seems to be invalid:





Uppercase %S for unescaped, e.g.:





Ah, nice, thanks!





web.archive.org automatically converts the https%3A%2F things to https:// for me. I noticed it many times.
If you are still facing problems, go to https://web.archive.org . In the bottom right 'Save page now' field, right click and select 'add keyword for search'. Choose your desired keyword.





>web.archive.org automatically converts the https%3A%2F
Did you try the link provided by the one you replied to?
Because it says "HTTP 400" here, so apparently it doesn't convert well, at least not on my end.





Nice. I forgot how you can do that.
I just use the extension myself:





One issue I have with this extension is that it randomly pops up the 'this site appears to be offline' message (which overrides the entire page) even when the site actually works (I hit the back button and it appears). I have had it installed for some time now, and so far I get almost daily false positives; only once has it actually worked as intended.
Also, there doesn't seem to be a way to open a URL directly from the extension, which seems a weird omission, so I end up going to the archive site anyway since I very often want to find old, long-lost sites.





It pops up when an HTTP 404 status code or similar is returned. So these false positives are likely due to the specific sites being configured in a wacky way.
(Don't get me wrong, it is still very annoying for the user regardless what the cause is.)





Does it pop up for any 404 error? If so, it might be some script or font or whatever resource the site itself is using that would otherwise fail silently. If not... then there has to be some other bug/issue, because I get it for many different sites that shouldn't have it.





Nope, only for the "main" page (for lack of a better word), and when there is an archive for it.





Yeah. That requires access to all sites. I wasn't comfortable adding another addon with that permission.
The permission is just for a simple reason and should be off by default. It is so you can right click a link on any page and select 'archive' from the menu. Small function, but requires access to all sites.





The source is available if you want to know what's going on with those permissions: https://github.com/internetarchive/wayback-machine-chrome





Thanks. I already knew that. I'm familiar with the dev's extensions. Clear Browsing Data and Captcha Buster are very useful.





This is building yet another silo and point of failure. We can't pass the entire Internet traffic thru WayBackMachine as its resources are limited.
Most preserving solutions are like that and at the end the funding or business priorities (google groups) become a serious problem.
I think we need something like the web itself: distributed, and dead simple to participate in and contribute preservation space to.
Look, there are Torrents that have been available for 17 years [0]. Sure, some are uninteresting and long gone, but there is always a little chance somebody still has the file and someday comes online with it.
I know about IPFS/Dat/SBB, but that stuff, like Bitcoin, is still too complex for a layman contributor with a plain altruistic motivation. It should be like SETI@Home: fire and forget. Eventually integrated with a browser to cache content you star/bookmark and share it when the original is offline.





Can we update this link to point to the archive version?





Brilliant





Link rot has convinced me that the web is not good for its ostensible purpose. I used to roll my eyes reading how academic researchers and librarians would discourage using webpages as resources. Many years later, it's obvious that the web is pretty bad for anything that isn't ephemeral.





We have deposit libraries in the U.K., such as The British library and Oxford University's Bodleian. When you publish a book in the U.K. you are supposed to offer a copy to these institutions.
If we had legal deposit web archiving institutions, then academics, and others, could create an archive snapshot of some resource and then reference the URI to that (either with or without the original URI), so as to ensure permanence.





>I used to roll my eyes reading how academic researchers and librarians would discourage using webpages as resources.
While this is true in general, I am amused that this is not true for citing wikipedia. Wikipedia can be trusted to remain online for many more years to come. And it has a built-in wayback machine in the form of Revision History.





Try following the references on big Wiki pages and you will see why Wikipedia pages are nightmarish for any kind of research. This is important when you are trying to drill down to the sources of various claims. Many major pages relating to significant events and concepts are riddled with rotted links.
The page can be completely correct and accurate, but if you cannot trace the references then it cannot be verified and you cannot make the claims in a new work as a result. The whole point of references is to make it so that the claims can be independently verified. Even when there isn't a link rot problem you will often find junk references that cannot be verified.
Wikipedia isn't a bad starting point and sometimes you can find good references. But it is not anywhere close to reliable: just trace the references in the next 20 Wiki articles you read and your faith will be shaken.





Usually a reference indicates that an author believes something to be true, but won't explicitly state their reasons. It isn't just a statement of where information comes from, but a justification for trusting that information. If the reference is from a reputable source, then it indicates that this belief is justified. If an author believes something to be true because they read it on wikipedia, then that belief probably isn't justified, because the reliability of wikipedia content is mixed.
Good quality information on wikipedia often refers back to published sources, and at the very least an author should check that source and refer to it, rather than wikipedia itself.





After someone published an authoritative FTP listing, so many people panicked (theirs were out-of-date and insecure versions) that rather than patch, they all went dark.
Anyone doing research just got screwed.
So many papers have code links pointing to places that don't exist anymore.





By that reasoning, shouldn't you be using WayBack Machine links when posting your own content to HN, instead of posting direct links?





I understand where the author is coming from, but I think the best approach is to write your content with direct links to the canonical versions of articles.
Have a link checking process you run regularly against your site, using some of the standard tools I've mentioned elsewhere in this thread:
When you run the link check (which should be regularly, perhaps at least weekly), also run a process that harvests the non-local links from your site and 1) adds any new links' content to your own local, unpublished archive of external content, and 2) submits those new links to archive.org.
This keeps canonical URLs canonical, makes sure content you've linked to is backed up on archive.org so a reasonably trustworthy source is available should the canonical one die out, and gives you your own backup in case archive.org and the original both vanish.
I don't currently do this with my own sites, but now I'm questioning why not. I already have the regular link checks, and the second half seems pretty straightforward to add (for static sites, anyway).
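For a static site, that second half could be as simple as the following sketch (file names and the href regex are illustrative; real code would also exclude links to your own domain):

    // Harvest external links from built HTML, remember what has already been
    // submitted, and send anything new to the Wayback Machine's save endpoint.
    const fs = require('fs');

    function harvestLinks(htmlDir) {
      const urls = new Set();
      for (const file of fs.readdirSync(htmlDir)) {
        if (!file.endsWith('.html')) continue;
        const html = fs.readFileSync(`${htmlDir}/${file}`, 'utf8');
        for (const m of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) urls.add(m[1]);
      }
      return urls;
    }

    async function submitNewLinks(htmlDir, manifest = 'archived-links.json') {
      const seen = new Set(
        fs.existsSync(manifest) ? JSON.parse(fs.readFileSync(manifest, 'utf8')) : []);
      for (const url of harvestLinks(htmlDir)) {
        if (seen.has(url)) continue;
        await fetch('https://web.archive.org/save/' + url);   // same endpoint as noted above
        seen.add(url);
      }
      fs.writeFileSync(manifest, JSON.stringify([...seen], null, 2));
    }

    submitNewLinks('./public');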





But how certain is the future of WayBackMachine? When disaster strikes, all your links are dead. On the other hand, the original links can still be read from the URLs, so the original reference is not completely gone.





Yeah, my thoughts were more about the way the Waybackmachine is funded.
I don't feel comfortable sending a bunch of web traffic to them for no reason other than it being convenient. The wayback machine is a web archival project, not your personal content proxy to make sure your links don't go stale.
They need our help both in funding and in action, one simple action is not to abuse their service.





Precisely my first thoughts, too. It's an archive, not a free CDN.
I hope the author of this piece considers donating and promoting donation to their readers: https://archive.org/donate/





INTERNETARCHIVE.BAK:
The INTERNETARCHIVE.BAK project (also known as IA.BAK or IABAK) is a combined experiment and research project to back up the Internet Archive's data stores, utilizing zero infrastructure of the Archive itself (save for bandwidth used in download) and, along the way, gain real-world knowledge of what issues and considerations are involved with such a project. Started in April 2015, the project already has dozens of contributors and partners, and has resulted in a fairly robust environment backing up terabytes of the Archive in multiple locations around the world.
Snapshots from 2002 and 2006 are preserved in Alexandria, Egypt. I hope there's good fire suppression.





I wish there were a way to get a low-rez copy of their entire archive. So, only text, no images, binaries, PDFs (other than PDFs converted to text which they seem to do). As it stands the archive is so huge, the barrier to mirroring is high.





Agreed.
When scoping out the size of Google+, one of ArchiveTeam's recent projects, it emerged that the typical size of a post was roughly 120 bytes, but total page weight a minimum of 1 MB, for a 1% payload to throw-weight ratio. This seems typical of much the modern Web. And that excludes external assets: images, JS, CSS, etc.
If just the source text and sufficient metadata were preserved, all of G+ would be startlingly small -- on the order of 100 GB I believe. Yes, posts could be longer (I wrote some large ones), and images (associated with about 30% of posts by my estimate) blew things up a lot. But the scary thing is actually how little content there really was. And while G+ certainly had a "ghost town" image (which I somewhat helped define), it wasn't tiny --- there were plausibly 100 - 300 million users with substantial activity.
But IA's WBM has a goal and policy of preserving the Web as it manifests, which means one hell of a lot of cruft and bloat. As you note, increasingly a liability.





The external assets for a page could be archived separately though, right? I would think that the static G+ assets: JS, CSS, images, etc. could be archived once, and then all the remaining data would be much closer the 120B of real content. Is there a technical reason that's not the case?





In theory.
In practice, this would likely involve recreating at least some of the presentation side of numerous changing (some constantly) Web apps. Which is a substantial programming overhead.
WARC is dumb as rocks, from a redundancy standpoint, but also atomically complete, independent (all WARCs are entirely self-contained), and reliable. When dealing with billions of individual websites, these are useful attributes.
It's a matter of trade-offs.










WayBackMachine alternative, archive.is, has an option to download zip archive of HTML with images and CSS (but no JS) - this way you can preserve and host a copy of original webpage on your own website





Or just wget -rk...
Mirroring a website isn't so hard that you need a service to do it for you. Your browser even has such a function; try ctrl-s.





The "SingleFile" plugin is a better version of ctrl+s. It will save all pages as single html file and even include images as an octet stream in the file so they aren't missed.





I would be careful in mirroring a site. It's very likely to violate copyright or similar laws, depending on where you are. I think archive.org is considered fair use, but if you put it on a personal or even business page it might be different. For example Google News in EU is very limited in what content they may steal from other web pages.





Doesn't the link to the WayBackMachine contain the original link?





Good idea, but why not both (i.e. link to the webpage, and to the Archive)?
Linking to Archive only makes Archive a single point of failure.





Yes, this makes the most sense in my opinion:
Check out [this link](https://...) ([archived](https://...))
This can also help in the event of a "hug of death"





This is what I do on my blog, with some additional metadata:

    <p>
      <a
        data-archive-date="2020-09-01T22:11:02.287871+00:00"
        data-archive-url="https://web.archive.org/web/20200901221101/https://reubenwu.com/projects/25/aeroglyphs"
        href="https://reubenwu.com/projects/25/aeroglyphs"
      >
        Aeroglyphs
      </a>
      <span class="archive">
        [<a href="https://web.archive.org/web/20200901221101/https://reubenwu.com/projects/25/aeroglyphs">archived</a>]
      </span>
      is an ongoing series of photos of nature with superimposed geometrical shapes drawn by drones.
    </p>





By the way the archive works, isn't the link just adding the https://web.archive.org/web/*/ before the actual link? I guess linking to both is especially important for people not knowing about the existence of archive.org, and a small convenience for everyone. But the link seems to be reversible in either direction.
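Roughly, yes; the mapping is mechanical in both directions (sketch):

    // Build a Wayback calendar URL from an original URL, and recover the
    // original URL from a capture URL by stripping the prefix and timestamp.
    function toWayback(url) {
      return 'https://web.archive.org/web/*/' + url;
    }
    function fromWayback(waybackUrl) {
      return waybackUrl.replace(/^https?:\/\/web\.archive\.org\/web\/[^/]+\//, '');
    }

    // fromWayback('https://web.archive.org/web/20200901221101/https://example.com/page')
    //   === 'https://example.com/page'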





The WBM link includes the canonical source clearly within the URL.





Yeah, and the non-technical users will surely understand that what they need to do when the link doesn't work is:
1. Recognize that it's an Archive.org URL
2. Understand that the link references an archived page whose URL is "clearly" referenced as a parameter
3. Edit the URL (especially pleasant on a cell phone) correctly and try loading that
If you expect the user to be able to go through all this trouble if the Archive is down, you can also expect them to look up the page on the Archive if the link does not load.
But better yet, one shouldn't expect either.





I wonder if the anchor tag should be altered to support this?
Alternatively, this is a good thing for a user agent to handle natively, or through a plugin.





Agreed. I usually link to both the original and then archive.org in parentheses.





I think the fundamental problem here is that URLs locate resources. We find the desired content by finding its location given by an address. Now what server or content lives on that address may change from time to time or may even disappear. This leads to broken links.
The problem with linking to Wayback Machine is that we are still writing archive.org URLs still linking to Wayback Machine servers. What guarantee is there that those archive.org links will not break in future?
It would have been nice if the web were designed to be content-addressable. That is, the identifier or string we use to access a content addresses the content directly, not a location where the content lives. There is good effort going on in this area in the InterPlanetary File System (IPFS) project but I don't think the mainstream content providers on the Internet are going to move to IPFS anytime soon.





I'm all for Archive.org. However, using it in this way — setting up a mirror of some content and purposefully diverting traffic to said mirror — is copyright infringement (freebooting), as it competes with the original source.





This is a bad idea for the reasons that other commenters have already stated. If WayBackMachine falls, all links would fall. Actually the "Web" would stop being one, if all links are all within the same service.
For docs and other texts, I just link to the original site and add an (Archive) suffix, e.g. the "Sources" section in https://doc-kurento.readthedocs.io/en/latest/knowledge/nat.h...
That is a simple and effective solution, yes it is a bit more cumbersome, but it does not bother me.





> So in Feb 14 2019 your users would have seen the content you intended. However in Sep 07 2020, your users are being asked to support independent Journalism instead.
Can you believe it? Yesterday, I tried to walk out of the grocery store with a head of lettuce for free, and they instead were more interested in making me pay money to support the grocery and agricultural business!





Right. I thought it was pretty bad form for him to call this "spam," as though they're the ones wronging him.





This seems like a problem that would be better solved by something like:
1. Browsers build in a system whereby if a link appears dead, they first check against the Wayback Machine to see if a backup exists.
2. If it does, they go there instead.
3. In return for this service, and to offset costs associated with increased traffic, they jointly agree to financially support the Internet Archive in perpetuity.
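A rough sketch of steps 1 and 2, assuming the public availability API at https://archive.org/wayback/available (a browser would presumably do something far more robust):

    // Return the original URL if it still answers, otherwise the closest
    // Wayback Machine snapshot reported by the availability API.
    async function resolveLink(url) {
      try {
        const res = await fetch(url, { method: 'HEAD' });
        if (res.ok) return url;                        // original is alive
      } catch (_) {
        // unreachable: fall through to the archive lookup
      }
      const api = 'https://archive.org/wayback/available?url=' + encodeURIComponent(url);
      const data = await (await fetch(api)).json();
      const snap = data.archived_snapshots && data.archived_snapshots.closest;
      return (snap && snap.available) ? snap.url : url;  // fall back to the original
    }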





Here's a WayBackMachine Link to this article. :)





No one touched on this, but the experience of viewing through the waybackmachine is awful.
Media often won't be saved, so pages look broken. Iframes and the iframe breakers on original sites can kill any navigating.
The waybackmachine is okay for research but a poor replacement for a permalink.





> Media many times will not be saved so pages look broken.
In my experience, this has gotten much, much better in the last few years. I haven't explored enough to know if this is part of the archival process or not, but I've noticed on a few occasions that assets will suddenly appear some time after archiving a page. For instance, when I first archived this page (https://web.archive.org/web/20180928051336/https://www.intel...), none of the stylesheets, scripts, fonts or images were present. However, after some amount of time (days/weeks) they suddenly appeared and I was able to use the site as it originally appeared.





Awesome. Hey, mods... Can you change the link on this post to http://web.archive.org/web/20200908090515/https://hawaiigent...





Or link to your own archive of the content with ArchiveBox!
That way we're not all completely reliant on a central system. (ArchiveBox submits your links to Archive.org in addition to saving them locally).
Also many other tools that can do this too:





Take a look at _Robustify Your Links_.[1] It is an API and a snippet of JavaScript that saves your target HREF in one of the web archiving services and adds a decorator to the link display that offers the option to the user to view the web archive.





Thank you! I've only been using the labor-intensive trust-issues version of this: paraphrasing things in my own words and linking to THAT.
I think I've been curating about 200 essays so far like that. You're now making me rethink my flow.





This man’s entire argument is completely terrible for two reasons:
1) The example he uses is The Epoch Times, a questionable source even on the best of days.
2) What he refers to as "spam" is a paywall. He is literally taking away from business opportunities for this outlet that produced a piece of content he wants to draw attention to, but he does not want to otherwise support.
He’s a taker. And while the Wayback Machine is very useful for sharing archived information, that’s not what this guy is doing. He’s trying to undermine the business model of the outlets he’s reading.
The Epoch Times is one thing—it’s an outlet that is essentially propaganda—but when he does this to a local newspaper or an actual independent media outlet, what happens?





They're hyper right wing Qanon/antivax spreaders associated with the Falun Gong movement.





> 2) What he refers to as "spam" is a paywall. He is literally taking away from business opportunities for this outlet that produced a piece of content he wants to draw attention to, but he does not want to otherwise support.
For the destination site, this is all of the downsides of AMP with none of the upsides.





The idea of being able to access the URL once it is gone is good. However this also means that any updates made to the original page are no longer seen.
Not all updates are about "begging for money" as the example in the article.





Is there any WordPress plugin that adds a link to the WayBack Machine next to the original link? I would use something like that.





Look at the format of the wayback machine URL. It's trivial to generate.
Where a WP plugin would add value is by saving to the archive whenever WP publishes a new or edited article.










Apropos of nothing but I added the ability to archive links in Anarki a few months back[0]. If dang or someone wants to take it for HN they're welcome to. Excuse the crappy quality of my code and pr format, though.
It might be useful as a backup if the original site starts getting hugged to death.





I once discovered an information leak at the German public broadcasting organization ARD: their CI/CD page showing the business card designs leaked real mobile numbers (lol).
All records of this page on Archive.org were deleted after a couple of days; a Twitter account posting the details with a screenshot and link was reported, and my account was temporarily suspended.
I assume it must be very easy to remove inconvenient content from archive.org.





The Wayback Machine is slow (slower than many bloated websites), so it's not a good enough experience for the person clicking on that link.
Secondly, I personally don't like the fact that the Wayback Machine doesn't provide an easy way to get content removed and to stop it indexing and caching content (the only way I know of is to email them, with delayed responses or responses that don't help). It's far easier to get content de-indexed in the major search engines. I know the team running it has reasons to archive anything and everything (as) permanently (as possible), but that doesn't serve everybody's needs.





> Now it’s spam from a site suffering financial need. Well, yeah!
Of course, linking to the WBM is not the main reason why a site might be in this situation, but it piles up.





While I certainly wouldn't do this with every page, or every time, I've gotten so anxious about link rot lately that I reflexively save any good content I come across to the Wayback Machine.
The bookmarklet makes this really convenient.





This is such a fundamental problem that I'd like to be able to solve it at the HTML level.
An anchor type which allows several URLs, to be tried in order, would go a long way. Then we could add automatic archiving and backup links to a CMS.
It isn't real content-centric networking, which is a pity, but it's achievable with what we have.
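No such anchor type exists today, but as a rough sketch of the CMS half of the idea, a template helper could emit each link with an archive URL attached in a data attribute, which a small client-side script (or a future HTML feature) could fall back on when the primary href fails. The attribute name `data-archive-href` below is invented for illustration.

```python
import html


def link_with_fallback(href, text, archive_href=None):
    """Render an <a> tag that also carries an archive fallback URL in a
    (hypothetical) data-archive-href attribute."""
    if archive_href is None:
        # With no stored snapshot, use the Wayback Machine's redirect to
        # the latest capture of the same URL.
        archive_href = "https://web.archive.org/web/" + href
    return (
        f'<a href="{html.escape(href, quote=True)}" '
        f'data-archive-href="{html.escape(archive_href, quote=True)}">'
        f"{html.escape(text)}</a>"
    )


print(link_with_fallback("https://example.com/some-article", "some article"))
```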





This seems like a risky strategy, what with the pending lawsuit against archive.org over their National Emergency Library: I am fully expecting that web.archive.org will go away permanently within a few years.





There's some subtle irony in that the linked site is not in fact a WayBackMachine link, but instead a direct link to the site.





While I generally disagree, because I'd rather my site was the one getting the hits (and I'd rather give the same courtesy to other authors), this does give me the idea of checking for (or creating, if none exists) an archive link for whatever I reference, and including that archive link in the metadata of every link I include.
Users will find the archive link if they really want to, and it will make it easier for me to replace broken links in the future.





I link to the original, but archive it in both WayBackMachine and Archive.is.





This is both a good and a scary idea. On the good side, I'm frustrated that some unscrupulous websites (even some news outlets) secretly alter their content without mentioning the change, and I want a mechanism that holds the publisher responsible. At the same time, it's scary because we're basically using one private organization as a single arbitrator. (I know it's a nonprofit, but it's probably not as public as a government entity.) Maybe it's good for the time being, but we should be aware that this solution is far from perfect.





Public "or" a government entity.





Yeah, that's another problem with the design of the web, and kind of a significant one! Somewhat pointless to link to external documents when half of them won't be around next year.





I made a Chrome extension called Capsule that works perfectly for this use case. With just a click, you can create a publicly shareable link that preserves the webpage exactly as you see it in your browser.





Does it use SingleFile under the hood? What storage format does it use, and is it portable (e.g. WARC/memento/zim/etc.)?





Gotta completely agree ... for anything you need to be stable and available.
I've been building lists of -reference- URLs for over a decade ... and the ones aimed at Archive.org (are slower to load, but) are much more reliable.
Saved Wayback URLs contain the original site URL. It's really easy to check it to see if the site has deteriorated (usually it has). If it's gotten better ... it's easy to update your saved WB link.





It's probably better to link to both. If a site corrects a story, your readers will want to see the correction, but if the page disappears, it's good to have the backup.





I experienced this just the other day.
I was browsing an old HN post from 2018, with lots of what seemed like useful links to their blog.
Upon visiting it, the site had been rebranded and the blog entries had disappeared.
The Wayback Machine saved me in this case, but a link to it originally would have saved me a few clicks.





I maintain a fork of a program that does exactly this! You can check it out here.





If it's to actually reference a third party source, it's probably better to make a self-hosted copy of the page. You can print it to a PDF file for example. I don't believe archive.org is eternal, or that its pages will remain the same.





I wrote a link checker[1] to detect outbound links and mark dead ones, so that I can replace them manually with archive.org links.
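For anyone curious what that kind of checker involves, a minimal sketch only needs a liveness probe plus the Wayback Machine's public availability API; the links in the loop below are placeholders.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def is_alive(url):
    """Very rough liveness check: does the URL respond without an error?"""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status < 400
    except (urllib.error.URLError, ValueError):
        return False


def wayback_snapshot(url):
    """Ask the availability API for the closest archived snapshot, if any."""
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(api, timeout=10) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None


for link in ["https://example.com/", "https://example.com/long-gone-page"]:
    if not is_alive(link):
        print(f"dead: {link} -> suggested replacement: {wayback_snapshot(link)}")
```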





The Wayback Machine helps me on a daily basis. So many old links are dead.
The other day, I noticed that even old links from the front pages of Google and YouTube are dead now; the Internet Archive still has them. I was very disappointed that even Google has dead links.





The proper way is for a site to expose a canonical link to an article via a meta-link (rel=canonical) if necessary, and then have a browser plugin automatically try archive.org, with a URL generated from the canonical one, if the original is down.
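As a sketch of that logic (in Python rather than an actual browser plugin): pull the rel=canonical URL out of the page's HTML and turn it into a Wayback Machine URL to try when the page itself is unreachable. The example page and URLs are placeholders.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of <link rel="canonical"> if the page has one."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and "canonical" in (attrs.get("rel") or "").split():
            self.canonical = attrs.get("href")


def archive_fallback(page_html, page_url):
    """Build a Wayback Machine URL from the canonical URL, falling back to
    the page's own URL if no canonical link is present."""
    finder = CanonicalFinder()
    finder.feed(page_html)
    return "https://web.archive.org/web/" + (finder.canonical or page_url)


print(archive_fallback('<link rel="canonical" href="https://example.com/article">',
                       "https://example.com/article?utm_source=feed"))
```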





Just another reason to have content-addressable storage everywhere: then at least if the content changed you'll know it changed, and if you can't get the original content anymore, the change is probably malicious.





You could link to the original web URL and also make a print version of the web content as a PDF. That's how I archive how-tos and write-ups of interesting content: open the print view and create a PDF version.





In the past I would fall back to the WBM when something was no longer online. Recently, though, I've been bookmarking interesting content very rigorously and just rely on the archival feature of my bookmarking software.





Maybe the solution isn't technical and we should look at other fields that have relied on referencing credible sources for a long time? I can think of research, news and perhaps law.





So it can be deleted too?
Or so there is no engagement at the source?





I think a good solution might be to host the archive version yourself (archive.org is slow, and always using it centralizes everything there).
Let's say you write an article on your site, https://yoursite.com/my-article, and from it you want to link to an article https://example.com/some-article
You then create a mirror of https://example.com/some-article to be served from your site at https://yoursite.com/mirror/2019-09-08/some-article (disallow /mirror/ in robots.txt and set it to noindex, or maybe even better, add a rel="canonical" pointing to the original article), and at the top of this mirrored page you add a header bar containing a link to the original article, as well as one to archive.org if you want.
tl;dr instead of linking to https://example.com/some-article you link to https://yoursite.com/mirror/2019-09-08/some-article (which has links to the original)
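A very rough sketch of the mirroring step might look like the following; the paths, banner markup, and output filename are all illustrative, and a real setup would also copy assets and send noindex headers.

```python
import datetime
import pathlib
import urllib.request

ORIGINAL = "https://example.com/some-article"  # placeholder
MIRROR_DIR = pathlib.Path("mirror") / datetime.date.today().isoformat()

# Banner prepended to the mirrored page, linking back to the original
# article and to the Wayback Machine.
banner = (
    '<div style="padding:8px;background:#ffc">Mirrored copy. '
    f'Original: <a href="{ORIGINAL}">{ORIGINAL}</a> | '
    f'<a href="https://web.archive.org/web/{ORIGINAL}">Wayback Machine</a></div>\n'
)

with urllib.request.urlopen(ORIGINAL, timeout=30) as resp:
    body = resp.read().decode("utf-8", errors="replace")

MIRROR_DIR.mkdir(parents=True, exist_ok=True)
(MIRROR_DIR / "some-article.html").write_text(banner + body, encoding="utf-8")
print("mirrored to", MIRROR_DIR / "some-article.html")
```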





It would be good to create a distributed, consensus-backed version of the content (to help stop silent edits) rather than have a single point of failure...





On the same topic, I wish I could link to highlights in the page. Having a spec for highlights in URLs would be neat.










Clever way to make the reference immutable.
Some blockchain will end up taking care of this.





I find that web archive pages always appear broken; perhaps a lot of the JS or CSS is not properly archived?





Has the Wayback Machine stopped retroactively applying robots.txt?
If not, links to it are one misconfiguration or one parked domain away from being wiped.





What would be even cooler is if there was an easy way to turn your own server into a Wayback machine, so that when your server rendered a webpage, it would use the original link if available, or its own cached version if not.
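As a sketch of how that server-side fallback could work (the cache directory and layout below are made up): when rendering a page, check whether the original link still resolves, and otherwise point at a copy the server cached when the page was first published.

```python
import pathlib
import urllib.error
import urllib.parse
import urllib.request

CACHE_DIR = pathlib.Path("link-cache")  # hypothetical local snapshot store


def resolve_link(url):
    """Return the original URL if it still responds, otherwise the path of a
    locally cached copy (if one exists)."""
    try:
        head = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(head, timeout=5):
            return url
    except (urllib.error.URLError, ValueError):
        cached = CACHE_DIR / (urllib.parse.quote(url, safe="") + ".html")
        return f"/cache/{cached.name}" if cached.exists() else url


print(resolve_link("https://example.com/maybe-gone"))
```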





WBM is like a content snapshot. You can't go back in time and change anything. That's why it is better than linking to the original.





Hmm. Is there a place for a service that makes a permanent copy of content, available at the original URL as of the time of posting?





I link to the Wayback Machine because, as a freelancer, I've built a great many greenfield applications for startups that only existed for about 6-8 months before hitting their burn rate. If I linked to their original domains, my portfolio would be a list of 404s.





I stopped reading after the part where they describe the paywall-gated version of the journalism website as "Now it’s spam from a site suffering financial need."
That website spends money creating content for commercial viability; it doesn’t have to bow to you and make sure you can consume it for free, and the Wayback Machine isn’t a tool for you to bypass premium content.





Is there a Chrome app that utilises the Wayback Machine?





For anything important, you can't beat a good save-to-PDF feature in the browser. You can then upload the PDF and link to that instead. Someone should make a WordPress plugin to do this automatically.





If it's not distributed, it is going to disappear.
The Wayback Machine is backed by WARC files. It's perhaps the only thing on archive.org that can't be downloaded... well, except the original MPG files of the 9/11 news footage.





This behaviour should be reported to the WayBackMachine as abuse.





He is actually showcasing a very nice technique to get around paywalls: turn off JS. Often that alone is enough to get past the paywall. I believe the archives also disable JS when grabbing the content.





That is changing. I've noticed over the past couple of years that sites that could be accessed with JS turned off now show a "Please enable JavaScript to continue" message (Quora) or just hide the content entirely (Business Insider).
I'm sure there are other examples as well.





Not surprised. When paywalls started becoming a thing, most of them could be circumvented simply by removing a DOM element and some CSS classes. Nowadays this is basically not possible anywhere anymore.





Just FYI, archive.org is banned in a few countries, including the UAE, where I cannot open any links to it.





Huh, I wonder if they are also blocking mirrors. Also, in countries with restrictions on internet access, you probably want to make using Tor a general habit.





In practice, however, archive.org has censored content based on political preference.





Sounds plausible, but I sure would like a citation for that claim.





They exclude Snopes and I think Salon from archiving.





I do have two links in my "clownworld" link list, but ironically they're both in subreddits that have since been banned and are therefore not available anymore.





I think this is a good idea, especially because the Wayback Machine uses good content security policies to block some of the intrusive JS that ad-dependent sites like to push on people. So you're not only protecting against future 404 scenarios, but also protecting your visitors' privacy from the unscrupulous ad-tech that seems to be everywhere now.
The example in the article, showing how a site looked cleaner before, could simply be the Wayback Machine's content security policies preventing the clutter from loading, rather than any specific changes on the site, although I haven't checked that particular site.






