
A while back, there was this pie-in-the-sky idea which was really interesting but not too practical, called the Semantic Web. It didn't really pan out because it turns out that annotating your sites with metadata is boring and tedious and nobody really liked to do it, and anyway, search and Bayesian statistics simulated the big ideas of the Semantic Web well enough for most people.

The ideas behind it live on, though, in microformats. These are just standardized ways of using existing HTML to structure particular kinds of data, so any program (browser plug-in, web crawler, &c) can scrape through my data and parse it as metadata, more precisely and with greater semantic content than raw text search, but without the tedium that comes with ontologies and RDF.

Now, these ideas are about the structured exchange of information between arbitrary nodes on the internet. If every recipe site used the hRecipe microformat, for example, I could write a recipe search engine which automatically parses the given recipes and supplies them in various formats (recipe card, full-page instructions, &c), because I have a recipe schema, and arbitrary recipes I've never seen before, on sites my crawler just found, conform to it. I could write a local client that does the same thing, or a web app which consolidates the recipes from other sites into my own personal recipe book. It turns the internet into much more of a net, and makes pulling together this information in new and interesting ways tenable. In its grandest incarnation, using the whole internet would be like using Wolfram Alpha.
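
To make that concrete, here's a rough sketch of the kind of extraction such a crawler or client could do. The class names ("hrecipe", "fn", "ingredient") come from the hRecipe draft; the Recipe shape and the sample markup are just made up for illustration (TypeScript, run against any DOM):

    // Sketch: pull structured recipes out of hRecipe-annotated markup.
    interface Recipe {
      name: string;
      ingredients: string[];
    }

    function extractRecipes(doc: Document): Recipe[] {
      return Array.from(doc.querySelectorAll(".hrecipe")).map((node) => ({
        // "fn" is the recipe title; "ingredient" marks each ingredient line.
        name: node.querySelector(".fn")?.textContent?.trim() ?? "",
        ingredients: Array.from(node.querySelectorAll(".ingredient")).map(
          (el) => el.textContent?.trim() ?? ""
        ),
      }));
    }

    // Works the same on any site that uses the microformat:
    const sample = `
      <div class="hrecipe">
        <h2 class="fn">Pancakes</h2>
        <span class="ingredient">2 cups flour</span>
        <span class="ingredient">1 egg</span>
      </div>`;
    const parsed = new DOMParser().parseFromString(sample, "text/html");
    console.log(extractRecipes(parsed));
    // [{ name: "Pancakes", ingredients: ["2 cups flour", "1 egg"] }]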

The #! has precisely the opposite effect. If you offer #! urls and nothing else, then you are making your site harder to process by anything except a human being sitting at a full-stack, JS-enabled, HTML5-ready web browser; you are actively hindering any other kind of data exchange. Using #!-only is a valid choice, and I'm not saying it's always the wrong one: web apps definitely benefit from #! much more than they do from awkward backwards compatibility. But using #! without graceful degradation of your pages turns the internet from interconnected realms of information into what amounts to a distribution channel for your webapps. It actively hinders communication between anybody but the server and the client, and closes off lots of ideas about what the internet could be, and those ideas are not just "SEO is harder and people can't use curl anymore."

I don't want to condemn experimentation, either, and I'm as excited as anyone to see what JS can do when it's really unleashed. But framing this debate as an argument between crotchety graybeards and The Daring Future Of The Internet misses a lot of the subtleties involved.



Very interesting points, but there are a couple of errors which undermine part of your point:

1. If the application follows the Google-proposed convention or similar, the crawler doesn't need a full-stack JS implementation; it just needs to do the (trivial) URL remapping.
2. Nothing in this hash-bang approach requires an HTML5-ready browser.
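
For the record, the remapping in question (per Google's AJAX-crawling proposal) is just: swap the "#!" for an "_escaped_fragment_=" query parameter and request that URL instead. Roughly, in TypeScript (the URLs are only examples, and this ignores pre-existing query strings and percent-encoding):

    // A crawler that sees   http://example.com/#!/recipes/42
    // instead requests      http://example.com/?_escaped_fragment_=/recipes/42
    // and the server is expected to return pre-rendered content for that state.
    function crawlableUrl(url: string): string {
      return url.includes("#!") ? url.replace("#!", "?_escaped_fragment_=") : url;
    }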


I tried both curl and wget last night (neither of these is an HTML5-ready browser), and neither of them could get content using the hash-bang URL. They both came back with an empty page skeleton, which makes sense: the fragment (everything after the #) is never sent to the server, so the server has nothing to go on but the bare path.

Also, how do you reassemble the hash-bang URL from the HTTP Referer header? The fragment never appears there either.


Neither curl nor wget follows the Google convention for handling hashbangs suggested by the parent, so I'm not sure what you're getting at with this reply.


Hash-bang URLs are not reliable references to content - that's what I'm getting at. curl and wget are perhaps the most used non-browser user agents on the web, and both of them are unable to retrieve the content a hash-bang URL refers to.

In this context, hash-bang URLs are broken.


I'm sorry if I implied that curl/wget handle this already. However, they could handle this with a very small wrapper script, maybe 3 lines of code, or a very short patch if the convention becomes a standard. That's not nothing, but it's maybe 7 orders of magnitude lighter than a full JS engine, and it's small anyway compared to the number of cases that a reasonable crawler needs to handle.
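
Something in this ballpark, assuming the site actually serves the _escaped_fragment_ version of each page (the sketch assumes a JS runtime with a global fetch and process.argv, e.g. a recent Node run as an ES module; details are purely illustrative):

    // Tiny wrapper: remap the hash-bang per the convention, then fetch it
    // like any other URL. Real code would also handle existing query strings.
    const url = process.argv[2].replace("#!", "?_escaped_fragment_=");
    const body = await fetch(url).then((r) => r.text());
    console.log(body);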

Also, even with that wrapper or patch, curl and wget will still not be remotely HTML5-ready, which I hope demonstrates that HTML5 is not a requirement in any way. The fact that a particular non-HTML5 user agent can't handle this out of the box doesn't make HTML5 a requirement.


They aren't? You're only supposed to use them if you follow Google's convention, in which case they can be reliably mapped back to a normal URL sans the hash. Of course your scraper must be aware of this, but it should be a somewhat reliable pseudo-standard (and it is just a stopgap, after all).


We're talking about different internets, though. You're talking about the hypothetical patched internet that uses Google's #! remapping, whereas I'm talking about the internet as it exists right now. If I go to Gawker with lynx right now, it will not work, period. The fact that an implementation is documented somewhere, and that it happens to be trivial, doesn't mean it should become the standard across the board.

I hate to invoke a slippery slope, but it seems a frightening proposition that $entity can start putting out arbitrary standards and suddenly the entire Internet infrastructure has to follow suit in order to stay compatible. It's happened before, e.g. favicon.ico. These are noble ideas (personalize bookmarks and site feel, keep Ajax content accessible) with troublesome implementations (thousands of redundant GET /favicon.ico requests instead of something like a <link> element, existing infrastructure forced to change just to keep operating as usual).

All of this is moot, of course, if you just write your pages to fall back sensibly instead of doing what Gawker did and leaving no backwards-compatible, text-only fallback. Serve real links and have JS rewrite them from "foo/bar" to "#!foo/bar" on load; then script-less user agents and full browsers are both happy.
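
A sketch of what that rewrite could look like (the selector and details are just illustrative): crawlers, curl, lynx, &c simply follow the plain links, while the app's hashchange routing takes over in scripted browsers.

    // Progressive enhancement: markup ships ordinary hrefs; script-capable
    // browsers rewrite them to the #! form once the page loads.
    document.addEventListener("DOMContentLoaded", () => {
      document.querySelectorAll<HTMLAnchorElement>("a[href^='/']").forEach((a) => {
        const href = a.getAttribute("href");
        if (href) a.setAttribute("href", "#!" + href);
      });
    });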


> If I go to Gawker with lynx right now, it will not work, period.

As a specific issue, that seems like a minus, but an exceedingly minor one, as lynx is probably a negligible proportion of Gawker's audience. In principle, backwards-compatibility is a great thing, until it impedes some kind of desirable change, such as doing something new or doing it more economically.

> it seems a frightening proposition that $entity can start putting out arbitrary standards

I generally do want someone putting out new standards, and sometimes it's worth breaking backwards compatibility to an extent. So it really depends on $entity: if it's the WHATWG, great. If it's Google, then more caution is warranted. But there have been plenty of cases of innovations (e.g. canvas) starting with a specific player and going mainstream from there. I do agree that Google's approach feels like an ugly hack in a way that is reminiscent of favicon.ico.

> All of this is moot, of course...

This is good general advice, but it's not always applicable. At least one webapp I've worked on has many important ajax loads triggered by non-anchor elements; it's about as useful in lynx as Google Maps would be. The devs could go through and convert as much as possible to gracefully degrading anchors, which would at least partly help with noscript, but it seems like a really bad use of resources given the goals of that app.


Ah, but the #! page is probably just using JS to access a well-defined API - the same API anyone else can access in a completely uncluttered, machine-readable form.

So perhaps the solution is for every #! page to have a meta tag pointing to the canonical API resource which it is drawing data from. Bingo, semantic web!
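
Hypothetically (none of this is an existing standard), each page could carry something like <link rel="alternate" type="application/json" href="/api/whatever">, and any client, scraper or browser, could discover the raw data from it:

    // Hypothetical convention: the #! page advertises the API resource it
    // renders via a <link rel="alternate" type="application/json"> element.
    const alt = document.querySelector<HTMLLinkElement>(
      'link[rel="alternate"][type="application/json"]'
    );
    if (alt) {
      fetch(alt.href)
        .then((r) => r.json())
        .then((data) => console.log(data)); // the same data the #! view renders
    }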


You also have to ensure every relevant site (in this example, every site that would have used hRecipe) uses the same API scheme.



