
A while back, there was this pie-in-the-sky idea which was really interesting but not too practical, called the Semantic Web. It didn't really pan out because it turns out that annotating your sites with metadata is boring and tedious and nobody really liked to do it, and anyway, search and Bayesian statistics simulated the big ideas of the Semantic Web well enough for most people.

The ideas behind it live on, though, in microformats. These are just standardized ways of using existing HTML to structure particular kinds of data, so any program (browser plug-in, web crawler, &c) can scrape through my data and parse it as metadata, more precisely and with greater semantic content than raw text search, but without the tedium that comes with ontologies and RDF.

Now, these ideas are about the structured exchange of information between arbitrary nodes on the internet. If every recipe site used the hRecipe microformat, for example, I could write a recipe search engine which automatically parses the given recipes and supplies them in various formats (recipe card, full-page instructions, &c), because I have a recipe schema, and arbitrary recipes I've never seen before, on sites my crawler just found, conform to it. I could write a local client that does the same thing, or a web app which consolidates the recipes from other sites into my own personal recipe book. It turns the internet into much more of a net, and makes pulling together this information in new and interesting ways tenable. In its grandest incarnation, using the whole internet would be like using Wolfram Alpha.
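
To make that concrete, here's a rough sketch of the kind of extraction such a crawler or client could do. The class names ("hrecipe", "fn", "ingredient") come from the hRecipe draft; the Recipe shape and the sample markup are just made up for illustration (TypeScript, run against any DOM):

    // Sketch: pull structured recipes out of hRecipe-annotated markup.
    interface Recipe {
      name: string;
      ingredients: string[];
    }

    function extractRecipes(doc: Document): Recipe[] {
      return Array.from(doc.querySelectorAll(".hrecipe")).map((node) => ({
        // "fn" is the recipe title; "ingredient" marks each ingredient line.
        name: node.querySelector(".fn")?.textContent?.trim() ?? "",
        ingredients: Array.from(node.querySelectorAll(".ingredient")).map(
          (el) => el.textContent?.trim() ?? ""
        ),
      }));
    }

    // Works the same on any site that uses the microformat:
    const sample = `
      <div class="hrecipe">
        <h2 class="fn">Pancakes</h2>
        <span class="ingredient">2 cups flour</span>
        <span class="ingredient">1 egg</span>
      </div>`;
    const parsed = new DOMParser().parseFromString(sample, "text/html");
    console.log(extractRecipes(parsed));
    // [{ name: "Pancakes", ingredients: ["2 cups flour", "1 egg"] }]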

The #! has precisely the opposite effect. If you offer #! urls and nothing else, then you are making your site harder to process by anything except a human being sitting at a full-stack, JS-enabled, HTML5-ready web browser; you are actively hindering any other kind of data exchange. Using #!-only is a valid choice, and I'm not saying it's always the wrong one: web apps definitely benefit from #! much more than they do from awkward backwards compatibility. But using #! without graceful degradation of your pages turns the internet from interconnected realms of information into what amounts to a distribution channel for your webapps. It actively hinders communication between anybody but the server and the client, and closes off lots of ideas about what the internet could be, and those ideas are not just "SEO is harder and people can't use curl anymore."

I don't want to condemn experimentation, either, and I'm as excited as anyone to see what JS can do when it's really unleashed. But framing this debate as an argument between crotchety graybeards and The Daring Future Of The Internet misses a lot of the subtleties involved.



Very interesting points, but there are a couple of errors which undermine part of your point:

1. If the application follows the Google-proposed convention or similar, the crawler doesn't need a full-stack JS implementation; it just needs to do the (trivial) URL remapping.
2. Nothing in this hash-bang approach requires an HTML5-ready browser.
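
For the record, the remapping in question (per Google's AJAX-crawling proposal) is just: swap the "#!" for an "_escaped_fragment_=" query parameter and request that URL instead. Roughly, in TypeScript (the URLs are only examples, and this ignores pre-existing query strings and percent-encoding):

    // A crawler that sees   http://example.com/#!/recipes/42
    // instead requests      http://example.com/?_escaped_fragment_=/recipes/42
    // and the server is expected to return pre-rendered content for that state.
    function crawlableUrl(url: string): string {
      return url.includes("#!") ? url.replace("#!", "?_escaped_fragment_=") : url;
    }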


I tried both curl and wget last night (neither of these is an HTML5-ready browser), and neither of them could get content using the hash-bang URL. They both came back with an empty page skeleton, which makes sense: the fragment (everything after the #) is never sent to the server, so the server has nothing to go on but the bare path.

Also, how do you reassemble the hash-bang URL from the HTTP Referer header? The fragment never appears there either.


Neither curl nor wget follows the Google convention for handling hashbangs suggested by the parent, so I'm not sure what you're getting at with this reply.


Hash-bang URLs are not reliable references to content - that's what I'm getting at. curl and wget are perhaps the most used non-browser user agents on the web, and both of them are unable to retrieve the content a hash-bang URL refers to.

In this context, hash-bang URLs are broken.


I'm sorry if I implied that curl/wget handle this already. However, they could handle this with a very small wrapper script, maybe 3 lines of code, or a very short patch if the convention becomes a standard. That's not nothing, but it's maybe 7 orders of magnitude lighter than a full JS engine, and it's small anyway compared to the number of cases that a reasonable crawler needs to handle.
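
Something in this ballpark, assuming the site actually serves the _escaped_fragment_ version of each page (the sketch assumes a JS runtime with a global fetch and process.argv, e.g. a recent Node run as an ES module; details are purely illustrative):

    // Tiny wrapper: remap the hash-bang per the convention, then fetch it
    // like any other URL. Real code would also handle existing query strings.
    const url = process.argv[2].replace("#!", "?_escaped_fragment_=");
    const body = await fetch(url).then((r) => r.text());
    console.log(body);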

Also, even with that wrapper or patch, curl and wget will still not be remotely HTML5-ready, which I hope demonstrates that HTML5 is not a requirement in any way. The fact that a particular non-HTML5 user agent can't handle this out of the box doesn't make HTML5 a requirement.


They aren't? You're only supposed to use them if you follow Google's convention, in which case they can be reliably mapped back to a normal URL sans the hash. Of course your scraper must be aware of this, but it should be a somewhat reliable pseudo-standard (and it is just a stopgap, after all).


We're talking about different internets, though. You're talking about the hypothetical patched internet that uses Google's #! remapping, whereas I'm talking about the internet as it exists right now. If I go to Gawker with lynx right now, it will not work, period. The fact that an implementation is documented somewhere, and that it happens to be trivial, doesn't mean it should become the standard across the board.

I hate to invoke a slippery slope, but it seems a frightening proposition that $entity can start putting out arbitrary standards and suddenly the entire Internet infrastructure has to follow suit in order to stay compatible. It's happened before, e.g. favicon.ico. These are noble ideas (personalize bookmarks and site feel, keep Ajax content accessible) with troublesome implementations (thousands of redundant GET /favicon.ico requests instead of something like a <link> element, existing infrastructure forced to change just to keep operating as usual).

All of this is moot, of course, if you just write your pages to fall back sensibly instead of doing what Gawker did and leaving no backwards-compatible, text-only fallback. Serve real links and have JS rewrite them from "foo/bar" to "#!foo/bar" on load; then script-less user agents and full browsers are both happy.
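
A sketch of what that rewrite could look like (the selector and details are just illustrative): crawlers, curl, lynx, &c simply follow the plain links, while the app's hashchange routing takes over in scripted browsers.

    // Progressive enhancement: markup ships ordinary hrefs; script-capable
    // browsers rewrite them to the #! form once the page loads.
    document.addEventListener("DOMContentLoaded", () => {
      document.querySelectorAll<HTMLAnchorElement>("a[href^='/']").forEach((a) => {
        const href = a.getAttribute("href");
        if (href) a.setAttribute("href", "#!" + href);
      });
    });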


> If I go to Gawker with lynx right now, it will not work, period.

As a specific issue, that seems like a minus, but an exceedingly minor one, as lynx is probably a negligible proportion of Gawker's audience. In principle, backwards-compatibility is a great thing, until it impedes some kind of desirable change, such as doing something new or doing it more economically.

> it seems a frightening proposition that $entity can start putting out arbitrary standards

I generally do want someone putting out new standards, and sometimes it's worth breaking backwards compatibility to an extent. So it really depends on $entity: if it's the WHATWG, great. If it's Google, then more caution is warranted. But there have been plenty of cases of innovations (e.g. canvas) starting with a specific player and going mainstream from there. I do agree that Google's approach feels like an ugly hack in a way that is reminiscent of favicon.ico.

> All of this is moot, of course...

This is good general advice, but it's not always applicable. At least one webapp I've worked on has many important ajax loads triggered by non-anchor elements; it's about as useful in lynx as Google Maps would be. The devs could go through and convert as much as possible to gracefully degrading anchors, which would at least partly help with noscript, but it seems like a really bad use of resources given the goals of that app.


Ah, but the #! page is probably just using JS to access a well-defined API - the same API anyone else can access in a completely uncluttered, machine-readable form.

So perhaps the solution is for every #! page to have a meta tag pointing to the canonical API resource which it is drawing data from. Bingo, semantic web!
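
Hypothetically (none of this is an existing standard), each page could carry something like <link rel="alternate" type="application/json" href="/api/whatever">, and any client, scraper or browser, could discover the raw data from it:

    // Hypothetical convention: the #! page advertises the API resource it
    // renders via a <link rel="alternate" type="application/json"> element.
    const alt = document.querySelector<HTMLLinkElement>(
      'link[rel="alternate"][type="application/json"]'
    );
    if (alt) {
      fetch(alt.href)
        .then((r) => r.json())
        .then((data) => console.log(data)); // the same data the #! view renders
    }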


You also have to ensure every relevant site (in this example, every site that would have used hRecipe) uses the same API scheme.



