Also, a word count on patio11's submissions: 1,052,351. For comparison, all 7 Harry Potter books total 1,084,170 words. patio11 has written the entire Harry Potter series worth of content on HN. Just... wow.
It would be interesting to redo the top karma list by comment karma only. It would also be interesting to look at liberal vs conservative political sentiment over time. I feel like the site has gotten more liberal over the years, especially after all the Chelsea Manning stuff, but not sure if that is accurate.
I also included the number of thread wins (user made comment with highest number of points in a given submission thread) to see if there was any unusual relationship. There wasn't: correlation is 0.89. I ended up not including it in the article because it's long enough as-is. :P
Thanks! I would have thought that I'd have been higher ranked when looking at only comments, although actually my rank is almost the same. Interesting to see that many of the other top users are using the site in more or less the same way.
-- Rank users by comment karma alone: sum each author's comment points.
-- (Assumes a local dump of HN comments in an hn_comments table.)
SELECT
  author,
  SUM(num_points) AS total_comment_karma,
  MAX(created_at) AS last_comment
FROM hn_comments
GROUP BY author
ORDER BY total_comment_karma DESC
LIMIT 1000
We seem to think that time spent learning is lost labor, rather than that time spent laboring is lost time for learning, although we know that both learning and laboring are required for productivity, and that learning is capex.
Thanks, I had been curious about that number for a while. The last time I checked it was 500k or so.
For folks who want to do interesting things with the API but don't want to be abusive to Firebase's servers, I whipped up a quick ruby script to cache a particular user's comments/submissions on disk: https://gist.github.com/patio11/1550cad3a02edd175049
It tries to rate limit itself by putting 200ms of sleep between requests, so downloading all of my comments would take ~30 minutes.
"I release this work unto the public domain." -- feel free to adapt it to your needs.
Usage is "ruby slurper.rb $USERNAME $MAX_COMMENTS_TO_FETCH."
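For anyone not on Ruby, here's a minimal sketch of the same rate-limited idea in JavaScript (assumes Node 18+ for built-in fetch; the endpoints are the official v0 ones, and the 200ms pause mirrors the script's sleep):

  // Fetch a user's submitted item ids, then pull each item with a ~200ms pause.
  const API = "https://hacker-news.firebaseio.com/v0";
  const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

  async function slurp(username, maxItems) {
    const user = await (await fetch(`${API}/user/${username}.json`)).json();
    const items = [];
    for (const id of (user.submitted || []).slice(0, maxItems)) {
      items.push(await (await fetch(`${API}/item/${id}.json`)).json());
      await sleep(200); // be polite to Firebase's servers
    }
    return items;
  }

  slurp("patio11", 50).then((items) => console.log(items.length, "items fetched"));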
Thanks for being an outrageously good resource and beacon of inspiration! You've unknowingly been one of the most influential role models in my career/life: I just relaunched one of my side projects as SaaS last month and it's succeeded beyond my wildest expectations (already at ~$8k YRR). Hopefully I can follow your trajectory and never have to actually work another day in my life :)
Congratulations on the success. Nothing in business makes me as happy as folks telling me that what I wrote/did/etc helped them out.
Though I don't know if I'd describe my lifestyle as "never having to actually work another day in my life." It feels less like work some days and more like work others. For example, it is 1:30 AM and while I could be snug in my bed I am instead clearing out the AR support inbox. (Poor planning earlier today, but still.)
"Actual work" meaning Japanese salaryman/having-a-boss-that-tells-you-what-to-do-and-when-to-do-it. I think I'm still in the honeymoon phase of customer support: I still get little rushes of adrenaline even for angry complaint emails ("something that I've created has provided so much utility for someone that they're angry when it doesn't").
It's certainly difficult at times, but when you can set your own hours, work wherever you want on whatever you want, and take as much time off as you want for any reason, it's difficult to justify using the W word.
I wasn't going to comment, but "mhartl" looked familiar. I just wanted to say "thank you" for your Rails Tutorial; I don't think I would have ever learned to program without it. It literally changed my life.
The book is copyrighted by Ed Weissman, so although it's possible "Ed" is short for "Edna," the higher probability can be assigned to "his" in this case.
I'm speechless. Honestly, the content is so expansive and valuable. I love the internet, hackernews, physics, code, knowledge, and the tiny things in between :)
This... is cool, but also kinda sucks for me. I've invested dozens of hours into writing an extremely complicated scraper for my Android version of HN.
The newest version (still under development, probably a month or two from release) adds support for displaying polls, linking to subthreads, and full write support (voting, commenting, submitting, etc). I'm fine with switching to a new API (Square's Retrofit will make it super easy to switch), but without submitting, commenting, and upvote support I have to disable a bunch of features I worked really hard on. Also it would've been cool to know this was coming about 3 months ago so I didn't waste my time.
Anyways, quick question on how it works -- when I query for the list of top stories
I'm sorry you just invested a lot of time in scraping. I know from experience what a pain that is. We said several times that the API was coming, and I've made it clear to anyone who asked, but there's just no way to reach everybody. All: in the future, please get answers to questions like this by emailing hn@ycombinator.com.
Re write access and logged-in access, if that turns out to be how people want to use the API, that's the direction we'll go. But we think it's important to launch an initial release and develop it based on feedback. There are many other use cases for this data besides building a full-featured client: analyzing history, providing notifications, and so on. It will be fascinating to see what people build!
I'm not blaming you. It just feels bad, you know? I'll definitely email you in the future about stuff like this. And don't get me wrong, it will be great to be able to throw out the cruft that comes along with parsing the current layout. The app is engineered to be able to drop in a new API pretty quick since I thought something like this would happen eventually.
It would help me out a lot if the current front end would live on under oldnews.ycombinator.com like that until the new API has write access, though. I think it's pretty cool to be able to be reading an article somewhere else, click "Share" in Android and have "Submit to HN" pop up in the results.
I second that request. Having a subdomain point to the current layout for a little longer is definitely going to help the transition, especially for write access and platforms without Firebase SDKs.
> This... is cool, but also kinda sucks for me. I've invested dozens of hours into writing an extremely complicated scraper for my Android version of HN.
This definitely does suck. I feel your pain. But it's also part of the package of scraping websites. You go in knowing that it could break at any time.
Oh, I'm well aware. I've had to push many quick fixes when some field gets renamed, etc. It's really not the API change that bothers me, more the lack of features. But hopefully they can add those things soon and I can re-enable them down the road.
Yes. With HTTP pipelining you can request them all over a single TCP connection using a single SSL session, but you will still need to make an HTTP request for each item you want.
If you're on a supported platform, the Firebase SDKs handle all this efficiently and can even provide real-time change notifications.
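For example, a minimal sketch with the Firebase JS SDK of the time (item 1, the first HN submission, used as a stand-in id):

  // One persistent connection; the SDK re-fires the callback when the item changes.
  var ref = new Firebase("https://hacker-news.firebaseio.com/v0/item/1");
  ref.on("value", function (snapshot) {
    var item = snapshot.val();
    console.log(item.title, "-", item.score, "points");
  });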
I'm trying to attach a ChildEventListener to the "item" Firebase and I'm getting a "permission denied" error. My guess is that I am doing something wrong, but on the off chance that adding event listeners is not (yet) enabled, it would be nice to know. Any clues to what I might be doing wrong?
I've never used the Firebase API itself before. It's very clean!
Edit: I reached the same (now obvious) conclusion as mentioned in the reply below. Now my quick hack is working perfectly. Thank you so much for this!
[Firebase Dev Advocate] Glad you're enjoying Firebase! Attaching a listener to the "items" Firebase is disabled. This is because it would send every item from HN to your computer. You'll need to attach a listener to the individual item instead. The "permission denied" error is coming from the security rules on the HN Firebase (https://www.firebase.com/docs/security/quickstart.html). If you're trying to find out what the latest updates are, they're kept in the /updates node (https://github.com/HackerNews/API#changed-items-and-profiles).
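A sketch of that pattern (2014-era Firebase JS SDK assumed; changed-item ids live under /v0/updates/items per the linked docs):

  // Watch /v0/updates/items for ids of changed items, then fetch just those,
  // instead of listening on the whole "item" node (which is disabled).
  var updates = new Firebase("https://hacker-news.firebaseio.com/v0/updates/items");
  updates.on("value", function (snap) {
    (snap.val() || []).forEach(function (id) {
      new Firebase("https://hacker-news.firebaseio.com/v0/item/" + id)
        .once("value", function (s) {
          var item = s.val();
          if (item) console.log("changed:", item.id);
        });
    });
  });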
I'm also currently writing a scraper[1] for the HN frontpage (for my WIP Hacker News redesign), and while there's a limited Algolia API available, it doesn't do much good if users can't post comments, upvote etc. Same goes for the official one now.
So, @anyone involved with the API project, can you give us an estimate of when the OAuth-based user-specific API will be rolled out? I'm fine with pausing my efforts until then, if it's going to be soon, in order to take a less complex and error-prone path.
[Firebase Dev Advocate]
@airlocksoftware - Yes, you should make separate requests for each story. You can attach a listener to the topstories node (https://www.firebase.com/docs/web/guide/retrieving-data.html...) and when that’s triggered, you can make a request for the data on each story. Using the Firebase SDK, each request will get made using the same connection. I'd recommend using our SDK instead of the REST API so you don't have to worry about managing your own connections and retries.
Here's an example showing all topstories and updating in realtime. Obviously, in JS, but the other Firebase SDKs are similar: http://jsfiddle.net/firebase/96voj1xh/
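In outline, the pattern looks something like this (a sketch, not the fiddle's exact code):

  var root = new Firebase("https://hacker-news.firebaseio.com/v0");
  root.child("topstories").on("value", function (snap) {
    // One listener for the id list; individual item reads reuse the same connection.
    snap.val().slice(0, 10).forEach(function (id) {
      root.child("item/" + id).once("value", function (s) {
        console.log(s.val().score, s.val().title);
      });
    });
  });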
Just wanted to drop a comment on the awesomeness of your app.
Hacker News 2 is by far the best Hacker News app, not just on Android, but on all mobile platforms I've tried (so: iOS, Android and Windows Phone).
Awesome work you are doing.
Yeah, I like it a lot, but I've put tons of time into my scraper for Reader YC (https://github.com/krruzic/Reader-YC). I support everything but polls currently. This API is nice, but my scraper actually supports more... No option to get Show HN, Ask HN or New, afaik. Still glad this is out!
Exactly. Is this really the case, or is it just not documented? I've sent an email to api@ycombinator.com about that and hopefully I will be able to shed some light on this later. I will write as soon as I get a response (assuming someone responds).
Oh, thanks for the reminder. I fixed it a few days ago and did a staged rollout but forgot to push it to everyone. I've done that so it should update for you soon.
Ha! A) Because I love Double-Doubles. B) Because it's more than 2 years old (before there really was a hamburger icon on Android). It's completely redesigned in the next version, though.
So why, in the first place, would I want another mobile app rather than just opening the fully functional website (which is pretty simple & basic already) on my mobile browser?
Because it can be better designed, use common design / navigation patterns of your mobile OS, notify you when you get a reply, change the text size, change the theme, have richer animations, and allow you to automatically share content from other applications directly to HN?
Yeah. I'm just a bit averse to apps scraping data for the reasons you mentioned. It should have been the work of the mobile website, not an app. Speaking purely from the user's point of view (not the developer's - I realize this is a community full of app developers) - one can't just keep installing apps for every website that is not mobile efficient yet. You all must have seen a lot of websites showing messages like "Welcome, we have an app, press OK to install it, or Cancel to continue". Most of those websites don't do anything that a mobile website couldn't.
I agree with you in theory, but most mobile websites are poorly thought out and implemented - if at all. I definitely don't download apps for every site I use, but for the ones I use daily, I generally find I need to. Native OS interactions seem to be difficult to get right in the browser.
HN is definitely an example of a site that isn't ideal in a mobile browser. For instance, if you have the ability to downvote, it's incredibly easy to mistakenly downvote when you mean to upvote because of how close and small the buttons are. There's other added functionality, like tracking who I've upvoted / downvoted in the past as well as tracking un/read comments when returning to a thread. In the browser, I use a Chrome extension for this, and on my phone, I use airlocksoftware's app. (Side note: I wish said state carried between the extension and the app.)
The developers of HN are surely capable of creating a mobile website that could work just as well, or even better than a mobile app. But currently, it's not ideal. And for that reason, I completely appreciate airlocksoftware's (and the devs of other HN apps) for their efforts.
Ideally, those in charge of HN would have simply employed someone - they had offers starting at free, I gather from previous threads - to make HN mobile friendly. Failing that, apps are just patching the original site. I too decry this form of progress (replacing web access with apps that only fix the borked site), but it's not hard to see why people should want that.
I've been working on a Hacker News client for Windows Phone over the past several weeks and am very close to an initial release, so I feel somewhat ambivalent about this.
On the one hand, of course it's great that HN is finally getting a proper API and also modernizing its markup (which is a mess even if you ignore all the tables – for example, the first paragraph in a comment usually isn't wrapped in <p> tags), but on the other hand this current v0 version is very lacking and impractical for a regular client application.
Since the top stories (limited to 100) and child comments are only available as a list of IDs a client app would have to make a separate HTTP request for every single item, which is obviously not something you'd want to do especially in a mobile environment. Other lists apart from the top stories (new, show, ask, best, active etc.) don't seem to be available at all right now.
Of course this is just the first version, and the documentation promises improvements over time – which I don't doubt at all – but there's no clear indication that the API will be at feature-parity with the current website, even excluding anything that requires authentication, by October 28. So this means that I – and other developers of client apps or unofficial APIs – will probably have to write new scraping code once the new rendering engine (which I assume refers to the website) arrives instead of being able to switch to the new API immediately.
Now I guess I might just be needlessly worried, especially since the blog post explicitly says that the new API "should hopefully make switching your apps fairly painless", but then why not wait until it's actually ready for that before making the announcement? Putting a half-baked API out there a few days/weeks (?) in advance before it's fully fleshed out doesn't seem all that helpful, at least to me.
Use the Firebase libraries rather than the REST one to efficiently handle requests. I believe it uses a websocket internally. "It does all the work for you and is awesome." to quote Nick.
[Firebase founder] There is a Firebase C# SDK on the way. We've had some other things that we've been working on for the last year, shipping in the next few weeks, which have taken priority. After that, we'll be shifting focus to new SDKs (they're a little complicated and take a bit of time to build).
I'm really glad to hear this. I have been loving Firebase for the app I'm making but one of the components has to run in C# and talking to Firebase from the C# app was much more painful than the other Firebase portions of the application.
This would be awesome. Right now I'm experimenting using raw HTTP requests and Newtonsoft.Json as the JSON [de]serializer. I presume that you will make your C# SDK a portable class library so we can use it in iOS and Android apps as well via Xamarin?
"C# SDK" doesn't say much. SilverLight can use C#, WinRT can use C#, .NET can use C#, but that doesn't mean your "C# library" will run everywhere. C# is just the syntax.
Are you referring to the WebSocket API that the Firebase SDKs use internally? It doesn't seem to be documented anywhere so I guess it's only slightly better than scraping HTML ;)
Thanks for the tip, I actually just figured that out myself a few minutes ago. Should be good enough until a proper SDK arrives.
With access to a Firebase SDK the only major additions the API needs to become a viable replacement for existing read-only client apps would be support for all the other lists apart from top stories (new, show, show new, ask, jobs, best, active) and more than 100 items for each. For apps that need write access I'd suggest keeping the current website on a separate subdomain until that is implemented into the API.
Didn't they remove support for calling COM APIs (i.e. ActiveX) in Chakra? (At least in IE 10 and later, I think - the versions with proper EcmaScript 5 support.)
What you're getting there is how the comment text happens to be stored (and presumably always has been). We've talked about changing that, because it would allow us to do some implementation improvements like... well, I forget just now. Might have had to do with caching. Anyhow, if enough people want it, we'll bump up the priority.
I've tried your app, it's functional but obviously I wasn't satisfied with any of the existing options which is why I wrote my own ;) It should be coming out this week, just a few finishing touches now...
Would you be willing to beta test my application? It's pretty much finished already and doesn't have any of the issues you listed (in addition to having a, in my opinion, much nicer design), I'd just be looking for some general feedback regarding the discoverability of some features. I could submit a beta today or tomorrow.
Can't get it to work, e.g., for `hn.top_stories()` I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "hackernews.py", line 23, in top_stories
return r.json()
TypeError: 'list' object is not callable
Tried in both python 2.7.3 and python 3.2.3
EDIT: You need a relatively new version of requests for this to work (in releases before 1.0, r.json was a property rather than a method, so calling r.json() fails with "'list' object is not callable"). The version packaged with Debian Wheezy is too old. Use pip.
(As I find myself pondering the idea of standing something up like this on a dual-stacked server purely so that I could access HN from my IPv6-only test network... hmmm...)
dstaley, nice work. It would be nicer still if Firebase had SPDY and QUIC lit up. I don't think it would be a problem with Apache already in front of Varnish.
To everyone asking about logged-in access and write access: this is just a first release! Where it goes from here will depend, in good iterative fashion, on what people want.
I think part of the angst centers around the bit where the markup is going to change. Lots of apps for HN that scrape the UI rely upon the current site to enable things like voting and other signed-in-only actions, for which there is (currently) no first-class API way to do these things. Even if the endpoints for voting change, getting other information (i.e. which items you voted on, so the vote arrows hide) is still markup-dependent.
How does this differ from the Algolia HN API in terms of data access? (https://hn.algolia.com/api) I was able to download all HN data recently with ease using that endpoint. Authentication?
EDIT: After looking at the documentation there are two new aspects of the Firebase API not in the Algolia API:
1) Ability to see deleted/dead stories.
2) Endpoint for user data.
Question to kogir/dang: Has the "delay" field (Delay in minutes between a comment's creation and its visibility to other users) always been there?
I'm also curious whether it removes some of the limitations of the Algolia version; I wanted to download my content for some statistical analysis (notes at http://www.gwern.net/HN ) but I discovered that it seems there's some hard limits to how much of my data I can reach: https://github.com/algolia/hn-search/pull/36
If you want to get data for a single user through Algolia's API via the commandline, you could also use https://github.com/jaredsohn/hnuserdownload. It uses the same technique as minimaxir's code (a post of his was the inspiration.)
I don't know Python, so I'm not sure what your source code is doing. At a guess, you've hacked together some sort of repeated queries thing with a time-window?
"Every 2 minutes, the last 1000 HN items (stories, comments, polls) are sent to Algolia's indexing API. Items from the last 48 hours are refreshed every hour."
Correct, and we're planning to move to the new official API ASAP (instead of the legacy one; it was not a web crawler, but it was far from perfect) for the indexing.
Regarding the REST APIs, let's keep both for now :)
[Firebase founder here] This is pretty exciting for us; we're glad kogir, dang, kevin and sctb chose to expose HN's data through Firebase. We've seen quite a few startups (and big companies like Nest) do this, since building, maintaining, and documenting a public API often isn't an easy task.
This makes it really easy to add average karma to the comment section for every user. For instance, you can paste a snippet like the one below into the console, and it should add average karma data for each user.
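A hypothetical version of such a snippet (modern-browser JS; the user-link selector and the karma-per-submission math are my assumptions, not the original code):

  // For each user link on the page, fetch the profile and append average karma.
  document.querySelectorAll("a[href^='user?id=']").forEach(function (a) {
    fetch("https://hacker-news.firebaseio.com/v0/user/" + a.textContent + ".json")
      .then(function (r) { return r.json(); })
      .then(function (u) {
        var n = (u.submitted || []).length;
        a.insertAdjacentText("afterend",
          n ? " [avg " + (u.karma / n).toFixed(2) + "]" : "");
      });
  });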
Here is something I built with the Algolia API a while back and just haven't gotten around to cleaning up to post here.
It lets you download all comments/stories for a user as a JSON or CSV file, breaks down karma between comments and stories, and plots comment/story counts, karma, etc. over time on a line chart (clicking will show you the details via an hnsearch).
Also I built some npm modules so you can get this information via the commandline.
I really appreciate the 3 week heads up before moving to a new frontend structure. It's a nice gesture, but I have this horrible feeling that there's only about a 10% chance that my Hacker News app gets updated in time.
I know you can't not iterate because people are scraping, but it does stink. At least this will make everything more future-proof going forward.
However, it may be nice to give a bit more heads up than 3 weeks. I know a lot of apps can take ~2 weeks to get through the review process for iOS.
Some sites actively punish scrapers by constantly, purposely changing their markup. So giving them an API and 3 week head start is leaps and bounds above and beyond what can be expected. When you operate a scraper, you are always on the defensive when it comes to site updates suddenly breaking your app. They should be so lucky to get this 3 week notice. It is all on them if they can't turn it around in time.
As someone who poured countless hours into meticulously scraping the HN markup and faces the prospect of having to port all my code with dread, I'll probably be pleading for an extension alongside you.
A preview release / staging version would help those of us with scrapers update it, without having a so much downtime / scramble when it's finally released.
Maybe they can put the new renderer on a different sub domain for a few weeks after their 3 week deadline. One to beta test the interface and two to give devs a bit more time to convert.
We'll probably start with serving the logged-out home page to a percentage of traffic and work our way up to updating the rest of the site and handling the full traffic over a few weeks.
If people need help converting their apps, I believe the Firebase guys have offered to help. Contact us at api@ycombinator.com and we'll connect you.
Love the staged roll out idea. You could probably also use user agents to recognize scrapers from real users (could also be helpful for slowly rolling out to older browsers).
I've built a library for iOS (https://github.com/bennyguitar/libHN) that handles scraping, commenting, submitting, voting, etc pretty well and allows me to make as few web calls as necessary to use HN. It looks like I'd have to drop functionality and completely change the networking scheme to match this API - something I'm not willing to do yet.
Correct me if I'm wrong here, but to get every comment on a post, I'd have to recursively get each item for each child. Instead, right now, I can make one network request and get all comments for a story. Granted, I have to parse the HTML (which I hate), but it's a much cleaner solution than going through every item, checking the children and then getting those items ad infinitum. Again, I just glanced over the documentation, but that seems untenable to me.
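To make the cost concrete, here is a minimal sketch of that recursive walk in plain JS (fetch and async/await assumed; not libHN code) -- every comment in the tree costs one request:

  var API = "https://hacker-news.firebaseio.com/v0";

  async function countComments(id) {
    var item = await (await fetch(API + "/item/" + id + ".json")).json();
    var kids = (item && item.kids) || [];                // deleted items come back null
    var below = await Promise.all(kids.map(function (k) { return countComments(k); }));
    return kids.length + below.reduce(function (a, b) { return a + b; }, 0);
  }

  // Pass any story id; item 1 (the first HN submission) as a stand-in.
  countComments(1).then(function (n) { console.log(n + " comments"); });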
"Most importantly, the reason we released an API is so that we can start modernizing the markup on Hacker News. Because there are a lot of apps and projects out there that rely on scraping the site to access the data inside it, we decided it would be best to release a proper API and give everyone time to convert their code before we launch any new HTML."
Yeah, the only problem is that I don't want to cut major functionality just to use their API. Things I do, that it doesn't look like the API handles:
- More than 100 stories
- Best, Top, Ask, ShowHN, Jobs, User Submission Posts
- User Management (logging in/out)
- Commenting
- Submitting
- Voting
My app doesn't just function as a reader, which is what this API seems geared towards with the v0 release, it functions as about as full-fledged of an HN client as you could get. There's a couple things that I haven't built in yet like changing your about me text, but those were on the roadmap.
I'm actually thinking about storing the configuration of how my app scrapes online such that if the HTML markup changes, I won't have to push huge sweeping changes to the App Store to get my app online again. I just deploy to Heroku and the app will handle that configuration and scrape correctly sans pushing to Apple.
I welcome the idea, but this barely qualifies as an API. The most useful part is the "current top stories" - but what timeframe exactly? Seems to be over 3 days at least and can't be customized. And even my test parsing of the 100 top stories took a good minute.
And that returns only the ids, nothing else. To get basic information like the score, title or url you have to lookup the ids individually. And even the story items do not contain such basic information as the number of comments. And you can't calculate it yourself since only the top comments are even returned (as ids of course). So you'll have to recursively dig through the comments to get the number.
This is even more curious as there is a very solid Algolia API where you can filter for submission time, story score, number of comments and even return a greater number of results + access page numbers to get even more.
To get the information of a single Algolia API call you would need hundreds or thousands (in the case of nested comments) of "official" API calls. Hoping for updates.
If up/down vote data were included in the API, much needed experimentation on collaborative filtering would be made possible! This is Hacker News after all.
Right now one team, Y Combinator, is trying to fix important issues in the ranking and moderation of posts and comments. Many of us are frustrated by the increasing domination of popularity (and hatred) over quality and relevance. A lot of good submissions and comments are simply buried, never to be found. There is too much muck to have to wade through. The timing of posts and comments plays a much larger role than quality. I could go on and on.
Imagine a Netflix Prize-like flowering of experiments and collaboration, leveraging the hacker community's collective smarts and enthusiasm. Many of us have ideas, but right now are unable to test them. What a shame if a great idea dies on a notepad.
There are two possible issues with opening up voting data: gaming and privacy. If having vote data allows someone to game the front page, then only include it with some delay (2 days?) so that it couldn't be used to game the front page. This will still allow experimentation with collaborative filtering algorithms and the like.
My take on the privacy issue is that anonymity isn’t that important for a site like Hacker News:
1. Startup culture is about straight talk, putting your money where your mouth is, and open critical feedback, both in the giving and receiving. There are precedents for exposing voting data (e.g. Quora, Facebook, Stack Exchange).
2. HN is not aimed at political discussions or other topics where anonymity can be paramount.
3. Pseudonymity is sufficient for those who don’t want their votes and comments tied back to their actual identity.
Thoughts?
I would love to hear from others who yearn to experiment with alternate algorithms and strategies for improving Hacker News.
There are many legitimate views on this, but FWIW mine differs from yours. I believe that anonymity actually is important for a site like Hacker News, and the odds of us ever publishing the vote data—even pseudo-anonymized—are small. Sorry to disappoint.
Daniel, I understand. Do you or any others at Y Combinator have any thoughts on how the hacker community could experiment in the areas I mention above, or whether you guys even think such experimentation would be valuable?
I built a scraper around 3 years ago (been through a few usernames since then), and I've had to change it once 3 months ago because the HTML output added quotes around HTML attributes.
Even though it's read only, I'll continue to use my scraper rather than the API, simply because it's one request; the API would require one request for the top IDs and then one call per story, so it would be 31 calls instead of just 1.
Unless I'm missing something, it seems fairly poorly designed for top stories, and non-existent for new stories.
------
EDIT: Looks like I missed the text about updating to a new rendering system in 3 weeks' time, to iterate designs faster and allow mobile-friendly theming. Looks like I WILL be updating to use the API.
Yeah, I have the same problem here... and I have basically the same question as someone mentioned below: new stories through the API? Do we have to get the max-id and then get everything below it and check if it's a story? Any other ideas?
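For what it's worth, a sketch of that max-id walk (plain JS with fetch; slow by construction, since it's one request per id):

  var API = "https://hacker-news.firebaseio.com/v0";

  async function recentStories(want) {
    var maxId = await (await fetch(API + "/maxitem.json")).json();
    var stories = [];
    for (var id = maxId; id > 0 && stories.length < want; id--) {
      var item = await (await fetch(API + "/item/" + id + ".json")).json();
      if (item && item.type === "story" && !item.deleted) stories.push(item);
    }
    return stories;
  }

  recentStories(10).then(function (s) {
    s.forEach(function (x) { console.log(x.title); });
  });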
Yay! I've been wanting something like this to come out. I've been playing around with some new tech stacks and built a CSS reskin of Hacker News, but always wanted an actual API to make it easier.
There are a bunch of CSS themes that come out for Hacker News, but I couldn't find anything that aggregates them. This will make it a lot easier to extend and customize the site.
I'm not seeing any APIs for the jobs or show sections, though? Hopefully this might come in the future?
Well, personally, I didn't want to install any browser add-ons. I also had some other ideas, like aggregating Reddit and Hacker News posts, but I would need to scrape for that (unless there's an external API I'm missing).
The Firebase JavaScript library makes this impressively straightforward to use. I built a clone using React.js and Firebase's library. Because v0 of the API requires a request for each news story, it's not possible to use Firebase's React mixin yet.
Here's another React version which also does comments (and allows you to fold them - I needed to write a userscript for that before!). It's using react-router to switch between top stories, comments, individual comment and user profile pages.
I've just gone for it with Firebase's React mixin, binding everything as an object, since their devs in this thread don't seem concerned about rate limiting. The mixin seems to throw an error every time I try to unbind, which I'm just catching and logging for now.
Edit: I just watched this comment pop up live in my version - pretty neat :)
I'm definitely excited about the API and the future possibilities with it. Looks like a great start. I do have a few questions and suggestions, though.
Is there any chance of getting more than just the top 100 stories returned? I think it will be a lot more useful for api consumers if you can use a query parameter to set the limit (within reason, usually 1,000) and a number of results to skip. For now, scraping is still more desirable to me since I can retrieve any number of results in their current order.
Better yet, but more complex: a number to skip and a certain timestamp so I don't see the same article on two pages due to natural upvoting, downvoting, or rank decay.
Also, if there's any flexibility still with property names, I'd suggest these changes for clearer semantics:
"deleted" -> "hidden" (since they're obviously not deleted)
"by" -> "author" (for more clarity)
"kids" -> "children" (the common convention)
Please do allow other sites to use HN logins. Then the community could develop useful sister services.
For example, a site where HN members can upvote and rate different development tools, libraries, IDEs, management tools, etc. All with backlinks to HN discussions. It's a great community and there are many ways we could share knowledge and experience.
I'd rather use the REST API directly; what I need is rather simple, and not having to download, install and maintain an SDK is more appealing. (My app was developed a while ago and was doing HTML scraping, but the 30-second limit on HN killed it because of testing -- I don't need to query more often than twice per minute, but while testing I ran the thing a little too often.)
So, what are the limits on the REST API, and how do limits work? (A max number of requests per hour would be better than per minute for example).
Here's a simple example that displays the top story and votes using the Firebase JS SDK (and updates in realtime): http://jsfiddle.net/firebase/cm8ne9nh/
Suggestions for improving the API, to make it more valuable for data mining and analytics. This assumes more historical data is available.
1. Provide a way to bulk download the data (that is, a click instead of scraping the API)
2. Add a field for the maximum position a story reached on the front page
3. Add the numerical score for the comment (at least on comments that are N days old, which won't interfere with the reason to hide the scores on the main comments)
Some other changes that would be awesome (but are less realistic) include:
4. A historical event log of votes (even better would be relating those votes back to users, but I imagine that's not going to happen for privacy reasons. An intermediate possibility would be a vote log connected back to anonymized user ids, assuming the anonymous id -> real user id mapping is difficult)
5. A historical event log of display position changes for stories & comments
6. An event log of pageviews with as much metadata as possible to release without infringing on privacy
Would it be possible to cache the number of comments a story has? Or am I wrong in my understanding that the only way to find the number of comments a story has is to walk the tree of child items and maintain my own count?
So, does it mean I can get the top stories, and then get a top story item with all the comments expanded? I mean, at first it looks like it just sends me the IDs and I need to fetch the details for each of them. Again, this is just looking at the REST API, not the iOS SDK for example.
I'll need to "convert" SwiftHN (https://github.com/Dimillian/SwiftHN) either to this new API or adapt my scraping engine to the new site layout.
You're using the app, or the scraping engine (Hacker Swifter)?
The nice thing is that HackerSwifter public API is already in a quite finished state for the available functions, even if I switch to the API, the method calls will be the same.
I am actually using the app because I am doing a mashup of HackerNews and another piece of software. If I decide to end up changing the UI a lot (which looks like it is trending towards), I was thinking of switching to the scraping engine. It's awesome that I won't have to change a lot if you do switch to the API. I'm following on GitHub, keep up the great work!
I'm going to start diving into the API to build a simple, powerful "Google Alerts for HN" app on Assembly, and I'd love help from anyone who's interested: https://assembly.com/hn-monitor
There are some products like this out there, but they had to rely on scrapers and the HNSearch API, so I've always found them to be spotty. I think we can make something better.
My only problem with the API is that you need to do an awful lot of Ajax calls just to get something out of it. The topstories endpoint just gives you an array of IDs and then you need to do one Ajax call for each ID to get the story.
Oh well, the site is a work in progress. Not done yet.
I spent about 20 minutes throwing this together, so it's VERY rough, but maybe it will turn into something useful. I'm primarily a java programmer but I've been wanting to teach myself more ruby, so here it goes.
For my twitter bot https://twitter.com/hn_bot_top1 I use http://api.ihackernews.com/ at the moment. This works but the site (and the API) is quite often not available. So I'll probably switch to the new API as soon as I have some spare time...
Smart move using Firebase. This instantly gives developers traction with the API, and as a big API client guy, I love making clients but am also very happy we'll have all the tools we need to get started using the API right off the bat. Considering Firebase doesn't have rate limiting either, the things people can build with this API are limitless.
I am excited about this because now I can finally build a way to query my own posts and comments. I oftentimes come across products/services/cool hacks on here that I vaguely remember, but cannot always locate. Using built-in search is kludgy, and I'd like to be able to do something more complex. Thanks for doing this!
It's interesting that you mention the phenomenon of "I remember seeing it (or maybe even saying it) but I cannot find it". This has happened to me a lot. It is intensely frustrating to me because it's blocking - I don't get much done until I re-find whatever it was.
I wonder how many other people have this? And what their techniques are?
I have the same problem, and I think the best solution would be to index my web browser search history and be able to search through that index. However that would probably generate a lot of junk. My current solution is to bookmark anything that's vaguely interesting. And when I remember something, I search through the bookmarks. I never look at them, they're just for keeping the stack of stuff to search through.
I am using the same approach too. However, most of the old stories I had liked on HN were not bookmarked, so I had to struggle writing a scraper for them.
It would be very cool to be able to authenticate and grab all my saved stories and the comments for archiving and fun!
I use your app, it's awesome! Also agree with your request, as I said elsewhere in this thread...
By the way, animations (the slide between list and comments) have been jerky since a recent update. I'm using the Android app on an HTC One M7 with the latest stock ROM (4.4.3 and Sense 6.0). A few other people have mentioned this in Play Store reviews. Is this a known issue?
If someone would be interested in contributing to an OSS project to build an iOS HN client, please have a look at https://github.com/bonzoq/hniosreader.
Does anyone have a good solution on how to get around making a separate request for each Item? Is there somewhere we can pass an array of Item Ids? Is this planned for the future?
Thanks HN!
EDIT: I just read the post about using the Firebase SDK to do this efficiently.
Nice! I'll be changing my Mac system tray app HackerBar over to use this instead of some Objective-C scraping magic that it uses now: http://hackerbarapp.com/
Great news, I've avoided touching my iPad reader because of the whole scraping issue to get at some of the data. Now I can justify updating for iOS8 + the new API. (just hope it gets approved in time)
Perhaps keep the old site still running on a different URL so scrapers who can't get their act together in 3 weeks can just change the URL they're scraping from.
Disappointing that they are still going to use Arc even with the update to modern HTML. I despise how, with Arc, if I leave the page open during the day and try to click a link later, the link has expired.
They should really update to a modern web framework at the same time. Big modern frameworks like Rails are making ridiculously awesome improvements, like replacing page loads with XHR (quicker loading since JS/styles/etc. are all loaded already, no screen flash, etc.) in a progressive enhancement manner.
So Arc can't even generate fully functional links, let alone keep up with modern web advancements.
What is it about Arc that causes the expired-link issue? I know very little about Arc, but I understand it to be a language, whereas the issue you describe appears to be the function of some implementation decision. Can you please elaborate?
(I'll take a crack at explaining this momentarily...)
Good grief, that took a long time. Here you go: http://pastebin.com/bSW5dfRQ [1]. I'd better stop neglecting my duties now!
Edit: One thing I forgot to put in there: one reason the closure technique is powerful is that you're leveraging the programming language and runtime to do most of the book-keeping for you. Whatever data is handy, you just reference. The system will remember all the references. That's why using things like query strings and hidden form fields is more complicated: you have to handle all those details yourself (not to mention serialize and deserialize them if you're passing through any other format than what your program keeps in memory). That is tedious, and when your app has many kinds of request, the complexity quickly piles up.
Of course there are other abstractions you can build over this, but closures are an elegant one—especially in cases where programming simplicity is more important than scalability, which is most cases.
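To illustrate the technique, here's a toy sketch in JS (not Arc's actual implementation): each generated link registers a server-side closure under a fresh id, the href only names that id, and pruning the handler table is exactly what makes old links "expire".

  var handlers = {};
  var nextId = 0;

  function linkTo(fn) {
    var id = "fn" + (nextId++);
    handlers[id] = fn;        // the closure captures whatever state it needs
    return "/x?fnid=" + id;   // no state serialized into the URL
  }

  function handleRequest(fnid) {
    var fn = handlers[fnid];
    return fn ? fn() : "Unknown or expired link."; // the familiar error
  }

  // State lives in the closure, not in query strings or hidden form fields.
  var page = 2;
  var url = linkTo(function () { return "rendering page " + page; });
  console.log(handleRequest(url.split("=")[1])); // -> "rendering page 2"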
You can't simply decode each character without losing information. For example, &lt; means a literal < character to be shown on the page, as opposed to a < in the stream, which starts an HTML tag.
If you're just planning on displaying the text in a browser, no decoding is needed. If you want to parse the text to do some sort of textual analysis, an HTML parser library might be best.
I understand what you're talking about re: < and '<' -- the json -looks- page (terminal in my case) displayable, barring the &#xhhhh; encoding. cURL has facilities for decoding %20 (for example), but not what we're getting back w/ this json.
You've given me an idea though, so back to vi for me.
Yes, but someone wanting to get the HN posts about a certain topic from the API will need to keep polling the API to get all the new submissions, and filter out the content irrelevant to her purpose. Search directly from the API would be so much more convenient.
Why is so much preparation necessary to redesign like 3 simple templates? Shouldn't it take ~10 hours for one person to do this? I'm talking about the front end.
Why wasn't hacker news optimized for mobile a long time ago?
As soon as someone uses this API to create a replica of the site where everything else is the same, but the design is responsive, I want to know about it :)
UNIX time is actually much easier to parse and more accurate. Almost all platforms (even Windows[0]) have a way to convert a UNIX time into a culture specific local time.
What you're asking for is a string you have to parse. That's a lot more work and there is a lot more that can go wrong.
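For example, in JavaScript the conversion is a one-liner (the API's time fields are Unix seconds while Date wants milliseconds; the timestamp below is made up):

  var t = 1412899200;                   // Unix seconds, as the API returns them
  var local = new Date(t * 1000);       // Date expects milliseconds
  console.log(local.toLocaleString());  // rendered in the user's locale/timezone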
How is it more accurate? ISO-8601 allows for leap seconds, it uses the same second counting frequency as the Unix timestamp, it can represent dates before 1970 and after 2038 (64-bit Unix timestamps will also.) and it's an international standard.
There are very few good reasons to choose a Unix timestamp to represent a date when compared with ISO-8601.
When UNIX time is converted into the local culture, leap seconds are added (along with time zone, daylight saving time, et al); defining dates before 1970 is irrelevant (and the claim is also untrue, since UNIX times can be negative); and as systems are moving rapidly to 64 bit, the 2038 deadline will be irrelevant (e.g. my Chrome AND their server software are both 64 bit already).
> There are very few good reasons to choose a Unix timestamp to represent a date when compared with ISO-8601.
var time = new Date("<ISO-8601 string>") // in Javascript
time = DateTime.iso8601("<ISO-8601 string>") # in Ruby
Aaaand it's human readable! Aaaaand it works before the Unix epoch!
EDIT:
Changed DateTime.parse to DateTime.iso8601, to be even more retentive.
Look, parent claimed that parsing ISO strings is hard (it's not, especially if you're consuming a web API in a modern web language) and that it was more readable (which is so clearly wrong I have no words).
As for being more accurate, again no. The range is worse (lol wraparound if you're using a 32-bit int), there isn't explicit support for fractional seconds, doesn't map onto UTC cleanly, doesn't handle leap seconds, and so on.
It's only "easier" if you don't actually care about a human-readable timestamp that is robust and if you desire to do date parsing yourself instead of using any of the well-established libraries out there. Ugh.
> and that it was more readable (which is so clearly wrong I have no words).
I never claimed that. In fact I didn't address readability at all. So I have no words for your "no words" relating to a claim that literally didn't appear at all.
> The range is worse (lol wraparound if you're using a 32-bit int), there isn't explicit support for fractional seconds, doesn't map onto UTC cleanly, doesn't handle leap seconds, and so on.
Fortunately we're already well on our way into a 64 bit world, and aside from legacy systems it won't be a problem by 2038. According to the Steam hardware survey [0] over 80% of Windows machines, 100% of OS X machines, and 90% of Linux machines are already running a 64 bit OS.
Leap seconds can be handled during the cultural conversion.
> It's only "easier" if you don't actually care about a human-readable timestamp that is robust and if you desire to do date parsing yourself instead of using any of the well-established libraries out there. Ugh.
I don't care about human readable timestamps for an API used in automation. More robust is subjective, particularly as parsing it is more technically complex (particularly as most of the parsers support several different but similar DateTime formats).
Most well-established libraries support UNIX time natively or use it internally.
In Perl:
my $dt = DateTime::Format::ISO8601->parse_datetime( "2008-08-09T18:39:22Z" );
Interchange formats, say JSON blobs over a wire, should very clearly express what's in them, perhaps by using a very well-known standard which is human-readable, whenever possible. The fact that some languages haven't yet realized that this is an important-enough feature to put in their standard libraries compared with whatever esoteric academic shit they think is necessary (C++1x, for example!) is not the format's problem.
Hint: if you're consuming a web API, you are probably using one of the languages I gave examples for, or a very close relative. Just because your Haskell-on-M68k package doesn't know how to 8601 doesn't mean that using a nondescript number is a good idea.
Why, in the year of our Lord 2014, is this even a fucking question?
EDIT:
I'm sorry to be so mean in my language about this, but I've had to fight a lot of raging stupid with regards to storing timestamp data. I do not wish to see anyone else suffer unduly.
> It's not a slight advantage if you've ever had to parse through logfiles by hand or debug APIs with curl--it's a great deal more than that.
Funny you should say that: UNIX time works great for log files, since even a dumb tool can search for values in the range 1388491199-1420027199 to see all 2013-2014 lines.
Your way requires specific support for the date format, dealing with "2014" false positives, or searching through all 12 months individually (2013-01, 2013-02, 2013-03, etc).
You keep drumming on about library support, which is important for a format which isn't universally supported. Fortunately for UNIX time there's no value in listing them off one by one, you can just assume it is all of them...
UNIX timestamps are also listed in several international standards. Including POSIX.
[Firebase Dev Advocate]
@angersock - the "about" value you're seeing on the Firebase Dashboard isn't broken, it's just a truncated preview. The HN team is using email for issues, so you can send them any feedback at api@ycombinator.com.
The issue tracker on https://github.com/HackerNews/API appears to be disabled, though the summary says "Documentation, Samples, and Issue Tracking for the Official HN API."
Please consider opening up the issue tracker. Other people might benefit from seeing these issues, such as the issue about the comment count that I just raised as a comment here.
Slight nitpick: I agree with you (i.e. I don't think people should be complaining about the markup), but this issue is not that obvious when the "users" are the "programmers" complaining...
I always thought HN was more of a one man job. I didn't realize there is an entire team working on it (full time?). Honestly there doesn't seem to be much to work on here. It's a simple site that rarely changes (at least on the front end).
I believe the whole idea of having multiple people working on it permanently was only introduced six months to a year ago(?) And it seems like big changes are happening underneath us, mostly in terms of story quality: https://twitter.com/sama/status/519240112907894784
I hope not - what if that one man gets hit by a bus?
I always assumed HN was more of an operations challenge rather than a purely algorithmic one. Lotsa traffic, I assume. Not sure about the scope of user/comment moderation around here, but something tells me their item ranking system just might be that good.
And what about the backend? Last I heard it's all stored as files on a handful of machines, so it's not like it's some off the shelf database you can just keep throwing machines at to scale. Also it's written in lisp originally by one person, so have fun figuring out how it all works and ramping up to it quickly.
There's also certainly a big support overhead dealing with the moderation of stories and comments.
I'm not saying it should take hundreds of dedicated engineers to run the site, but I think it's silly to look at it and think you could run it in your spare time because it's just a list of stories.
I count 8,483 submissions. I'm sure there's something interesting to be done with all of this data. A word frequency chart?
---
Edit: So apparently there's a ruby gem that lets you feed it a body of text and generates pseudo-random phrases based on that text.
I present to you the patio11 impersonator: https://gist.github.com/christiangenco/e8d085e47479be0131e1
One of my favorites: