Ask HN: Please review our startup Euraeka.com

aristus · on June 23, 2009

Where do you get this data on millions of users?

The name is impossible to spell.

The misleading, engaging, etc filters are interesting, but kind of hand-wavy. It's not apparent why and how "Miami Heat's Dwyane Wade sues ex-business partner for libel" or "The SEVEN SECRETS of SMART PARENTS" are "misleading" (and compared to what?).

You need an information/interaction designer to go over the site and make the important things important. Right now nothing really catches the eye.

on /faq: "ulterior motives", not "alterior"

Good luck!

haidut · on June 23, 2009

We've been collecting top-rated news from Yahoo News, NYT, Washington Post, etc. for the last 3 years. Almost all major sources have sections like "most viewed", "most read", "most emailed", etc. As we collected the top-ranked news, we also kept track of topics that were top ranked at multiple sources. For instance, if a news article on Iran's riots is top ranked at both NYT and Yahoo News, it gets more points in our training set.
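A minimal sketch of the cross-source idea described above, assuming each site's "most viewed" list has already been reduced to topic labels. The function name, the source keys, and the one-point-per-source rule are assumptions for illustration, not Euraeka's actual code:

```python
from collections import defaultdict


def cross_source_points(top_lists):
    """Give each topic one point per distinct source that ranks it as top news."""
    sources_by_topic = defaultdict(set)
    for source, topics in top_lists.items():
        for topic in topics:
            # Normalize case so the same topic matches across sources.
            sources_by_topic[topic.lower()].add(source)
    return {topic: len(sources) for topic, sources in sources_by_topic.items()}


# Hypothetical input: topic labels taken from each site's "most viewed" box.
top_lists = {
    "nyt": ["Iran riots", "Recession estimates"],
    "yahoo_news": ["Iran riots", "Celebrity gossip"],
    "washington_post": ["Recession estimates", "Iran riots"],
}

points = cross_source_points(top_lists)
for topic, score in sorted(points.items(), key=lambda kv: kv[1], reverse=True):
    print(topic, score)
```

A topic that appears on several sites ends up with more points than one that appears on a single site, which is the weighting the comment describes.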

As for your other point, "misleading" is really a less legally loaded word than "deceptive". You are right, it's not very clear why the articles are misleading, but the bottom line is that the language used in the article has a higher rate of "deception markers" than other articles from that day. So when you sort by Misleading, it's not that the article is deceptive beyond a doubt; these are just the ones with a higher rate of deception markers (i.e. content and structural indicators associated with deception). The science behind it is pretty solid and comes from forensic psychiatry, i.e. interviews/interrogations with criminals and analysis of their statements for hints of deception. So, for lack of a better term, Euraeka essentially implements a linguistic polygraph. Thanks for the other comments, we'll work on fixing the issues.
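To make "higher rate of deception markers" concrete, here is a rough sketch that ranks a day's articles by the fraction of their words that fall in a marker list. The marker words below are placeholders invented for this example (the thread does not list the real indicators), and the function names are hypothetical:

```python
import re

# Placeholder markers, invented for illustration only. They are NOT the
# forensic indicators mentioned in the thread, which are not listed there.
PLACEHOLDER_MARKERS = {"honestly", "never", "always", "secrets"}


def marker_rate(text, markers=PLACEHOLDER_MARKERS):
    """Return the fraction of words in the text that are marker words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in markers)
    return hits / len(words)


def sort_by_misleading(articles):
    """Order articles from highest to lowest marker rate (relative, not absolute)."""
    return sorted(articles, key=lambda a: marker_rate(a["text"]), reverse=True)


# Hypothetical articles from the same day.
articles = [
    {"title": "A", "text": "The council approved the budget on Tuesday."},
    {"title": "B", "text": "Honestly, these secrets always work and never fail."},
]

ranked = sort_by_misleading(articles)
print([a["title"] for a in ranked])
```

The ordering is only relative: an article is at the top because its rate is higher than the others from that day, not because it crossed some threshold of being deceptive.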

aristus · on June 23, 2009

Are you conflating/clustering articles? I.e., what constitutes the "same story" in your system?

Those "most viewed", etc boxes are often placed by editors, not by impartial algorithms. How do you control for that?

haidut · on June 23, 2009

Yes, we definitely have clustering. In fact, it's more extensive than simple word distances like cosine b/c we also take into account synonyms and word relationships (set membership, etc). For instance, in our system a sentence like "Tiger chases antelope down the river" is very closely "related" to the sentence "Lion is pursuing a buffalo by the lake" b/c both sentences essentially say that a large cat is pursuing a bovine prey near a water source. In terms of the "most viewed" boxes and how we control for that: like I said, we cross-track news on multiple news sites and weight the cross-posted stories more heavily. We also cross-validated the most important tags for an article using Google Trends data. Basically, we tracked topics on multiple sites and then performed some statistical analysis to see how those topics did over time based on their presence on the web (topic momentum and longevity). We also run a partial search engine in house that crawls a subset of the web so we can check that the numbers we get from Google/Yahoo are legit. Finally, there is linguistic theory on topic popularity and how memes propagate over time. We use some of that theory to control for the crowd effect, i.e. sometimes people pick up and spread topics that are of no real importance to the world. Example: Paris Hilton's latest escapades may be widely discussed online and appear to be important news, but the latest report on recession estimates and projections has a much higher "impact" on society. So we try to estimate some of that "impact". Combining all factors gives an article a composite score. No two articles really have the same score, but a lot of articles cluster close to each other in terms of their "importance" scores.
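As a rough illustration of the clustering point above (similarity over word relationships rather than raw words), here is a sketch that maps each word to a broader concept before taking the cosine. The tiny hand-written concept map and stopword list are assumptions for this example; a real system would presumably draw on a thesaurus or ontology:

```python
import math
import re
from collections import Counter

# Hypothetical hand-written map from words to broader concepts.
CONCEPTS = {
    "tiger": "big_cat", "lion": "big_cat",
    "chases": "pursue", "pursuing": "pursue",
    "antelope": "bovid_prey", "buffalo": "bovid_prey",
    "river": "water", "lake": "water",
}
STOPWORDS = {"a", "the", "is", "by", "down"}


def tokens(sentence):
    """Lowercase the sentence and drop stopwords."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def word_vector(sentence):
    """Bag of raw words."""
    return Counter(tokens(sentence))


def concept_vector(sentence):
    """Bag of concepts: each word is replaced by its concept when one is known."""
    return Counter(CONCEPTS.get(w, w) for w in tokens(sentence))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


s1 = "Tiger chases antelope down the river"
s2 = "Lion is pursuing a buffalo by the lake"

raw_sim = cosine(word_vector(s1), word_vector(s2))
concept_sim = cosine(concept_vector(s1), concept_vector(s2))
print(raw_sim, concept_sim)
```

On raw words the two sentences share nothing, so plain cosine sees them as unrelated; once words are mapped to concepts they line up, which is the effect described for the tiger and lion sentences.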
We fed the articles into a machine learning algorithm that is a combination of Support Vector Machine, Neural Network, and Naive Bayes, and when a new article is fetched by our crawler the model "predicts" its various scores (controversial, engaging, popular) based on the data set that it has already learned. Deception detection is much trickier and is almost entirely analysis based, i.e. no machine learning there. There is quite a bit of research on deception detection published online. Just search Google for "deception detection ext:pdf" and it will come back with a lot of results.

ujjwalg · on June 23, 2009

I think the concept is very intriguing and if what you say is what is on your site and you have actually made it by collecting all the information in the last 3 years from all the major news websites, I think you will end up being bought pretty soon. Rather, what you should do is patent your process asap, if you haven't done it already and then license it. Amazing and good luck.

haidut · on June 23, 2009

The patent application is in the works. As far as the data set goes, we definitely have it. In fact, we were thinking of releasing it under some type of open license (e.g. Creative Commons) after the site gets some traction. In terms of search engines: we ARE in fact a search engine. Just type something in the box at the top, and you can also use the four available scores to filter and rank search results. So the search works just like Google, but you get to sort the results by Controversial, Popular, Engaging, and Deceptive. Some pretty interesting combinations can be created using the scores. For instance, say you search for "paris hilton" but you are interested in her scandalous "achievements" rather than her community work. Well, then you search for "paris hilton" and sort by Controversial. Google can't give you that, i.e. sort results based on what impact they are likely to have on people. Thanks for the suggestions!
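A sketch of the search-then-sort behavior described above, assuming each indexed article already carries the four scores. The article records, score values, and the `search` function are made up for illustration and are not the site's real API:

```python
def search(articles, query, sort_by="popular"):
    """Keep articles whose title contains every query term, then sort by one score."""
    terms = query.lower().split()
    matches = [
        article for article in articles
        if all(term in article["title"].lower() for term in terms)
    ]
    return sorted(matches, key=lambda a: a["scores"][sort_by], reverse=True)


# Hypothetical index entries with invented scores between 0 and 1.
articles = [
    {"title": "Paris Hilton opens charity event",
     "scores": {"controversial": 0.1, "popular": 0.6, "engaging": 0.4, "deceptive": 0.2}},
    {"title": "Paris Hilton in new scandal",
     "scores": {"controversial": 0.9, "popular": 0.5, "engaging": 0.7, "deceptive": 0.3}},
    {"title": "Recession estimates revised",
     "scores": {"controversial": 0.4, "popular": 0.8, "engaging": 0.3, "deceptive": 0.1}},
]

results = search(articles, "paris hilton", sort_by="controversial")
print([a["title"] for a in results])
```

The same query sorted by "popular" would put the charity story first, so the choice of score changes which side of a topic surfaces.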

ujjwalg · on June 23, 2009

I tried it for a couple of keywords with different ways of sorting and it seems to be working great.

My personal feedback would be: to make it sticky, you need a layout similar to the Google News page, with 3 subsections (controversial, popular, engaging) in every section. You should also have similar features (keyword news and the number of news articles in every section). Then I will make it my homepage, no kidding. Currently, you are not using the available page space very efficiently.

haidut · on June 23, 2009

Yes, this is what we were thinking of having eventually: separate sections based on score. The current design is something we threw together very quickly to get it out the door. Btw, the article tags are clickable and run a search for that tag in the background. Also, you can filter news by domain. As we accumulate more data, we can start ranking domains based on the articles they have produced so far. Kinda like PageRank but not based on links. Finally, the algorithm can guess authorship and cluster articles based on author as well. So again, we can rank authors/people after some time. This seems to be a much-needed feature, as this link discusses: http://threeminds.organic.com/2009/06/docs_are_old-school_we...

I think the above article came up on HN today.

ujjwalg · on June 23, 2009

Another point I want to add: if this is possible, your algorithm should definitely be added to any search engine's page ranking system, not only to get rid of click fraud but also to improve the search results.

jackdempsey · on June 25, 2009

Just wanted to say thanks for the comments. We've worked hard on this, and there's obviously still a ways to go....but we definitely appreciate the constructive criticism, and the time taken to reply.

jack