PredictionIO – open source machine learning for predictive features

duaneb · on April 10, 2014

I wish it didn't depend on MongoDB. Is there a good reason for not using a more general purpose database, like SQL or any of the billion key value stores out there?

dszeto · on April 10, 2014

We picked MongoDB a while ago because of its built in geospatial indexing support. Would be happy to add support of other databases. Any strong candidates on top of your head?

darksaints · on April 10, 2014

Both Postgres and SQLite have extensive geospatial support (PostGIS and SpatiaLite respectively), including indexing and boatloads of useful geospatial functions. And in my experience, both of them are far more advanced than Mongo's geospatial support.

duaneb · on April 10, 2014

I haven't worked with geospatial databases as much (so verify this with someone who has), but Postgres is a good bet for an alternative to mongo in this scenario.

yaph · on April 10, 2014

PostgreSQL / PostGIS

todazar · on April 10, 2014

RethinkDB!

bduerst · on April 10, 2014

I know it's a pain to get data into it, but how about Big Query?

tstonez · on April 10, 2014

Ideally open-source. Surprised MongoDB 2.2+ is such an issue but thanks for the feedback and suggestions. We'll look into PostgreSQL, Riak and RethinkDB.

bduerst · on April 10, 2014

It is based on an open-sourced db called Dremel:

https://code.google.com/p/dremel/

yipjustin · on April 10, 2014

But dremel doesn't support incremental updates. Dremel is designed for read-only data. All its columns are indexed. The whole table needs to be re-built after update.

smhchan · on April 10, 2014

MongoDB is used as the datastore for unstructured data, e.g. item attributes and user attributes. It's also used as a cache for prediction results, so queries like geospatial search can be performed.

There is no specific reason reason to stick with MongoDB only. It just happens to be the database the team has picked for the first implementation. It is very likely to support other databases in the coming future given the strong community demand.

lefrancaiz · on April 10, 2014

It also comes with a Vagrant box if you're looking to just try it out with your dev environment. I've tried it personally and had a good experience with it. I spot checked the recommendations and they felt really good.

http://docs.prediction.io/current/installation/install-predi...

wheaties · on April 10, 2014

Sounds fantastic right up until I read requires MongoDb. Thankfully it's open source and we can fix that.

dszeto · on April 10, 2014

We can fix it together too if you prefer. Any suggestions on database choice? We are open to discussions and feedback.

wheaties · on April 10, 2014

Yeah, I've been looking at PredictionIO. I'll happily contribute. I'll reach out to you guys.

smhchan · on April 10, 2014

What would be the better alternatives, in your opinion?

berto99 · on April 10, 2014

Why is mongo a problem?

micro_cam · on April 10, 2014

For me it is a red flag in terms of scalability as lots of our data sets won't fit in mongo backed by a 1-2 TB disk even if they take up < 100 GB in the original format (usually binary/compressed genetic data).

It also uses a ton of ram and performance really suffers when the data won't fit in ram so it isn't a great choice if you are trying to push the limits of what your machines can do.

They are only using it to store models and whatever "behavioral data" is but models for things like random forests can be really big and you want to be able to write/read trees from separate machines etc.

I wonder why they chose to use mongo vs local disk or HDFS which they already require.

smhchan · on April 10, 2014

it's the real-time prediction query, e.g. geospatial search, that makes use of mongo's indices.

micro_cam · on April 10, 2014

Thanks for the clarification, the write up isn't clear. Have you benchmarked against postGIS or stock mysql? And tried any larger-than-memory databases?

We were using mongo in a suit of web applications that display the results of ML and statistical analysis of cancer data and we've found its query performance lacking in a number of cases...I think the mongo geospatial index is a pretty simple geohash setup on top of their normal query engine and I would expect it to have the same issues.

I do think this project is very interesting, just providing my feedback based on doing similar work.

Memory overhead of both mongo and hadoop would actually be my biggest worry since, especially on desktop workstations it is quite common for machine learning tools in R or python to need most of the available memory when tackling even small problems.

duaneb · on April 10, 2014

Unless there's something about Mongo that means it's perfect for machine learning (unlikely), the last thing I want to maintain is yet another database because they didn't offer any choice.

ironchef · on April 10, 2014

A number of people have been bit by issues in mongo in the past such as: the approach it had taken to write locking, that it has silently discarded writes in certain cases, the charge that it uses inflated storage on disk, and the performance characteristics when the working set does not fit into memory. I'm sure there are more but when it arrived it had great marketing as was touted as the greatest thing since sliced bread. Unfortunately, some people ended up with horrendous sandwiches and remember the awfulness of said sandwiches.

lukasm · on April 10, 2014

I heard about two cases when MongoDB failed doing The Most Important Thing - storing data. No one really care about autosharding, no migrations etc. if you can't store the data. Due to some replication issue data was inconsistent.

berto99 · on April 10, 2014

But can't this happen to any db system? Mongo is pretty new and I'm not surprised things like this happens from time to time until the kinks are worked out. The new version of Mongo looks pretty good as well.

lucian1900 · on April 10, 2014

Not really, no database is this careless with user data.

Mongo is just really bad quality. Never use it, there's always something better in every way.

smhchan · on April 10, 2014

happy to explore other choices together with the community. Some users have voted Riak: https://predictionio.uservoice.com/forums/219398-general/sug...

louthy · on April 10, 2014

Riak is a good call. A SQL db would be nice too.

ing33k · on April 10, 2014

and what you replace it with ?

jey · on April 10, 2014

How did these guys get distribution on an official Mozilla blog despite being a third-party project?

rnyman · on April 11, 2014

I'm Robert, the Editor of Mozilla Hacks. We publish articles about anything regarding to the Open Web and is open source that we believe developers can learn and get inspired from.

More information at https://hacks.mozilla.org/about/

idlewan · on April 10, 2014

It's not new for hacks.mozilla.org to display write-ups about external projects that relate to the web.

dminor · on April 10, 2014

One of the really nice things about PredictionIO is that it comes with a dozen different recommendation algorithms out of the box, and lets you simulate their results with your data. This makes it much easier to decide which is the right one to use.