For me it is a red flag in terms of scalability, as lots of our data sets won't fit in Mongo backed by a 1-2 TB disk even if they take up < 100 GB in their original format (usually binary/compressed genetic data).
It also uses a ton of RAM, and performance really suffers when the data won't fit in RAM, so it isn't a great choice if you are trying to push the limits of what your machines can do.
They are only using it to store models and whatever "behavioral data" is, but models for things like random forests can be really big, and you want to be able to write/read trees from separate machines, etc.
I wonder why they chose Mongo over local disk or HDFS, which they already require.
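To make the local-disk alternative concrete, here is a minimal sketch of what I mean, assuming each tree is just a picklable Python object. The function names and the `tree_00000.pkl` file layout are made up for illustration and are not from the project. The point is only that one file per tree lets separate machines write and read individual trees without a database in the middle.

```python
import pickle
from pathlib import Path


def save_trees(trees, directory):
    """Write each tree to its own file so workers can handle trees independently.

    Hypothetical helper: the one-file-per-tree layout is an assumption.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, tree in enumerate(trees):
        path = directory / f"tree_{i:05d}.pkl"
        with path.open("wb") as fh:
            pickle.dump(tree, fh, protocol=pickle.HIGHEST_PROTOCOL)
        paths.append(path)
    return paths


def load_tree(directory, index):
    """Read back a single tree without touching the rest of the forest."""
    path = Path(directory) / f"tree_{index:05d}.pkl"
    with path.open("rb") as fh:
        return pickle.load(fh)
```

The same layout would work on a shared filesystem or HDFS mount, as long as the directory is visible to every machine.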
Thanks for the clarification, the write-up isn't clear. Have you benchmarked against PostGIS or stock MySQL? And have you tried any larger-than-memory databases?
We were using Mongo in a suite of web applications that display the results of ML and statistical analysis of cancer data, and we've found its query performance lacking in a number of cases. I think the Mongo geospatial index is a pretty simple geohash setup on top of their normal query engine, and I would expect it to have the same issues.
I do think this project is very interesting, just providing my feedback based on doing similar work.
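For anyone unfamiliar with geohashing, here is a minimal sketch of the standard encoding. This is the textbook algorithm, not Mongo's actual code, and `geohash_encode` is my own name for it. It interleaves longitude and latitude bits and emits base-32 characters, so nearby points usually share a string prefix and an ordinary ordered index can serve as a spatial index.

```python
# Standard geohash alphabet (no a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

A shorter hash is always a prefix of a longer one for the same point, which is what makes prefix range scans work. The known weak spot is that two points on either side of a cell boundary can be close together and still share no prefix, so a proximity query has to check neighboring cells too.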
The memory overhead of both Mongo and Hadoop would actually be my biggest worry since, especially on desktop workstations, it is quite common for machine learning tools in R or Python to need most of the available memory when tackling even small problems.
Unless there's something about Mongo that means it's perfect for machine learning (unlikely), the last thing I want to maintain is yet another database because they didn't offer any choice.
A number of people have been bitten by issues in Mongo in the past, such as: its approach to write locking, silently discarded writes in certain cases, the charge that it uses inflated storage on disk, and the performance characteristics when the working set does not fit into memory. I'm sure there are more, but when it arrived it had great marketing and was touted as the greatest thing since sliced bread. Unfortunately, some people ended up with horrendous sandwiches and remember the awfulness of said sandwiches.
I heard about two cases when MongoDB failed at The Most Important Thing: storing data. No one really cares about autosharding, no migrations, etc. if you can't store the data. Due to some replication issue, the data was inconsistent.
But can't this happen to any DB system? Mongo is pretty new, and I'm not surprised things like this happen from time to time until the kinks are worked out. The new version of Mongo looks pretty good as well.