Mongo DB is web scale (xtranormal.com)
154 points by roder on Aug 26, 2010 | 65 comments


To the good folks at 10Gen, Antirez, the Cassandra project, LinkedIn, Google, Amazon, and everyone working to advance the state of datastores available for increasingly specific applications: your work is undeniably important. The team I work with has been evaluating a handful of emerging datastores for some applications and is making plans to migrate a few types of data in some of our systems from one store to another. Thank God that we have choice beyond a standard RDBMS and BDB.

To the armies of bloggers parroting slogan after slogan, ricing benchmarks so far removed from real-world applications as to place themselves somewhere on the spectrum between meaningless and malicious in the name of pageviews, well, that's not helping anyone.

It's great to see people get excited about new technologies. But if tech/startup culture is one that embraces and celebrates the fail, then we damn well better also talk about the different areas where many different datastores are not appropriate choices, and where some of them downright break, risking extensive downtime or corruption, require unreasonable amounts of memory for indexes, need multiple machines to provide durability, or return inconsistent "postmodern" results, or what have you -- without it being understood as a personal attack on any one individual, company, or sector.

The usefulness of this video is debatable, sure. But perhaps we can appreciate that it parodies the "magic scaling sauce" image that the "tech press" (face it, we're it) has given surprisingly young but maturing technologies. There's no magic sauce, no drop-in answer to "web scale," and certainly no path there that isn't fraught with difficulty. It's data. If you have a ton of it and it's valuable, it's worth mountains of expensive programmer time and effort to ensure its integrity, accessibility, and utility based on the storage and query requirements of an application.


Totally agree. As a beginner programmer, I am totally confused about which database to use and, in particular, when. I have read so many blog posts that contradict each other. I have looked into CouchDB, Cassandra, MySQL, Neo4j, HBase, Persevere, and a few more. I have just decided to learn each one well and judge as my knowledge grows, so I have started with MySQL and CouchDB.


Personally, I love Postgres; I've been burned by MySQL in the past. Here's the thing: you will be infinitely more employable if you know SQL well, and you'll easily retrain from one RDBMS to another if a better SQL database comes along.


Don't forget MonetDB (performance) and SQLite (simplicity) if your project is mostly reading.


You should try an RDBMS too :P


There aren't any relational databases doing anything useful, only non-relational databases, like those based on SQL.


To clarify this comment a bit: The conventional RDBMSs people refer to as relational databases (Oracle, Postgres, MySQL, SQLite, etc.) shoot for compatibility with the SQL standard, which diverges from the relational model. For example, you can have duplicate rows in a table, which doesn't make any sense in the (set-based) relational model. SQL is not "purely" relational.

If you read Chris Date's books (I recommend starting with _An Introduction to Database Systems_), he really hammers this point.

There are people who hate SQL because it's too relational, and there are people who hate SQL because it's not relational enough.
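
To make the duplicate-row point concrete, here's a minimal sketch with Python's sqlite3 (any SQL engine behaves the same way when no key is declared):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")  # no primary key declared
    conn.execute("INSERT INTO t VALUES (1)")
    conn.execute("INSERT INTO t VALUES (1)")  # SQL happily stores an exact duplicate
    print(conn.execute("SELECT COUNT(*) FROM t").fetchone())  # prints (2,)
    # In the (set-based) relational model a relation cannot contain two
    # identical tuples, so the second insert would have been a no-op.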


These videos are certainly funny, particularly the iPhone vs HTC one that was popular a few months ago. But as cscotta pointed out there's an interesting trend of picking certain flaws in particular databases and highlighting them without telling the whole story.

Anyone watching this might believe that MongoDB has a major flaw where data isn't written immediately and you have no idea if it has successfully been stored to disk. Whilst it's true that inserts are not immediately written by default, you can (a) change the startup config to set the flush delay, (b) force a write of all pending changes from the command line, (c) force the write from your call to the insert/update/remove/etc. method in all the libraries, and, most importantly, (d) have the library method wait for and return the response from MongoDB so you can determine whether the write was successful.

This means you have complete control over when you need fast inserts at the expense of potential data loss, or when you need to be certain the data has been written.
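
In pymongo terms (this uses the pre-2.x driver API of the era; names vary by version, so check your driver's docs), those options look roughly like this:

    from pymongo import Connection

    conn = Connection()  # defaults to localhost:27017
    db = conn.mydb

    # Default: fire-and-forget. The call returns before the server
    # confirms anything -- which is what the fast benchmarks measure.
    db.events.insert({"type": "click"})

    # (d) Block until the server acknowledges the write
    #     (getLastError under the hood) and raise on failure.
    db.events.insert({"type": "purchase"}, safe=True)

    # (b) Force all pending writes to disk immediately.
    conn.admin.command("fsync")

    # (a) is a server-side setting: start mongod with --syncdelay <seconds>.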


While you're correct, I don't believe the point was that you had no control over it. The point was that their benchmarks were run with the delayed-write setting and compared against databases that ARE writing to disk. He's saying that mongo is cheating the benchmark by doing work after the timer has stopped.


Maybe cheating is ok. Sure, there is a reason why it's faster. Maybe the guarantee of the write to disk is not as necessary as some people would believe.

It's certainly not the same as writing to /dev/null.

Maybe the developer can have other ways of guaranteeing consistency.


Ironic that transactions and guaranteed data integrity, once dismissed as performance tradeoffs, are now being used as arguments for MySQL and against those pesky new up-and-coming open source databases.


Funny: yes. Meaningful: no. The tradeoffs between MySQL and memory-backed k-v stores are in a different league than the old tradeoffs between MySQL, Postgres, and Oracle.


I wouldn't call it a different league.

MySQL had severe data consistency issues before InnoDB came around and even today, on InnoDB, there is a variety of situations that will cause silent data truncation or silent data loss.


tl;dw: 99% of web applications will never need to be "web scale," so quit worrying and just use an RDBMS.


99% of web applications will never need relations, so just go the easy route and don't worry about a relational schema for your data.


Uhh, in the context of an RDBMS a "relation" doesn't mean a "relationship" in the sense of a foreign key; it means a relation in the relational-algebra sense. Saying most applications will never need relations is like saying most applications don't use data.


Most applications won't need the features of an RDBMS that MongoDB & friends don't provide, so whatever definition of "relation" I used, it's irrelevant. MongoDB provides enough "relational" algebra for 99% of web apps, without the additional "relational" constraints that an RDBMS forces on you.


Oh Codd!


Quit Datein' on the young man.


I'm laughing hard thanks to this comment :D


Technically a relation has to have a primary key so that all the set theory is valid. A system based on bags, like SQL allows, can still hold data, though.


Actually that's one of the most ignorant statements I've ever heard. I'm hard pressed to think of a single application that does not have related sets of data.


Sure, the guy's language was imprecise, but the SQL implementation of the relational model was imprecise as well, so we can all get it wrong with gay abandon.

The funny thing about relational databases is that they model relationships fairly loosely; the relationships are mostly understood and expressed by the user, not the system. Object systems have a stronger notion of a one-way relationship via a reference, anyway.

How many systems should care about relationships? All of them can.

Here in Australia they want to put everyone's health records online, and connect all the hospitals, doctors, drug stores, testing labs, etc. to those records.

So far they have spent heaps, $500 million-ish, on a system that does this:

http://www.abc.net.au/rn/healthreport/stories/2010/2975642.h...

I think this is a good place for something like a non-relational database. You just create a sequential list of events for a person and you are done.

You can use something like CouchDB. If you lose your network, you can have a local replica at your local hospital and doctors office. Everything re-syncs later.
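
That re-sync is just an HTTP call against CouchDB's _replicate endpoint; here's a sketch with Python's requests (hostnames and database names are placeholders):

    import requests

    # Push the clinic's local replica back up to the central server
    # once the network comes back; "continuous" keeps it syncing.
    resp = requests.post(
        "http://localhost:5984/_replicate",
        json={
            "source": "patient_events",
            "target": "http://central.example.org:5984/patient_events",
            "continuous": True,
        },
    )
    print(resp.json())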

I think the whole thing is in analysis paralysis because they are trying to systematize the health universe so the meaning of all data is unambiguous. While that sounds grand it will take forever.

I blame the whole failure on a drive to systematize data, which SQL databases foster. If you just view health records as a bunch of bits of paper, and just say we want to store them in one place, in a way that is vaguely sequential, the problem is much simpler.

CouchDB is after all web scale, shard friendly, and there is no impotence mismatch. After seeing this video I am always going to want to say it that way.


>I blame the whole failure on a drive to systematize data, which SQL databases foster. If you just view health records as a bunch of bits of paper, and just say we want to store them in one place, in a way that is vaguely sequential, the problem is much simpler.

It's simpler, and tremendously less usable.


A system that's actually running is more useful than one just planned. Really though, current paper records are "document oriented databases", where each record is self-describing (at the cost of redundancy) and access latency is the biggest issue. Upgrading to electronic document oriented databases would actually be a very natural fit, whilst still leaving open the possibility of further relational optimization in the future.


"A system that's actually running is more useful than one just planned"

Absolutely. To go back to the original point, I don't think it's document-centric or not that causes the massive cost overruns. Instead it's that big consulting companies know that they can absolutely ROB the government blind and get away with it. Here in Canada we've seen the same farce with the long gun registry, with the health records here in Ontario....just about any government-related project is a boondoggle.


I worked on a project at a bank that started out insignificant and became significant, and the loss of quality and the increase in waste were phenomenal.

It is great perceived importance that makes a boondoggle.

A few employees, government or otherwise, doing a project of no great significance, will not waste much.


I'm not quite following. However, the projects I am talking about have pissed away hundreds of millions to over a billion dollars. That is what makes them significant.


If people did not consider those projects significant in some way other than costing millions of dollars, they would cancel them.

Being insignificant creates a need for efficiency.


> Actually that's one of the most ignorant statements I've ever heard

It's pretty clear (e.g., from his use of the term 'schema') that he meant in-DB relations, rather than in-code relations.


What about this one? http://aws.amazon.com/s3/


One region hosts many client accounts. One client account has many services. Services are linked to charge rate tables. One account can have multiple logins. One service has many log entries. ...


Plain text files FTW!!! :-)


That's how I keep my source code. People have tried more structured storage, and text is winning.


Discover Smalltalk.


Text is winning.


I would also wager that 99% of web applications have their data normalized across multiple tables, requiring many joins to reconstruct the data for use by the application. Document databases can provide a more representative structure for storing data than splitting it across multiple tables and records. Scalability is not usually my primary concern when choosing a data store. It all comes down to choosing the right tool for the job and not always defaulting to a hammer for every task.


This is hilarious:

"Why not write to /dev/null? It's fast as hell"

"Does /dev/null support sharding?"


I choked on that one too, but mainly because it echoes the kind of thing pseudo-knowledgeable non-tech folk say in real life. At least in my real life.


Yeah right. And /dev/null jokes are as old as Unix and always good for a laugh.


Hilarious video... but it reminds me of the painful fact that I know very little about databases, so choosing between SQL and NoSQL for my web app just seems arbitrary to me. I wish there were some simpler explanations out there.


First learn SQL and the relational paradigm.

SQL represents 40 years of experience designing general data stores that work for the widest set of applications. SQL databases solve many very hard problems which you are probably not aware of.

If you jump into NoSQL first you will be reimplementing SQL features in your application code, and doing a shitty job of it, because you have no experience with data stores that actually solve these problems well.

The reason for the existence of so many NoSQL databases is the rise of web applications and the need to scale massively. However, the majority of apps will never need to scale beyond a single well-tuned database server anyway. By the time they do, you will have hard problems to solve regardless of what data store you used. The advantage of SQL is that it's a fantastic hedge on the evolution of your data usage patterns, because it is designed to support ad-hoc queries well, and the schema prevents bad application code from thrusting your data into chaos at the first occurrence of a small bug.

Realistically if you knew you had to build an app for 5 million daily users, and you knew exactly what it was going to do, then an SQL database very well might be the wrong choice. But in the real world you have a long road ahead before you hit that scale, and you'll have real data to determine what kind of alternate data stores can best handle your load. Personally I'm a huge fan of redis, and its ability to scrape bottlenecks off a MySQL database in a piecemeal fashion.
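
The redis-on-top-of-MySQL pattern is simple enough to sketch; assuming a hypothetical events table, caching a read-heavy aggregate looks roughly like this:

    import json
    import MySQLdb
    import redis

    r = redis.Redis()
    db = MySQLdb.connect(db="myapp")

    def hot_counts(user_id):
        """Serve a read-heavy aggregate from Redis, falling back to MySQL."""
        key = "hot_counts:%d" % user_id
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        cur = db.cursor()
        cur.execute(
            "SELECT kind, COUNT(*) FROM events WHERE user_id = %s GROUP BY kind",
            (user_id,))
        result = dict(cur.fetchall())
        r.setex(key, 60, json.dumps(result))  # expire after 60 seconds
        return result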


NoSQL isn't just about scale. Sometimes, modeling data relationally is a real bitch, and a document store makes more sense.


It's not a huge win though, because most of the time you still do have some relational data, and there's no reason you can't dump documents in an SQL database. The fact that the interface is new, shiny and slightly more elegant for this simple degenerate use case doesn't carry much weight with me.


Real admins use the best tool for the job, anyway. "Polyglot persistence" is where it's at: You keep your sessions in Redis, you keep your news feed in MongoDB, and you keep your credit card details in Postgres.

This is a false dichotomy.


There needs to be a threshold of utility before you add an additional data store to your application.

Just because you want to use some unstructured data (which was your original example) doesn't mean you need a new data store that's optimally suited to that. You can store documents just great in an SQL database or in the filesystem.
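
For example, here's a sketch of document storage in plain sqlite3, assuming you only ever fetch documents by key:

    import json
    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, body TEXT)")

    doc = {"title": "hello", "tags": ["intro", "meta"]}
    conn.execute("INSERT INTO documents VALUES (?, ?)",
                 ("post:1", json.dumps(doc)))
    conn.commit()

    row = conn.execute("SELECT body FROM documents WHERE id = ?",
                       ("post:1",)).fetchone()
    print(json.loads(row[0])["tags"])  # ['intro', 'meta']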


This is the important bit. Just about every company or web app will have a bunch of relational data and therefore an SQL database of some kind. And storing key-value pairs in an SQL database is really not that hard or inconvenient or slow or whatever - it's gonna be 'good enough' for just about everyone. Why bother supporting an entire extra dedicated KV store when the SQL DB I already have will work just fine?

'Best tool for the job' is an oversimplification.


And lots of NoSQL databases do have these kinds of value-add features. For example, FourSquare writes data into Mongo (and Postgres, as well, actually...) because Mongo has location features built in.

http://www.mongodb.org/display/DOCS/Geospatial+Indexing


Sure, you can create a massive JSON hash and store it in an SQL DB field. However, most SQL DBs that I'm aware of do not support searching that massive hash. You need to come up with index tables.

The nice thing about Mongo is you can query that hash without having to have an index table.
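
For instance, a pymongo sketch (dotted paths reach into embedded documents and arrays; the index is optional but speeds things up):

    from pymongo import Connection

    posts = Connection().blog.posts
    posts.insert({"title": "web scale",
                  "meta": {"tags": ["mongo", "satire"], "views": 9001}})

    # Query straight into the nested hash; no separate index table needed.
    for post in posts.find({"meta.tags": "mongo"}):
        print(post["title"])

    # Optionally index the embedded field.
    posts.ensure_index("meta.tags")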


Yeah, the reason I was thinking of using mongo is because I have to write a simple posting app (i.e., blog or Twitter-style posts with tags), so defining a relational schema seems like overkill to me. But I can't be sure.

To the parent: thanks for the edifying comment. I do have some experience with SQL and relational DBs, but was thinking of using noSQL for some projects. Your point to thoroughly learn the relational model is well taken.


Even here, you might be surprised by how limited some NoSQL stores are. E.g., if you have a requirement like "find the first 20 blog posts written by X after this date."
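
In SQL that requirement is a one-liner; a self-contained sqlite3 sketch (schema hypothetical):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (author TEXT, published_at TEXT, title TEXT)")
    # ... insert some posts ...
    rows = conn.execute(
        """SELECT title FROM posts
           WHERE author = ? AND published_at > ?
           ORDER BY published_at
           LIMIT 20""",
        ("X", "2010-08-26")).fetchall()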


Knowing the relational model is really important, but you're also going to have to unlearn it in order to do a good job with NoSQL. It's a totally different set of rules.


Can you give one instance of this?


This sort of thing is hard without getting into tons of specifics. Now, I was raised on good old normalization and all of that, but it doesn't mean that I'm a master at it. It's possible that there's a good way of relating this that I'm just overlooking. Or maybe I should have just denormalized it from the start.

Okay, so we have a domain object, Foo. Foos represent individual instances of a Foo that a user has, but we want to keep general information about the different standard types of Foo, so we also have a relation between Foos and FooTypes. Oh, and each FooType can have a few different sizes, and some FooTypes are the same sizes as each other, so we also need a FooSize. Not only do we need to model which sizes each FooType can come in, but when a User has a Foo, we gotta know which sized one they have. All this stuff... it's complicated.

It probably would have been much easier for me to have just done each FooType up as a document, with an embedded array of sizes, and then each Foo gets a document, with its own copy of the data. Yeah, there's nothing saying that you can't store de-normalized data in a relational database, but if you're not going to use its features, why not just use the tool that's designed for that use-case?
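
Roughly, the document version might look like this (a pymongo sketch; the names come straight from the example above, and the structure is purely illustrative):

    from pymongo import Connection

    db = Connection().app

    # Each FooType carries its sizes inline; no FooSize join table.
    db.foo_types.insert({"name": "widget", "sizes": ["S", "M", "L"]})

    # Each user's Foo is self-contained, with its own copy of the data.
    db.foos.insert({"user_id": 42, "type": "widget", "size": "M"})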


Traditional SQL databases provide some baked in guarantees regarding relationships between data and data structure. You design a schema and the server ensures that required columns exist, relationships between keys are enforced, and that all data is of the appropriate type. You're also given guarantees that certain things are done atomically. This means that a series of actions are performed with the promise that nothing will interfere between step one and step N. This promise is fulfilled through the use of transactions which incur a performance hit.
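
For example, with Python's sqlite3 (any SQL database offers the same guarantee):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
    conn.execute("INSERT INTO accounts VALUES ('bob', 0)")

    try:
        with conn:  # both updates commit atomically, or neither does
            conn.execute(
                "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
            conn.execute(
                "UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    except sqlite3.Error:
        pass  # on error the transaction rolls back; neither update applied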

NoSQL has become an increasingly broad term used to indicate some other storage strategy than traditional SQL databases provide. NoSQL databases tend to operate on a more basic level and omit things like transactions in favor of performance. Schemas are also disregarded to allow arbitrary storage requirements to be addressed with ease; although, you lose the traditional guarantees that a schema can provide.

This is just my opinion, but I think it's a lot easier to set up a clusterfuck storage nightmare using a NoSQL database. If you're unfamiliar with SQL and NoSQL, you're most likely going to end up in a safer and more recoverable situation going with something tried and true like PostgreSQL.

If you're not sure whether you need a NoSQL system, then you don't need one. They serve a use case and performance requirement that very few websites demand. That being said, when the need does arise, they can be indispensable for scaling out a large website. Like anything else, they're not a magic bullet that you just drop into your storage strategy and go on your merry way. Rolling them out requires lots of planning, just like any other large-scale deployment.


I'm currently using mongodb for a new project at work and the biggest surprise for me is that there is no elegant solution for doing the SQL equivalent of COUNT(DISTINCT field). Count exists, distinct exists to return the set to you, but the combination isn't there.

The only solutions I have found are to check the length of the distinct query, which takes too long for a large result set, or to write a map/reduce function, which takes longer than I'd like and is a large amount of code for functionality that should already exist in the db.
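
For reference, the two workarounds look roughly like this in pymongo (module paths and signatures vary across driver versions):

    from pymongo import Connection
    from bson.code import Code

    coll = Connection().mydb.events

    # Workaround 1: pull the whole distinct set client-side and count it.
    n = len(coll.distinct("field"))  # slow for large result sets

    # Workaround 2: map/reduce, a lot of machinery for COUNT(DISTINCT field).
    out = coll.map_reduce(
        Code("function() { emit(this.field, 1); }"),
        Code("function(key, values) { return 1; }"),
        "distinct_field")
    n = out.count()  # one output doc per distinct value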


I hate the fact that I want to upmod this submission.

(I withstood the urge, you probably should too)


As usual, use the technology that fits your needs.

If you want a lot of speed and can tolerate the possible (even if unlikely) data loss, use NoSQL.

If your business requires that your data is guaranteed and always up-to-date at any moment, then use RDBMS.


NoSQL doesn't mean NoDurability. Many of them offer better multi-datacenter durability and availability by dropping ACID for BASE. I suspect that most data by volume belongs in a BASE database. Some of them also support a mixture of immediate and eventual consistency in the same database.
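
For instance, Cassandra exposes consistency per request; a sketch with the (much newer) DataStax Python driver, keyspace and table hypothetical:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("myks")

    # Eventually consistent read: any single replica may answer.
    fast = SimpleStatement("SELECT * FROM events WHERE id = 1",
                           consistency_level=ConsistencyLevel.ONE)

    # Immediately consistent read: a quorum of replicas must agree.
    safe = SimpleStatement("SELECT * FROM events WHERE id = 1",
                           consistency_level=ConsistencyLevel.QUORUM)

    rows = session.execute(safe)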


Durability is a generic topic, not just a NoSQL problem.


Some of the NoSQL solutions take durability very seriously, some put it second to looking good in benchmarks. NoSQL is about choice, and the most durable of the NoSQL stores are more durable than many of the venerable relational databases.

For instance, what CouchDB treated as a major bug is the accepted behavior of many relational databases. (E.g., data isn't lost, but must be recovered via a long-running process should there be an uncontrolled shutdown.)

Riak and Cassandra also have modes that treat durability as paramount, and give you better assurances than MySQL or even commercial RDBMS products.


How ironic – an article about "web scale" and a "503 Service Unavailable No server is available to handle this request." error when I try to open the link :) (I'm certain that Mongo DB is not the one to blame though)


Is this some kind of generative cartoon? Plug in a dialogue as text, and it generates two teddy bears speaking the dialogue?


Funny stuff. How did you make that, and how do you synchronize the mouth movement with the dialogue?


http://www.xtranormal.com/

"If you can type, you can make movies."




