Hacker News
MongoDB vs. Clustrix: Fault Tolerance and Availability (sergeitsar.blogspot.com)
34 points by sergei on Feb 2, 2011 | 23 comments


I'd like to correct some factual errors from this article.

1) Failover of a MongoDB Replica Set is totally automated and requires no manual intervention. The replica set remains available for writes as long as a quorum can be established between remaining members. See http://www.mongodb.org/display/DOCS/Replica+Sets for more info

2) MongoDB does support different consistency models through Write Concerns and Safe Mode. The client can choose to wait for the transaction to be written to multiple replicas if it wants. See http://www.mongodb.org/display/DOCS/Verifying+Propagation+of... for more info
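To make the write-concern point concrete: the w parameter controls how many members must hold a write before the client call returns. Here's a toy Python model of that acknowledgment rule — the function and member records are purely illustrative, not the actual driver API:

```python
def replicate(write, members, w):
    """Apply `write` to each reachable member and report success once
    at least `w` copies exist.  This mimics the semantics of MongoDB's
    getLastError({w: N}): the client blocks until N members have the
    write (here we simply count acknowledgments)."""
    acks = 0
    for m in members:
        if m["up"]:
            m["data"].append(write)
            acks += 1
        if acks >= w:
            return True
    return False

members = [
    {"role": "primary",   "up": True,  "data": []},
    {"role": "secondary", "up": True,  "data": []},
    {"role": "secondary", "up": False, "data": []},  # crashed member
]

assert replicate("doc1", members, w=2)      # primary + one secondary suffice
assert not replicate("doc2", members, w=3)  # third member is down -> unconfirmed
```

With w=1 you get the fire-and-forget-plus-primary-ack behavior; raising w trades latency for durability across replicas.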

Disclaimer: I work for 10gen


1. Say I have a 2 node replica set. Now a replica dies, permanently. How is the recovery automated? These are quotes directly from your docs:

http://www.mongodb.org/display/DOCS/Resyncing+a+Very+Stale+R...

"1. Delete all data. If you stop the failed mongod, delete all data, and restart it, it will automatically resynchronize itself. Of course this may be slow if the database is huge or the network slow.

2. Copy data from another member. You can copy all the data files from another member of the set IF you have a snapshot of that member's data file's. This can be done in a number of ways. The simplest is to stop mongod on the source member, copy all its files, and then restart mongod on both nodes. The Mongo fsync and lock feature is another way to achieve this. On a slow network, snapshotting all the datafiles from another (inactive) member to a gziped tarball is a good solution. Also similar strategies work well when using SANs and services such as Amazon Elastic Block Service snapshots."

http://www.mongodb.org/display/DOCS/fsync+Command "Lock, Snapshot and Unlock

The fsync command supports a lock option that allows one to safely snapshot the database's datafiles. While locked, all write operations are blocked, although read operations are still allowed. After snapshotting, use the unlock command to unlock the database and allow locks again."
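The behavior that quote describes — writes blocked, reads allowed, files safe to copy — can be sketched with a toy Python class. This is a simplified model of the lock semantics, not the server implementation (a real locked mongod blocks writers rather than rejecting them):

```python
class LockableStore:
    """Toy model of MongoDB's fsync+lock: while locked, writes are
    refused (the real server blocks them) but reads still succeed,
    so the data files on disk are at a consistent point for snapshotting."""

    def __init__(self):
        self.docs = []
        self.locked = False

    def fsync_lock(self):
        self.locked = True   # flush to disk, then hold off writers

    def unlock(self):
        self.locked = False

    def write(self, doc):
        if self.locked:
            raise RuntimeError("database is locked for snapshotting")
        self.docs.append(doc)

    def read(self):
        return list(self.docs)

db = LockableStore()
db.write("a")
db.fsync_lock()
assert db.read() == ["a"]   # reads are still allowed while locked
snapshot = db.read()        # safe point to copy the data files
db.unlock()
db.write("b")
```

Between fsync_lock and unlock is the window in which you'd tar up or EBS-snapshot the datafiles.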

2. Really? Is this wrong then?

http://www.mongodb.org/display/DOCS/Replica+Set+Design+Conce...

"Writes which are committed at the primary of the set may be visible before the true cluster-wide commit has occurred. Thus we have "READ UNCOMMITTED" read semantics. These more relaxed read semantics make theoretically achievable performance and availability higher (for example we never have an object locked in the server where the locking is dependent on network performance).


1. You really need a minimum of three replica set nodes, one of which can be a lightweight arbiter. If the primary fails, the secondary node will be promoted to primary automatically. In the case of a network partition, the old primary will come back up as a secondary with no problems. In the case of a true hardware failure, you can resync very quickly from a snapshot. For extra peace of mind, add more nodes to the replica set. You can have up to seven.
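The "minimum of three" advice falls out of the majority rule for elections. A tiny Python sketch of the quorum check (a simplification — real elections also weigh priorities and oplog freshness):

```python
def can_elect_primary(voting_members_up, total_voting_members):
    """A replica set can elect (or keep) a primary only while a strict
    majority of voting members -- data nodes and arbiters alike -- can
    reach one another."""
    return voting_members_up > total_voting_members // 2

# Two data nodes plus an arbiter: losing any one member leaves 2 of 3,
# still a majority, so failover is automatic.
assert can_elect_primary(2, 3)

# A bare two-node set that loses a member has 1 of 2 -- no majority --
# so the survivor steps down and refuses writes rather than risk
# split-brain.  This is why the grandparent's 2-node scenario stalls.
assert not can_elect_primary(1, 2)
```

The arbiter holds no data; it exists purely to break ties in this vote, which is why it can be a lightweight process.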

2. If you're reading from both primary and secondary nodes, then the view may not be consistent. In most cases you simply read from the primary for fully-consistent reads. You get to decide whether reads from secondaries are consistent or not by setting the write concern (i.e., the minimum number of nodes to replicate to before returning each write.)


1. Yes, I recognize that MongoDB will automatically fail over when we go from N nodes in the set to N - 1. But how do I get back to N nodes? That's completely manual.

2. What happens when I read an update that succeeded on the master but then later fails on the slaves?


1. It depends on how the node fails. If there's just a network partition, then you still have N nodes, so no issues. If you're running with durability enabled and you experience, say, a power outage, then the member should rejoin the set and resync with no issues. If a node's drive crashes, then you'll need to restore from a recent snapshot (within a day or so) or perform a complete resync if you don't have a snapshot. But this can all be done without taking the replica set offline. In that last case, there is some manual work involved. But your post, unless you've corrected it, implies that replica set failover is completely manual. That's certainly not true.

2. Outside of some kind of hardware failure, you won't have situations where writes succeed on the primary but fail on a secondary. And as I stated on your blog post, if you're really concerned about it, you can specify a write concern on insert, and if the write fails to replicate in the desired way, you'll know about it.


Sorry, but "hardware failure" is a fault, and when you can't deal with it, you're not tolerant. And with larger clusters, you see hardware faults on a regular basis. So saying we're ok in the nominal mode is not fault tolerance.


These posts are written by one of the Clustrix founders.


Are you trying to imply that the post has wrong information because of this fact? If so, attack the wrong data. I don't care who posts facts, as long as they really are facts.


His last article was intensely ignorant of MongoDB. I appreciate his attempt to promote his product, but the last one showed that he'd spent about 10 minutes on the Wiki and that's it. Or perhaps he's more informed and conveniently left out a number of things that would have made MongoDB look better. I don't want to cast aspersions, but it wasn't a good argument.

This one does seem to be more informed (and I agree with a lot of his criticism of MongoDB here), but it's almost like comparing apples to oranges. Things are done in MongoDB a certain way for a number of reasons (e.g., the query interface doesn't allow certain things in a distributed context that you could probably do with a SQL database). But I think anyone who's done large-scale MongoDB deployments can (or at least should) attest that it works well, but perhaps not as well as other solutions (or as well as it could/will work eventually/whatever).


Can you name some of the mistakes/omissions he made in the last article? It didn't seem too far off my (admittedly extremely limited) experience.


In the discussion of the previous article posted a couple of days ago, a lot of people complained that he initially didn't provide source code for the tests or go into details about Mongo's configuration. Some felt that, as someone who's worked with the internals of RDBMSs, it wouldn't be fair for him to compare something he knows intimately with something he just learned and didn't spend any time optimizing.

After he posted the code, others complained that the scenarios didn't have enough multi-table joins (which Mongo would represent as nested objects and, at least in the commenters' opinions, would probably do better than it did). There was also a lot of more detailed technical discussions but I won't try to summarize them:

http://news.ycombinator.com/item?id=2161753

(Note: these opinions aren't my own, just what I got out of the discussion.)


Yup, (nearly) all of those points are valid.

He could have gotten comparable performance by simply turning off safe writes, but the commenters point out a lot of other problems with his original assertions.


I haven't read their analysis yet (I will try to when I have some free time), but in general, I would argue that trying to compare a document database to a SQL one is always going to be somewhat misleading. I'd care more if they were comparing Clustrix to MSSQL, MySQL, or PostgreSQL.

If you are using MongoDB in a way that is similar to the way you would have used a SQL DB you are probably doing something wrong. Specifically, you are trying to place normalized data in a database designed for denormalization.


Sergei compares Clustrix to MongoDB because their target markets are very similar--not because the technology is similar.

As a startup, it behooves them to attack the low-end database market, but I suspect they've found that the primary market for a highly scalable low-end database lies on the web, and that market has chosen to go cheap-and-dirty with NoSQL. So now they're in the middle ground between fast-and-loose-and-free and my-enterprise-uses-Oracle.

I think a lot of web development is of a highly speculative, winner-take-all sort, so devs want to be as cheap as possible until they win the web lottery. For all the flaws of NoSQL, software-only solutions do allow developers to make very efficient use of their hardware by running multiple services on the same machine, or to run them in the cloud. Once they hit the jackpot, they can afford to either go Oracle, hire software developers to work around deficiencies in their data store (e.g. Facebook), or use a data store from Amazon, Google, or Microsoft.

That's a shame, because I think Clustrix is ultimately the right approach. The web has a history of doing the shittiest-and-easiest thing first (ColdFusion, anyone?) only to repent years later to the second-shittiest solution. Rinse, repeat.


> Sergei compares Clustrix to MongoDB because their target markets are very similar

Looking at a coarse, "they want money from people" granularity, the target market (people who need a database) may be similar. If we look closer, however, the target market has, more or less, two segments:

One is at the product-for-free pricepoint, where you make your money either by selling additional services (e.g. 10gen with MongoDB, Basho with Riak, MontyProgram with MariaDB and MySQL) or upselling them to an enterprise version (e.g. IBM's DB2 Express-C, whose 2GB limitation makes it perfect to hook people on a workload they would use MySQL for, Franz' AllegroStore, OpenLink's Virtuoso open source edition).

The try-our-product-for-free method means that all the cheap-and-dirty folks have something in their grubby fingers to build the next blog, mom-and-pop online store, or whatever. The folks with actual money to spend can lower their initial risk by trying out a couple of different databases to see which one fits best, without even having to ask. Only when they're actually happy with what they've got will they fork over the cash.

The next tier is the "we'll have to qualify you before we send our sales engineer" tier where you play with Oracle and IBM, or Greenplum and Vertica, because your prospective customers already know that your product is good enough.

There's no real space in between. Either you have an engineer with good knowledge but no discretionary budget (who chooses something appropriate for the task, after testing on a real workload, and makes the DB a non-topic for everyone else), in which case a comparison that people cannot replicate is not going to help you; or you have a high-level decision maker with budget power but no time to try things out or risk a couple thousand on a startup that may have gone bankrupt by the time he most cares about it. This latter category will be thoroughly unimpressed by any benchmark that is not the TPC-C or similar. No lottery whatsoever involved; these are the rules of enterprise spending.

BTW, calling ColdFusion and PHP (which were far superior alternatives to writing Perl CGI scripts without templating or any kind of library support) shitty, and Clustrix "the right approach," is something people in 2024 will just laugh at you for, even if Clustrix manages to do extremely well and ends up as the second-shittiest solution to a common problem.


Fair enough, I'll definitely take the time to read the article. Though I'm not sure it's a fair analogy to compare Mongo to ColdFusion in terms of doing the wrong thing first.


Clustrix does give you the option to start with MySQL and then do a drop-in upgrade when your idea gets traction.


That's a huge risk. What if the way I use MySQL does not go well with the way Clustrix is supposed to scale?


Please do not comment without reading the article. You do not even have the word SQL in it. Sergei is addressing the different approaches to achieve Consistency/Availability and Performance between their solution and MongoDB. A good read and I must say, I would really love to have more details and more general overview of their algorithms. They could be used for other problems.

Note that I am a huge fan of MongoDB and using it in production since the 1.4 something release.


Sorry, your comment is entirely confounding. What do you mean by "You do not even have the word SQL"? Are you suggesting Clustrix does not support structured query language, or that it's non-relational?


The article is not about the DBMS interface. It's about Fault Tolerance (what happens when stuff breaks) and Availability (can I still use my database when there is a fault).

MongoDB claims to support both. So does Clustrix. I'm comparing both claims.


Most databases that I've worked with that need to stand up to any real load do end up getting fairly denormalized. So a database might contain multiple data silos.

A job is a job and data is data. Attacking the interface doesn't seem entirely fair. SQL is used with a good number of non-RDBMS. Should it really matter if my interface is a QBE JSON document, a JavaScript function, or a SQL procedure? That seems like one of the least important concerns.


Just alerting readers to be aware of potential bias. What you said is a fair point, but we're after the same goal.



