Not a question necessarily about the technical side, but I'm interested in your opinion as to the root cause – is it a desire to achieve certain results for marketing purposes, a lack of understanding or training in the team about distributed systems, just bugs and a lack of testing...? Alternatively, does most of this come down to one specific technical choice, and why might they have made that choice?
Very happy for (informed) speculation here; I recognise we'll probably never know for certain, but I'm interested in avoiding similar mistakes myself.
There are a few things at play here. One is talking only about the positive results from the previous Jepsen analysis while not discussing the negative ones. Vendors often try to present findings in the most positive light, but this was a particularly extreme case. Not discussing default behavior is a significant oversight, especially given that ~80% of people run with the default write concern, and 99% run with the default read concern.
The middle part of the report talks about unexpected but (almost entirely) documented behavior around read and write concern for transactions. I don't want to conjecture too much about motivations here, but based on my professional experience with a few dozen databases, and surveys of colleagues, I termed it "surprising". The fact that there's explicit documentation for what I'd consider counterintuitive API design suggests this is something MongoDB engineers considered, and possibly debated, internally.
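To make the defaults point concrete, here's a minimal PyMongo sketch (connection string, database, and collection names are illustrative) of pinning write and read concern explicitly, at both the client and the transaction level, rather than relying on the defaults:

```python
# Minimal sketch (PyMongo): explicitly pinning read/write concern rather than
# relying on defaults. Connection string, database, and collection names are
# illustrative only.
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

# Client-level settings: without these, writes and reads fall back to the
# server/driver defaults, which may be weaker than you expect.
client = MongoClient(
    "mongodb://localhost:27017/?replicaSet=rs0",
    w="majority",                 # acknowledge writes on a majority of nodes
    readConcernLevel="majority",  # read majority-committed data
)
accounts = client.bank.accounts

# Transaction-level concerns: set on the transaction itself, since a
# transaction's read/write concern governs every operation inside it.
with client.start_session() as session:
    with session.start_transaction(
        read_concern=ReadConcern("snapshot"),
        write_concern=WriteConcern("majority"),
    ):
        accounts.update_one({"_id": 1}, {"$inc": {"balance": -10}}, session=session)
        accounts.update_one({"_id": 2}, {"$inc": {"balance": 10}}, session=session)
```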
The final part of the report talks about what I'm pretty sure are bugs. I'm strongly suspicious of the retry mechanism: it's possible that an idempotency token doesn't exist or isn't properly used, or that MongoDB's client or server layers improperly interpret an indeterminate failure as a determinate one. It seems possible that all four phenomena we observed stem from the retry mechanism, but as discussed in the report, it's not entirely clear that's the case.
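To illustrate what an idempotency token buys you here, a toy sketch (purely conceptual, not MongoDB's actual client or server code): the client attaches one token per logical write and reuses it across retries, so a retry after an indeterminate failure can't apply the write twice.

```python
# Conceptual sketch only -- not MongoDB's implementation. It shows why a retry
# layer needs a unique idempotency token per logical write: without one,
# retrying an indeterminate failure (a timeout where the first attempt may or
# may not have applied) can execute the write twice.
import uuid

class Server:
    def __init__(self):
        self.applied = {}   # token -> result, so duplicate requests are detected
        self.balance = 0

    def execute(self, token, delta):
        if token in self.applied:        # retry of a write we already applied
            return self.applied[token]   # return the original result, don't re-apply
        self.balance += delta
        result = {"ok": True, "balance": self.balance}
        self.applied[token] = result
        return result

def write_with_retry(server, delta, network):
    token = str(uuid.uuid4())            # one token per logical write, reused across retries
    for attempt in range(3):
        try:
            return network(server.execute, token, delta)
        except TimeoutError:
            # Indeterminate failure: the write may or may not have applied.
            # Because the token is reused, a successful retry won't double-apply.
            continue
    raise TimeoutError("write result unknown after retries")

# Simulate a network that delivers the request but drops the first reply.
calls = {"n": 0}
def flaky_network(fn, *args):
    result = fn(*args)                   # request reaches the server
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("reply lost") # client can't tell whether it applied
    return result

server = Server()
print(write_with_retry(server, 10, flaky_network))   # balance ends at 10, not 20
```

Drop the token bookkeeping and the same run applies the increment twice; misreport the timeout as a definite failure and the client may give up on a write that actually committed. Those are the two failure shapes I'm gesturing at above.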
I get the impression that MongoDB may have hyped themselves into a corner in the early days with poorly made (or misleading) benchmarks. Perhaps they have customers with a lot of influence determining how they think about performance vs consistency.
Maybe this, combined with repeatedly patching and re-patching their replication logic and consistency algorithm, means they'll be stuck in this sort of position for a long time.
Possibly! You're right that path dependence played a role in safety issues: the problems we found in 3.4.0-rc3 were related to grafting the new v1 replication protocol onto a system which made assumptions about how v0 behaved. That said, I don't want to discount that MongoDB has made significant improvements over the years. Single-document linearizability was a long time in the works, and that's nothing to sneeze at!