I'm not that experienced in distributed systems, just interested in the topic.
"You often can't even do 5x the number of nodes before performance and availability starts to degrade." Why? Can you give an example? So if I have 50 Cassandra nodes, will scaling to 250 be a problem, or do you mean there will be issues at the application level as a whole during that scaling?
you probably won't write your data on all 250 nodes...
assuming you have a distributed database, a given write does not go to more than 3-9 nodes (and even 7-9 is way too many in most cases).
if you wrote to all nodes, your performance would be terrible.
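To make that point concrete, here is a minimal sketch of replica placement on a hash ring. It is an illustration under assumptions, not Cassandra's actual implementation: the helper names (`token`, `replicas_for`), the node names, and the replication factor of 3 are all made up for the example. It shows that a key is written to the same small number of replicas whether the cluster has 50 nodes or 250.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a position on the ring (hypothetical hashing choice)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def replicas_for(key: str, nodes: list[str], replication_factor: int = 3) -> list[str]:
    """Return the nodes that would store `key`: the first node clockwise
    from the key's token, plus the next replication_factor - 1 nodes."""
    ring = sorted((token(n), n) for n in nodes)
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


small = [f"node-{i}" for i in range(50)]
large = [f"node-{i}" for i in range(250)]

# The write fan-out depends on the replication factor, not the cluster size.
print(len(replicas_for("payment:42", small)))
print(len(replicas_for("payment:42", large)))
```

Both calls return three nodes, so the per-write cost stays roughly constant as the cluster grows; what changes is which nodes own which keys.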
I also do not think that they needed more than one master in their payment system. I mean it's highly unlikely that a big machine couldn't handle all their transactions.
if that were still a problem, I would use an old-school rds and shard the hell out of it.
however, keep in mind that uber did a lot of weird things in the past, from an engineering perspective.
Well, ideally you do actually want to spread writes across all the nodes evenly, but in practice this rarely works as expected, and you do get uneven loading across the cluster. Only one of many reasons why sharding is crap.
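A rough sketch of how uneven loading can happen even with a well-behaved hash function: if one key is much hotter than the rest, whichever shard owns it carries the extra traffic. The numbers here (10 shards, one key receiving 20% of requests) and the names `shard_for` and `shard_loads` are assumptions made up for illustration.

```python
import hashlib


def shard_for(key: str, n_shards: int) -> int:
    """Assign a key to a shard by hashing (simple modulo sharding)."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % n_shards


def shard_loads(keys, n_shards: int) -> dict[int, int]:
    """Count how many requests land on each shard."""
    loads = {shard: 0 for shard in range(n_shards)}
    for key in keys:
        loads[shard_for(key, n_shards)] += 1
    return loads


N_SHARDS = 10
# Hypothetical traffic: one hot key gets 2,000 requests,
# and 8,000 other keys get one request each.
requests = ["user:hot"] * 2000 + [f"user:{i}" for i in range(8000)]

loads = shard_loads(requests, N_SHARDS)
mean_load = len(requests) / N_SHARDS

# Ratio of the busiest shard's load to the average shard's load.
print(max(loads.values()) / mean_load)
```

The busiest shard ends up with at least twice the average load, because every request for the hot key hashes to the same place. Adding more shards does not help that particular key.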
Re: scaling Cassandra, I haven't experienced it personally, but I've heard horror stories regarding replication and load balancing when nodes start to go down. With other systems, you see things like uneven loading, running out of storage or compute, limits on IOPS or network bandwidth, and literally the probability of overall failures increasing because math. It's also just harder in general to manage 250 nodes rather than 50. Running commands takes longer, you need bigger infrastructure, you start running into weird limits like what your original subnet sizes were set to, etc. There are myriad variables that just become more difficult the bigger things get.
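The "because math" part can be sketched quickly. Assuming each node is independently down with probability p at any given moment (a simplification, and the 1% figure below is made up for illustration), the chance that at least one of n nodes is down is 1 - (1 - p)^n:

```python
def p_any_node_down(n_nodes: int, p_node_down: float) -> float:
    """Probability that at least one of n independent nodes is down."""
    return 1.0 - (1.0 - p_node_down) ** n_nodes


# Hypothetical per-node unavailability of 1%.
for n in (3, 50, 250):
    print(n, round(p_any_node_down(n, 0.01), 3))
```

With these assumed numbers, the chance that something is broken somewhere goes from about 3% with 3 nodes to about 39% with 50 and about 92% with 250. Whether that hurts overall availability depends on how well the system tolerates a missing node, but it does mean degraded-mode code paths are being exercised almost constantly at the larger sizes.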
Now, compare this to three giant-ass nodes. You start getting close to capacity, and maybe you add a fourth, and that lasts you another year. So. Much. Freaking. Easier. AND more reliable. There's no contest - if you can scale mostly-vertically, do it. (Unless you're Google or Facebook and have a few billion dollars to spend on engineering, in which case this actually _is_ scaling mostly-vertically, it's just datacenters instead of nodes)
Access services (e.g. browsing and searching in a product catalog) scale horizontally without any limitations. For this type of application, 4 huge servers are more expensive, less reliable, and not significantly easier.
On the other hand, services built around scarce resources (e.g. a last-minute travel booking application) are indeed very hard to scale horizontally, so in such cases the parent's advice is sound.