I'll never understand why a payment system needs to scale horizontally.
Modern computer systems can scale to 500 or more processor cores. Each core runs billions of instructions per second.
A system with a billion accounts, on the scale of Uber, probably has a million active users at any given time, with perhaps a quarter of those involved in a payment.
Is Uber saying that they can't support 250k payment transactions per second on the largest system today? That's 500 transactions per second per core on a 500-core machine, or a CPU budget of about 2ms per transaction. Why is that impossible for them?
Or, put it another way, why can't one transaction be completed in less than 1 million CPU instructions?
And that's for one of the very largest companies, like Uber. I can't even imagine a typical startup needing to scale horizontally for payment processing.
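The arithmetic above can be sanity-checked in a few lines. Every input here is an assumption from this comment (core count, clock speed, user figures), not real Uber data:

```python
# Sanity-check of the back-of-envelope figures above.
cores = 500                              # "500 or more processor cores"
instr_per_sec_per_core = 3_000_000_000   # ~3 GHz, ~1 instruction/cycle (rough)
active_users = 1_000_000
paying_fraction = 0.25

tps = int(active_users * paying_fraction)        # 250,000 payments/sec
tps_per_core = tps // cores                      # 500 per core
cpu_budget_s = 1 / tps_per_core                  # 2 ms of CPU time each
instr_budget = int(instr_per_sec_per_core * cpu_budget_s)  # 6,000,000

print(tps, tps_per_core, cpu_budget_s, instr_budget)
```

So even at these made-up rates, each transaction gets a budget of millions of instructions, which is the point of the question below.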
One box can’t be distributed across multiple racks in the data center to guard against downtime if a switch crashes. Never mind that—one box can’t be deployed across multiple data centers. If you deploy to multiple DCs you can fail over if one DC starts having issues.
Then there’s deploys. Do you canary your deploys? Deploy the next release to a subset of production nodes, watch for regressions and let it ramp up from there? Okay, I’ll give you that one, it could be done on one big box.
In any case, payments aren’t CPU intensive but it’s a prime case of hurry-up-and-wait. Lots of network IO, so while you won’t saturate the CPU with millions of transactions on the same box, I could easily imagine saturating a NIC. Deploying to shared infrastructure? Better hope none of your neighbors need that bandwidth too.
One transaction likely involves checking account and payment method status, writing audit logs, checking in with anti-fraud systems and a number of other business requirements.
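A hypothetical sketch of that hurry-up-and-wait shape (service names and latencies are invented for illustration). The independent checks overlap, so wall-clock time tracks the slowest call rather than the sum, and the CPU sits mostly idle:

```python
import asyncio

async def fake_rpc(name: str, latency_s: float) -> str:
    # Stand-in for a network round trip: the CPU does nothing while waiting.
    await asyncio.sleep(latency_s)
    return f"{name}: ok"

async def process_payment() -> list[str]:
    # The checks are independent, so issue them concurrently; the event
    # loop just waits on IO, which is why the CPU stays mostly idle.
    return await asyncio.gather(
        fake_rpc("account-status", 0.02),
        fake_rpc("payment-method-status", 0.02),
        fake_rpc("anti-fraud", 0.05),
        fake_rpc("audit-log", 0.01),
    )

results = asyncio.run(process_payment())
print(results)
```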
(I lead a payments team, not at Uber but another major tech company)
> One box can’t be distributed across multiple racks in the data center to guard against downtime if a switch crashes. Never mind that—one box can’t be deployed across multiple data centers. If you deploy to multiple DCs you can fail over if one DC starts having issues.
Wouldn't you just have multiple NICs on the one box for redundancy there? With any backup boxes being sent the database write log for replication?
> In any case, payments aren’t CPU intensive but it’s a prime case of hurry-up-and-wait. Lots of network IO, so while you won’t saturate the CPU with millions of transactions on the same box, I could easily imagine saturating a NIC.
If you're vertically scaling, wouldn't you just have the main database server host the database files locally, using fast NVMe SSDs (or Optane), in the box itself, instead of going over the network?
Enterprise NVMe drives can perform 500,000-2,000,000 IOPS, with about 60µs of latency. And Optane is about 4x faster. Why would a database server need to saturate network bandwidth?
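As a rough bound on what that means for transaction throughput (the IOs-per-transaction figure is made up, not from any real payment system):

```python
# Enterprise NVMe range quoted above, low and high end.
iops_low, iops_high = 500_000, 2_000_000
ios_per_txn = 10   # assumed reads + writes per payment transaction

tps_low = iops_low // ios_per_txn    # 50,000 txn/sec on one drive
tps_high = iops_high // ios_per_txn  # 200,000 txn/sec
print(tps_low, tps_high)
```

Even the low end of that range covers tens of thousands of transactions per second on a single local drive.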
Anyways, I'd love to see the actual SQL query for one of their transactions...
I'm largely referring to RPC calls, not DB queries. Many of those calls won't even be to services you control and may well be HTTP calls to other companies.
20 years ago we had 1000+ days of uptime on DEC kit; no one was even impressed by 500 days. Nowadays people build all sorts of elaborate contraptions to do what used to be entirely ordinary.
By uptime people usually mean availability to the end users, not the literal uptime of a box. That also includes the availability of the entire datacenter infrastructure, connectivity, and internet infrastructure, making it pretty much impossible to have high availability in a single datacenter.
Even at 250k transactions per second, that's over 20 billion transactions per day, which seems unlikely. 1k transactions per second (~100 million per day) is probably a closer ballpark figure for a company like Uber, given they only have around 3 million drivers worldwide. The problem definitely would scale vertically; however, it does depend on how they interact with the API of their payment processor.
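Checking those per-day figures (86,400 seconds in a day):

```python
seconds_per_day = 24 * 60 * 60             # 86,400

daily_at_250k = 250_000 * seconds_per_day  # 21.6 billion/day, implausibly high
daily_at_1k = 1_000 * seconds_per_day      # 86.4 million/day, the ~100M ballpark
print(daily_at_250k, daily_at_1k)
```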
> why can't one transaction be completed in less than 1 million CPU instructions
Transactions in this system are almost certainly network bound. The relevant CPU overhead is likely trivial, and likely comparison rather than arithmetic in nature. In that context “add more NICs” is effectively an exercise in horizontal scaling. On top of that, any network operation has consistency concerns to contend with.
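One way to quantify the network-bound pressure is Little's law (L = λ·W): the number of requests in flight equals throughput times latency. Using the hypothetical 250k TPS from upthread and an assumed 200ms mean external-call latency (both made-up figures):

```python
tps = 250_000        # hypothetical arrival rate from upthread
latency_s = 0.200    # assumed mean latency of an external network call

# Little's law: L = lambda * W
in_flight = tps * latency_s
print(in_flight)     # 50,000 transactions open at any instant
```

Even with ample CPU headroom, the box would hold tens of thousands of open connections at once, which is where the NIC and consistency problems live.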
You could contrive a system that is block device I/O bound but it’s likely to have significant network overhead as most block devices are network attached these days anyway!
You're being downvoted - maybe because of your tone - but I partially share your sentiment. A good counterexample to "overdistribution", not without its own set of problems, is the LMAX architecture.
Because when your task is embarrassingly parallelizable, as payments are, you owe it to yourself to take advantage of that and ensure your part of the system isn't the bottleneck.