
I'll never understand why a payment system needs to scale horizontally.

Modern computer systems can scale to 500 or more processor cores. Each core runs billions of instructions per second.

A system for a billion accounts on the scale of Uber probably has a million active users at a time, with perhaps a quarter of that involving payment.

Is Uber saying that they can't support 250k payment transactions per second on the largest system today? That's maybe 1000 transactions per second per core on the largest systems, or about 1ms per transaction. Why is that impossible for them?

Or, to put it another way, why can't one transaction be completed in less than 1 million CPU instructions?
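
Napkin math, as a quick Go sketch (the 500 cores and ~3 GHz clock are just illustrative assumptions, not Uber's actual hardware):

    package main

    import "fmt"

    func main() {
        // Assumed hardware for one large box (illustrative numbers only).
        cores := 500.0
        clockHz := 3e9 // ~3 GHz; instructions-per-cycle ignored for simplicity

        // Assumed load from the comment above.
        txPerSec := 250_000.0

        perCore := txPerSec / cores        // transactions each core must handle per second
        budgetSec := 1.0 / perCore         // wall-clock budget per transaction per core
        instrBudget := budgetSec * clockHz // CPU-instruction budget per transaction

        fmt.Printf("%.0f tx/s per core, %.1f ms budget, ~%.0fM instructions per tx\n",
            perCore, budgetSec*1000, instrBudget/1e6)
        // Prints roughly: 500 tx/s per core, 2.0 ms budget, ~6M instructions per tx
    }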

And that's for one of the very largest companies, like Uber... I can't even imagine a typical startup needing to scale horizontally for payment processing.



All comes down to uptime.

One box can’t be distributed across multiple racks in the data center to guard against downtime if a switch crashes. Never mind that—one box can’t be deployed across multiple data centers. If you deploy to multiple DCs you can fail over if one DC starts having issues.

Then there’s deploys. Do you canary your deploys? Deploy the next release to a subset of production nodes, watch for regressions and let it ramp up from there? Okay, I’ll give you that one, it could be done on one big box.

In any case, payments aren’t CPU intensive but it’s a prime case of hurry-up-and-wait. Lots of network IO, so while you won’t saturate the CPU with millions of transactions on the same box, I could easily imagine saturating a NIC. Deploying to shared infrastructure? Better hope none of your neighbors need that bandwidth too.

One transaction likely involves checking account and payment method status, writing audit logs, checking in with anti-fraud systems and a number of other business requirements.
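
To make the hurry-up-and-wait point concrete, a rough Go sketch; the service names and latencies are invented for illustration, not an actual call graph:

    package main

    import (
        "fmt"
        "time"
    )

    // callRemote stands in for an RPC/HTTP call to some other service or company.
    // The latencies are made up.
    func callRemote(name string, latency time.Duration) {
        time.Sleep(latency) // the CPU is idle here; we're just waiting on the network
        fmt.Printf("%-22s %v\n", name, latency)
    }

    func main() {
        start := time.Now()

        // A hypothetical fan-out for a single payment.
        callRemote("account status", 5*time.Millisecond)
        callRemote("payment method check", 10*time.Millisecond)
        callRemote("anti-fraud score", 30*time.Millisecond)
        callRemote("external processor", 200*time.Millisecond)
        callRemote("audit log write", 5*time.Millisecond)

        fmt.Printf("total wall clock: %v (almost all of it network wait)\n", time.Since(start))
    }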

(I lead a payments team, not at Uber but another major tech company)


> One box can’t be distributed across multiple racks in the data center to guard against downtime if a switch crashes. Never mind that—one box can’t be deployed across multiple data centers. If you deploy to multiple DCs you can fail over if one DC starts having issues.

Wouldn't you just have multiple NICs on one box for redundancy there? With any backup machines being sent the database write log for replication?

> In any case, payments aren’t CPU intensive but it’s a prime case of hurry-up-and-wait. Lots of network IO, so while you won’t saturate the CPU with millions of transactions on the same box, I could easily imagine saturating a NIC.

If you're vertically scaling, wouldn't you just have the main database server host the database files locally, using fast NVMe SSDs (or Optane), in the box itself, instead of going over the network?

Enterprise NVMe drives can perform 500,000-2,000,000 IOPS, with about 60µs latency. And Optane is about 4x faster. Why would a database server need to saturate network bandwidth?
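
Rough storage math as a Go sketch, assuming (a guess on my part) ten random IOs per payment and the NVMe figures above:

    package main

    import "fmt"

    func main() {
        // Assumptions for illustration only.
        txPerSec := 250_000.0    // the figure from upthread
        iosPerTx := 10.0         // guessed random IOs per payment transaction
        driveIOPS := 1_000_000.0 // mid-range of the 500k-2M enterprise NVMe figure

        needed := txPerSec * iosPerTx
        fmt.Printf("need ~%.1fM IOPS; that's ~%.1f NVMe drives\n",
            needed/1e6, needed/driveIOPS)
        // Prints roughly: need ~2.5M IOPS; that's ~2.5 NVMe drives
    }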

Anyways, I'd love to see the actual SQL query for one of their transactions...


> Wouldn't you just have multiple NICs on one box for redundancy there?

What happens when the FBI raids the DC to confiscate the servers of another person, and also takes yours? https://blog.pinboard.in/2011/06/faq_about_the_recent_fbi_ra...


I'm largely referring to RPC calls, not DB queries. Many of those calls won't even be to services you control and may well be HTTP calls to other companies.


> All comes down to uptime.

20 years ago we had 1000+ days uptime on DEC kit. No one was even impressed by 500 days. Nowadays people build all sorts of elaborate contraptions to do what used to be entirely ordinary.


By uptime, people usually mean availability to end users, not literal machine uptime. That includes the availability of the entire datacenter infrastructure, connectivity, and the internet infrastructure around it, which makes it pretty much impossible to have high availability in a single datacenter.


Heh, I guess. In my scenario the users actually got that uptime too, ‘cos they were connected over LAT...


Doesn't do much good if you have to fail out of an entire data center.


You can with VMScluster. There are multi-site clusters with 15+ years uptime.


At 250k transactions per second, that's over 20 billion transactions per day, which seems unlikely. 1k transactions per second (roughly 100 million per day) is probably a closer ballpark figure for a company like Uber, given they only have around 3 million drivers worldwide. The problem definitely would scale vertically; however, it does depend on how they interact with their payment processor's API.
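
Just multiplying the rates out:

    package main

    import "fmt"

    func main() {
        const secondsPerDay int64 = 86_400
        fmt.Println(250_000 * secondsPerDay) // 21600000000: over 20 billion per day
        fmt.Println(1_000 * secondsPerDay)   // 86400000: call it ~100 million per day
    }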


Maybe I am underestimating Uber, but when I think of enormous scale, in terms of number of transactions per day, I think of Amazon or Mastercard.

Uber? The upper bound of transactions is restricted to the total number of cab rides a day. That number isn't in the billions, is it?


> why can't one transaction be completed in less than 1 million CPU instructions

Transactions in this system are almost certainly network bound. The relevant CPU overhead is likely trivial, and likely comparison rather than arithmetic in nature. In that context, “add more NICs” is effectively an exercise in horizontal scaling. On top of that, any network operation has consistency concerns to contend with.

You could contrive a system that is block device I/O bound but it’s likely to have significant network overhead as most block devices are network attached these days anyway!


You're being downvoted - maybe because of your tone - but I partially share your sentiment. A good counterexample to "overdistribution", not without its own set of problems, is the LMAX architecture.

See: https://martinfowler.com/articles/lmax.html
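
For anyone unfamiliar: the core LMAX idea is a single business-logic thread fed events from an in-memory queue, with IO pushed onto other threads. A minimal Go sketch of that shape (not the actual Disruptor ring buffer; the types and numbers are made up):

    package main

    import "fmt"

    // PaymentEvent is a made-up input event; in LMAX terms, what the input
    // disruptor hands to the single-threaded business-logic processor.
    type PaymentEvent struct {
        Account string
        Amount  int64 // cents
    }

    func main() {
        events := make(chan PaymentEvent, 1024) // stand-in for the input ring buffer
        done := make(chan struct{})

        // All business state lives in this one goroutine, so no locks are needed
        // on the hot path; journaling, replication, and output would run on
        // other goroutines in a real system.
        balances := map[string]int64{"alice": 10_000, "bob": 500}
        go func() {
            for ev := range events {
                balances[ev.Account] -= ev.Amount // single writer: a plain map is fine
                fmt.Printf("charged %s %d, balance now %d\n",
                    ev.Account, ev.Amount, balances[ev.Account])
            }
            close(done)
        }()

        events <- PaymentEvent{Account: "alice", Amount: 2_500}
        events <- PaymentEvent{Account: "bob", Amount: 100}
        close(events)
        <-done
    }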


Because when your task is embarrassingly parallelizable, as payments are, you owe it to yourself to take advantage of that and ensure your part of the system isn't the bottleneck.


You're being downvoted because the article actually explains this quite well.


Maybe they use JavaScript instead of a real language?


It's mostly Go/Java.



