I'll never understand why a payment system needs to scale horizontally.
Modern computer systems can scale to 500 or more processor cores. Each core runs billions of instructions per second.
A system with a billion accounts, on the scale of Uber, probably has a million active users at any given time, with perhaps a quarter of those involved in a payment.
Is Uber saying that they can't support 250k payment transactions per second on the largest system today? That's 500 transactions per second per core on a 500-core machine, or a CPU budget of about 2ms per transaction. Why is that impossible for them?
Or, put it another way, why can't one transaction be completed in less than 1 million CPU instructions?
And that's for one of the very largest companies, like Uber. I can't even imagine a typical startup needing to scale horizontally for payment processing.
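The arithmetic above can be sanity-checked in a few lines. Every input here is an assumption from this comment (core count, clock speed, user figures), not real Uber data:

```python
# Sanity-check of the back-of-envelope figures above.
cores = 500                              # "500 or more processor cores"
instr_per_sec_per_core = 3_000_000_000   # ~3 GHz, ~1 instruction/cycle (rough)
active_users = 1_000_000
paying_fraction = 0.25

tps = int(active_users * paying_fraction)        # 250,000 payments/sec
tps_per_core = tps // cores                      # 500 per core
cpu_budget_s = 1 / tps_per_core                  # 2 ms of CPU time each
instr_budget = int(instr_per_sec_per_core * cpu_budget_s)  # 6,000,000

print(tps, tps_per_core, cpu_budget_s, instr_budget)
```

So even at these made-up rates, each transaction gets a budget of millions of instructions, which is the point of the question below.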
One box can’t be distributed across multiple racks in the data center to guard against downtime if a switch crashes. Never mind that—one box can’t be deployed across multiple data centers. If you deploy to multiple DCs you can fail over if one DC starts having issues.
Then there’s deploys. Do you canary your deploys? Deploy the next release to a subset of production nodes, watch for regressions and let it ramp up from there? Okay, I’ll give you that one, it could be done on one big box.
In any case, payments aren’t CPU intensive but it’s a prime case of hurry-up-and-wait. Lots of network IO, so while you won’t saturate the CPU with millions of transactions on the same box, I could easily imagine saturating a NIC. Deploying to shared infrastructure? Better hope none of your neighbors need that bandwidth too.
One transaction likely involves checking account and payment method status, writing audit logs, checking in with anti-fraud systems and a number of other business requirements.
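A hypothetical sketch of that hurry-up-and-wait shape (service names and latencies are invented for illustration). The independent checks overlap, so wall-clock time tracks the slowest call rather than the sum, and the CPU sits mostly idle:

```python
import asyncio

async def fake_rpc(name: str, latency_s: float) -> str:
    # Stand-in for a network round trip: the CPU does nothing while waiting.
    await asyncio.sleep(latency_s)
    return f"{name}: ok"

async def process_payment() -> list[str]:
    # The checks are independent, so issue them concurrently; the event
    # loop just waits on IO, which is why the CPU stays mostly idle.
    return await asyncio.gather(
        fake_rpc("account-status", 0.02),
        fake_rpc("payment-method-status", 0.02),
        fake_rpc("anti-fraud", 0.05),
        fake_rpc("audit-log", 0.01),
    )

results = asyncio.run(process_payment())
print(results)
```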
(I lead a payments team, not at Uber but another major tech company)
> One box can’t be distributed across multiple racks in the data center to guard against downtime if a switch crashes. Never mind that—one box can’t be deployed across multiple data centers. If you deploy to multiple DCs you can fail over if one DC starts having issues.
Wouldn't you just have multiple NICs on the one box for redundancy there? With any backup boxes being sent the database write log for replication?
> In any case, payments aren’t CPU intensive but it’s a prime case of hurry-up-and-wait. Lots of network IO, so while you won’t saturate the CPU with millions of transactions on the same box, I could easily imagine saturating a NIC.
If you're vertically scaling, wouldn't you just have the main database server host the database files locally, using fast NVMe SSDs (or Optane), in the box itself, instead of going over the network?
Enterprise NVMe drives can perform 500,000-2,000,000 IOPS, with about 60µs of latency. And Optane is about 4x faster. Why would a database server need to saturate network bandwidth?
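As a rough bound on what that means for transaction throughput (the IOs-per-transaction figure is made up, not from any real payment system):

```python
# Enterprise NVMe range quoted above, low and high end.
iops_low, iops_high = 500_000, 2_000_000
ios_per_txn = 10   # assumed reads + writes per payment transaction

tps_low = iops_low // ios_per_txn    # 50,000 txn/sec on one drive
tps_high = iops_high // ios_per_txn  # 200,000 txn/sec
print(tps_low, tps_high)
```

Even the low end of that range covers tens of thousands of transactions per second on a single local drive.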
Anyways, I'd love to see the actual SQL query for one of their transactions...
I'm largely referring to RPC calls, not DB queries. Many of those calls won't even be to services you control and may well be HTTP calls to other companies.
20 years ago we had 1000+ days of uptime on DEC kit; no one was even impressed by 500 days. Nowadays people build all sorts of elaborate contraptions to do what used to be entirely ordinary.
By uptime people usually mean availability to the end users, not the literal uptime of a box. That also includes the availability of the entire datacenter infrastructure, connectivity, and internet infrastructure, making it pretty much impossible to have high availability in a single datacenter.
Even at 250k transactions per second, that's over 20 billion transactions per day, which seems unlikely. 1k transactions per second (~100 million per day) is probably a closer ballpark figure for a company like Uber, given they only have around 3 million drivers worldwide. The problem definitely would scale vertically; however, it does depend on how they interact with the API of their payment processor.
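Checking those per-day figures (86,400 seconds in a day):

```python
seconds_per_day = 24 * 60 * 60             # 86,400

daily_at_250k = 250_000 * seconds_per_day  # 21.6 billion/day, implausibly high
daily_at_1k = 1_000 * seconds_per_day      # 86.4 million/day, the ~100M ballpark
print(daily_at_250k, daily_at_1k)
```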
> why can't one transaction be completed in less than 1 million CPU instructions
Transactions in this system are almost certainly network bound. The relevant CPU overhead is likely trivial, and likely comparison rather than arithmetic in nature. In that context “add more NICs” is effectively an exercise in horizontal scaling. On top of that, any network operation has consistency concerns to contend with.
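One way to quantify the network-bound pressure is Little's law (L = λ·W): the number of requests in flight equals throughput times latency. Using the hypothetical 250k TPS from upthread and an assumed 200ms mean external-call latency (both made-up figures):

```python
tps = 250_000        # hypothetical arrival rate from upthread
latency_s = 0.200    # assumed mean latency of an external network call

# Little's law: L = lambda * W
in_flight = tps * latency_s
print(in_flight)     # 50,000 transactions open at any instant
```

Even with ample CPU headroom, the box would hold tens of thousands of open connections at once, which is where the NIC and consistency problems live.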
You could contrive a system that is block device I/O bound but it’s likely to have significant network overhead as most block devices are network attached these days anyway!
You're being downvoted - maybe because of your tone - but I partially share your sentiment. A good counterexample to "overdistribution", not without its own set of problems, is the LMAX architecture.
Because when your task is embarrassingly parallelizable, as payments are, you owe it to yourself to take advantage of that and ensure your part of the system isn't the bottleneck.