Was on a call with a bank VP who had moved to AWS. Asked how it was going. Said it was going great after six months, but they were just learning about availability zones, so they were going to have to rework a bunch of things.
Astonishing how much of our important infrastructure is being moved to AWS with zero knowledge of how AWS works.
Most startups I've worked at literally have a script to deploy their whole setup to a new region when desired. Then you just need latency-based routing running on top of it to ensure users are served from the region closest to them. Really not expensive: you can do this for under $200/month in added cost, and the bandwidth + database costs are going to be roughly the same as they normally are because you're splitting your load between regions. Now, if you stupidly just duplicate your current infrastructure entirely, yes, it would be expensive, because you'd be massively overpaying on the DB.
In theory the only additional cost should be the latency-based routing itself, which is $50/month. Other than that, you'll probably save money if you choose the right regions.
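For a sense of what that routing layer looks like, here is a minimal boto3 sketch of latency-based records in Route 53 (the hosted zone ID, domain, and load balancer DNS names are hypothetical placeholders):

    import boto3

    route53 = boto3.client("route53")

    # One latency record per region; Route 53 answers each DNS query with
    # the record whose region has the lowest latency to the caller.
    for region, lb_dns in [
        ("us-east-1", "lb-east-123.us-east-1.elb.amazonaws.com"),
        ("eu-west-1", "lb-west-456.eu-west-1.elb.amazonaws.com"),
    ]:
        route53.change_resource_record_sets(
            HostedZoneId="Z123EXAMPLE",  # hypothetical hosted zone
            ChangeBatch={"Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": region,  # one record per region
                    "Region": region,         # enables latency-based routing
                    "TTL": 60,
                    "ResourceRecords": [{"Value": lb_dns}],
                },
            }]},
        )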
Are the same instance sizes available in all regions?
Are there enough instances of the sizes you need?
Do you have reserved instances in the other region?
Are your increased quotas applied to all regions?
What region are your S3 assets in? Are you going to migrate those as well?
Is it acceptable for all user sessions to be terminated?
Have you load tested the other region?
How often are you going to test the region failover? Yearly? Quarterly? With every code change?
What RTO and RPO have executives and board members agreed are acceptable?
And all of that is without thinking about cache warming, database migration/mirroring/replication, and Solr indexing (are you going to migrate the index or rebuild it? Do you know how long it takes to rebuild your Solr index?).
The startups you worked at probably had different needs than Roblox. I was the tech lead on a Rails app that was embedded in TurboTax and QuickBooks and was rendered on each TT screen transition, and reading your comment in that context shows a lot of inexperience with large production systems.
A lot of this can also be mitigated by going all-in on API Gateway + Lambda, like we have at Arist. We only need to worry about DB scaling and a few considerations with S3 (which are themselves mitigated by using CloudFront).
Are you implying that Roblox should move their entire system to API Gateway + Lambda to solve their availability problems?
Seriously though, what is your RTO and RPO? We are talking about systems where, when they are down, you are on the news. Systems where minutes of downtime are millions of dollars. I encourage you to set up some time with your CTO at Arist and talk through these questions.
1. When a company of Roblox's size is still in single-region mode by the time they've gone public, that is quite a red flag. As you and others have mentioned, game servers have some unique requirements not shared by traditional web apps (everyone knows this), but Roblox's constraints seem to be self-imposed and ridiculous considering their size. It is quite obvious they have very fragile and highly manual infrastructure, which is dangerous after series A, never mind after going public! At this point their entire infrastructure should be completely templated and scripted, to the point where if all their cloud accounts were deleted they could be up and running within an hour or two. Having 18,000 servers or 5 servers doesn't make much of a difference -- either you're confident you can replicate your infrastructure because you've put in the work to make it completely reproducible and automated, or you haven't. Orgs that have taken these steps have no problem deploying additional regions because they have tackled all of those problems (DB read replicas, latency-based routing, consistency, etc.) and the solutions are baked into their infrastructure scripts and templates. The fact that there exists a publicly traded company in the tech space that hasn't done this shocks me a bit, and rightly so.
2. I mentioned API Gateway and Lambda because OP asked whether it is difficult in general to go multi-region (not specifically about Roblox). Most startups, and most companies in general, do not have Roblox's technical requirements around managing game state (they are web-app based), so in general a set of load balancers + latency-based routing, or API Gateway + Lambda + latency-based routing, is a good approach for most companies, especially now with a la carte solutions like Ruby on Jets, the Serverless Framework, etc. that will do all the work for you.
3. That said, I do think we are on the verge of seeing a really strong, viable serverless-style option for game servers in the next few years, and when that happens costs are going to go way, way down, because the execution context will live for the life of the game and that's it. No need to over-provision. The only real technical limitations are the hard 15-minute execution time limit and mapping users to the correct running instance of the Lambda. I have a side project where I'm working on resolving the first issue; I've already resolved the second by having the Lambda initiate the connection to the clients directly, ensuring they are all communicating with the same instance. The first problem I plan to solve by pre-emptively spinning up a new Lambda when time is about to run out and pre-negotiating all clients with the new Lambda before shifting control over to it (rough sketch below). It's not done yet, but I believe I can do this with zero noticeable lag or stuttering during the switch-over. So from a technical perspective, yes, I think serverless can be a panacea if you put in the effort to fully utilize it. If you're at the point where you're spinning up tens of thousands of servers doing something ephemeral that only needs to exist for 5-30 minutes, you're at the point where it's time to put in that effort.
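Here is the rough shape of that handoff in Python, purely as a sketch; the tick logic, the one-minute threshold, and the state format are all stand-ins:

    import json
    import boto3

    lam = boto3.client("lambda")
    HANDOFF_MS = 60_000  # start handing off with ~1 minute left (assumed threshold)

    def tick(state):
        # Stand-in for one step of the real game-state update.
        state["frames"] = state.get("frames", 0) + 1
        return state

    def handler(event, context):
        state = event.get("state", {})  # resume from the predecessor's snapshot
        while True:
            state = tick(state)
            if context.get_remaining_time_in_millis() < HANDOFF_MS:
                # Pre-spawn the successor before hitting the 15-minute cap,
                # handing it the full game snapshot. The new instance would
                # then contact the clients so they renegotiate before control
                # shifts over.
                lam.invoke(
                    FunctionName=context.function_name,
                    InvocationType="Event",  # async fire-and-forget
                    Payload=json.dumps({"state": state}),
                )
                return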
4. I am in fact the CTO at Arist. You shouldn't assume people don't know what they're talking about just because they find the status quo of devops at [insert large gaming company here] a little bit antiquated. In particular, I think you're fighting a losing battle if you have to even think about what instance type is cheapest for X workload in Y year. That sounds like work that I'd rather engineer around with a solution that can handle any scale and do so as cheaply as possible even if I stop watching it for 6 months. You may say it's crazy, but an approach like this will completely eat your lunch if someone ever gets it working properly and suddenly can manage a Roblox-sized workload of game states without a devops team. Why settle for anything less?
5. Regarding the systems I work with -- we send ~50 million messages a day (at specific times per day, mostly all at once) and handle ~20 million user responses a day on behalf of more than 15% of the current roster of Fortune 500 companies. In that case, going 100% Lambda works great and scales well, for obvious reasons. This is nowhere near the scale Roblox deals with, but they also have a completely different problem (managing game state) than we do (ensuring arbitrarily large or small numbers of messages go out at exactly the right time based on tens of thousands of complex messaging schedules and course cadences).
Anyway, I'm quite aware devops at scale is hard -- I just find it puzzling when small orgs have it perfectly figured out (plenty of gaming startups with multi-region support) but a company on the NYSE is still treating us-east-1 or us-east-2 like the only region in existence. Bad look.
Also, you still sound like you don't understand how large systems like Roblox/Twitter/Apple/Facebook/etc. are designed, deployed, and maintained, which is fine; most people don't. But saying they should just move to Lambda shows inexperience with these systems. If it is "puzzling" to you, maybe there is something you are missing in your understanding of how these systems work.
Correctly handling failure edge cases in an active-active multi-region distributed database requires work. SaaS DBs do a lot of the heavy lifting, but they are still highly configurable and you need to understand the impact of the config you use. Not to mention your scale-up runbooks need to be established so a stampede from a failure in one region doesn't take the other region down. You also need to avoid cross-region traffic even though you might have stateful services that aren't replicated across regions. That might mean changes in config or business logic across all your services.
It is absolutely not as simple as spinning up a cluster on AWS at Roblox's scale.
Roblox is not a startup, and has a significantly sized footprint (18,000 servers isn't something that's just available, even within clouds; they're not magically scalable places, and capacity tends to land just ahead of demand). It's not even remotely a simple case of "run a script and whee, we have redundancy." There are lots of things to consider.
18k servers is also not cheap, at all. They suggest at least some of their clusters are running on 64 cores, some on 128. I'm guessing they probably have a fair spread of cores.
Just to give a sense of cost, AWS's calculator estimates 18,000 32-core instances would set you back $9m per month. That's just the EC2 cost, and assumes a lower core count is used by other components in the platform. 64-core would bump that to $18m. Per month. Doing nothing but sitting, waiting, ready. That's not considering network bandwidth costs, load balancers, etc.
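A quick back-of-the-envelope check of that figure, assuming roughly $0.70/hour on-demand for a 32-core instance (the rate is an assumption; actual pricing varies by instance family and region):

    instances = 18_000
    hourly_rate = 0.70       # assumed on-demand rate for a 32-core instance
    hours_per_month = 730
    # ~9.2 million dollars per month, before bandwidth and load balancers
    print(instances * hourly_rate * hours_per_month)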
When you're talking infrastructure on that scale, you have to contact cloud companies in advance and work with them around capacity requirements, or you'll find you're barely started on provisioning when you hit the limits of available capacity (you'll want to do this at that scale anyway because you'll get discounts, but it's still going to be very expensive).
This was in reply to OP who said deploying to a new region is insanely complicated. In general it is not. For Roblox, if they are manually doing stuff in EC2, it could be quite complicated.
So Roblox need a button to press to (re)deploy 18,000 servers and 170,000 containers? They already have multiple core data centres, as well as many edge locations.
You will note the problem was with the software provided and supported by HashiCorp.
> It's also a lot more expensive. Probably order of magnitude more expensive than the cost of a 1 day outage
Not sure I agree. Yes, network costs are higher, but your overall costs may not be, depending on how you architect. Independent copies of your services in each AZ? Sure, you'll have multiples of your current costs. Deploying your clusters spanning AZs? Not that much more - you'll pay for cross-AZ traffic, though.
The usual way this works (and I assume this is the case for Roblox) is not by constructing buildings, but by renting space in someone else's datacentre.
Pretty much every city worldwide has at least one place providing power, cooling, racks and (optionally) network. You rent space for one or more servers, or you rent racks, or parts of a floor, or whole floors. You buy your own servers, and either install them yourself, or pay the datacentre staff to install them.
Yes. If you are running in two zones in the hope that you will stay up if one goes down, you need to be handling less than 50% load in each zone (see the sketch below). If you can scale up fast enough for your use case, great. But when a zone goes down and everyone is trying to launch in the zone still up, there may not be instances available for you at that time. Our site did a billion in revenue or something on a single day, so for us it was worth the cost, but it was not easy (or at least it wasn't at the time).
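A tiny sketch of that headroom rule: with N zones, each zone has to run below the load the surviving zones can absorb.

    def max_safe_utilization(zones: int) -> float:
        # To survive one zone failing, steady-state load per zone must stay
        # below the fraction the remaining zones can pick up.
        return (zones - 1) / zones

    print(max_safe_utilization(2))  # 0.5   -> each zone must run under 50% load
    print(max_safe_utilization(3))  # ~0.67 -> under ~67% load with three zones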
How expensive? Remember that the Roblox Corporation does about a billion dollars in revenue per year and takes about 50% of all revenue developers generate on their platform.
Right, outages get more expensive the larger you grow. What also needs to be thought of is not just the loss of revenue while your service is down but also its effect on user trust and usability. Customers will gladly leave you for a more reliable competitor once they get fed up.
There are definitely cost and other considerations you have to think about when going multi-AZ.
Cross-AZ network traffic has charges associated with it. Inter-AZ network latency is higher than intra-AZ latency. And there are other limitations as well, such as EBS volumes being attachable only to an instance in the same AZ as the volume.
That said, AWS does recommend using multiple Availability Zones to improve overall availability and reduce Mean Time to Recovery (MTTR).
(I work for AWS. Opinions are my own and not necessarily those of my employer.)
This is very true; the costs and performance impacts can be significant if your architecture isn't designed to account for it. And sometimes even if it is.
In addition, unless you can cleanly survive an AZ going down (which can take a bunch more work in some cases), being multi-AZ can actually reduce your availability by giving you more things that can fail.
AZs are a powerful tool, but they are not a no-brainer for applications at scale that were not designed for them: multi-AZ is literally spreading your workload across multiple nearby data centers, just with a bit (or a lot) more tooling and services to help than if you were doing it in your own data centers.
Data Transfer within the same AWS Region
Data transferred "in" to and "out" from Amazon EC2, Amazon RDS, Amazon Redshift, Amazon DynamoDB Accelerator (DAX), and Amazon ElastiCache instances, Elastic Network Interfaces or VPC Peering connections across Availability Zones in the same AWS Region is charged at $0.01/GB in each direction.
Wrong. Depending on the use case, AWS can be very cheap.
> splitting amongst AZ's is of no additional cost.
Wrong.
"...across Availability Zones in the same AWS Region is charged at $0.01/GB in each direction. Effectively, cross-AZ data transfer in AWS costs 2¢ per gigabyte, and each gigabyte transferred counts as 2GB on the bill: once for sending and once for receiving."
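Worked out for a hypothetical 100 TB/month of cross-AZ chatter:

    # $0.01/GB in each direction = $0.02 for every GB that crosses an AZ boundary
    tb_per_month = 100            # assumed cross-AZ volume
    gb_per_month = tb_per_month * 1_000
    print(gb_per_month * 0.02)    # $2,000/month just for cross-AZ transfer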
Availability Zones aren't the same thing as regions. AWS regions have multiple Availability Zones. Individual availability zones publish lower reliability SLAs, so you need to load balance across multiple independent availability zones in a region to reach higher reliability. Per-AZ SLAs are discussed in more detail here [1]
(N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)
> (N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)
What he said was perfectly cogent.
Outages in us-east-1 AZ us-east-1a have caused outages in us-west-1a, which is a different region and a different AZ.
Or, to put it in the terms of reliability engineering: even though these are abstracted as independent systems, in reality there are common-mode failures that can cause outages to propagate.
So, if you span multiple availability zones, you are not spared from events that will impact all of them.
> Or, to put it in the terms of reliability engineering: even though these are abstracted as independent systems, in reality there are common-mode failures that can cause outages to propagate.
It's up to the _user_ of AWS to design around this level of reliability. This isn't any different than not using AWS. I can run my web business on the super cheap by running it out of my house. Of course, then my site's availability is based around the uptime of my residential internet connection, my residential power, my own ability to keep my server plugged into power, and general reliability of my server's components. I can try to make things more reliable by putting it into a DC, but if a backhoe takes out the fiber to that DC, then the DC will become unavailable.
It's up to the _user_ to architect their services to be reliable. AWS isn't magic reliability sauce you sprinkle on your web apps to make them stay up longer. AWS clearly states on their SLA pages what their EC2 instance SLAs are in a given AZ: it's 99.5% availability for a given EC2 instance in a given region and AZ. This is roughly ~1.82 days, or ~43.8 hours, of downtime in a year. If you add a SPOF around a single EC2 instance in a given AZ, then your system has a 99.5% availability SLA. Remember, the cloud is all about leveraging large amounts of commodity hardware instead of large, high-reliability, mainframe-style design. This isn't a secret. It's openly called out, e.g. in Nishtala et al.'s "Scaling Memcache at Facebook" [1] from 2013!
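Those downtime numbers fall straight out of the SLA percentage; a one-liner to check them:

    def downtime_hours_per_year(sla_percent: float) -> float:
        # Hours per year the SLA still permits the service to be down.
        return (1 - sla_percent / 100) * 365 * 24

    print(downtime_hours_per_year(99.5))   # 43.8 hours (~1.82 days)
    print(downtime_hours_per_year(99.99))  # ~0.88 hours (~53 minutes)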
The background of all of this is that it costs money, in terms of knowledgeable engineers (not like the kinds in this comment thread who are conflating availability zones and regions) who understand these issues. Most companies don't care; they're okay with being down for a couple of days a year. But if you want to design high-reliability architectures, there are plenty of senior engineers willing to help, _if_ you're willing to pay their salaries.
If you want to come up with a lower cognitive overhead cloud solution for high reliability services that's economical for companies, be my guest. I think we'd all welcome innovation in this space.
During a recent AWS outage, the STS service running in us-east-1 was unavailable. Unfortunately, all of the other _regions_ - not AZs, but _regions_, rely on the STS service in us-east-1, which meant that customers which had built around Amazon’s published reliability model had services in every region impacted by an outage in one specific availability zone.
This is what kreeben was referring to - not some abstract misconception about the difference between AZs and Regions, but an actual real world incident in which a failure in one AZ had an impact in other Regions.
For high availability, STS offers regional endpoints -- and AWS recommends using them[1] -- but the SDKs don't use them by default. The author of the client code, or the person configuring the software, has to enable them.
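With boto3, the opt-in looks roughly like this (the region here is a placeholder; the same thing can be done with the AWS_STS_REGIONAL_ENDPOINTS=regional environment variable or sts_regional_endpoints = regional in ~/.aws/config):

    import boto3

    # Pin STS to a regional endpoint instead of the default global
    # endpoint, which resolves to us-east-1.
    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        endpoint_url="https://sts.us-west-2.amazonaws.com",
    )
    print(sts.get_caller_identity()["Account"])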
The client code which defaults to STS in us-east-1 includes the AWS console website, as far as I can tell.
Real question, though: are those genuinely separate endpoints that remained up and operational during the outage? I don't think I saw or knew a single person unaffected by this outage, so either there's still some bleed-over on the backend, or knowledge of the regional STS endpoints is basically zero (which I can believe; y'all run a big shop).
My team didn't use STS but I know other teams at the company did. Those that relied on non-us-east-1 endpoints did stay up, IIRC. Our company barely uses the AWS console at all and bases most of our stuff around their APIs, hooked into our deployment/CI processes. But I don't work at AWS, so I don't know if that's true or if there was some other backend replication lag or anything else going on that was impacted by us-east-1 being down. We had some failures for some of our older services that were not properly sharded out, but most of our stuff failed over and continued to work as expected.
> Unfortunately, all of the other _regions_ - not AZs, but _regions_, rely on the STS service in us-east-1, which meant that customers which had built around Amazon’s published reliability model had services in every region impacted by an outage in one specific availability zone.
That's not true. STS offers regional endpoints, for example if you're in Australia and don't want to pay the latency cost to transit to us-east-1 [1]. It's up to the user to opt into them though. And that goes back to what I was saying earlier, you need engineers willing to read their docs closely and architect systems properly.
> knowledgeable engineers (not like the kinds in this comment thread who are conflating availability zones and regions)
I think this breaks the site guidelines. Worse, I don't think the other people are wrong: being in a different region implies being in a different availability zone.
That is, I've read the comments to say "they're not only in different AZs, they're in different regions". You seem determined to pick a reading that lets you feel smugly superior about your level of knowledge, and then cast out digs at other people based on that presumed superiority.
> Worse, I don't think the other people are wrong: being in a different region implies being in a different availability zone.
Availability zones do not map across regions. AZs are specific to a region. Different regions have differing numbers of AZs. us-east-1 has 3. IIRC ap-southeast-1 has 2.
> That is, I've read the comments to say "they're not only in different AZ's, they're in different regions"
So I've read. The earlier example about STS that someone brought up was incorrect; both I and another commenter linked to the doc with the correct information.
> You seem determined to pick a reading that lets you feel smugly superior about your level of knowledge, and then cast out digs at other people based on that presumed superiority.
You obviously feel very strongly about this. You've replied to my parent twice now. You're right that the parenthetical was harsh but I wouldn't say it's uncalled for.
Every one of these outage threads descends into a slew of easily defensible complaints about cloud providers. The quality of these discussions is terrible. I spend a lot of time at my dayjob (and as a hobby) working on networking related things. Understanding the subtle guarantees offered by AWS is a large part of my day-to-day. When I see people here make easily falsifiable comments full of hearsay ("I had a friend of a friend who works at Amazon and they did X, Y, Z bad things") and use that to drum up a frenzy, it flies in the face of what I do every day. There are lots of issues with cloud providers as a whole and AWS in particular, but to get to that level of conversation you need to understand what the system is actually doing, not just get angry and guess at why it's failing.
> > being in a different region implies being in a different availability zone.
> Availability zones do not map across regions. AZs are specific to a region. Different regions have differing numbers of AZs. us-east-1 has 3. IIRC ap-southeast-1 has 2.
Right.. So if you are in a different region, you are by definition in a different availability zone.
> You obviously feel very strongly about this. You've replied to my parent twice now. You're right that the parenthetical was harsh but I wouldn't say it's uncalled for.
Yah, I really thought about it and you're just reeking of unkindness. And the people above that you're replying to and mocking are not wrong.
> Every one of these outage threads descends into a slew of easily defensible complaints about cloud providers. The quality of these discussions is terrible. I spend a lot of time at my dayjob (and as a hobby) working on networking related things. Understanding the subtle guarantees offered by AWS is a large part of my day-to-day.
If you're unable to be civil about this, maybe you should avoid the threads. Amazon seeks to avoid common-mode failures between AZs (and thus regions). This doesn't mean that Amazon attains this goal. And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.
I've got >20 years of experience in building geographically distributed, sharded, and consensus-based systems. I think you are being unfair to the people you're discussing with. Be nice.
> Amazon seeks to avoid common-mode failures between AZs (and thus regions).
there is a distinction between azs within a region vs azs in different regions. the overwhelming majority of services are offered regionally and provide slas at that level. services are expected to have entirely independent infrastructure for each region, and cross-regional/global services are built to scope down online cross regional dependencies as much as possible.
the specific example brought up (cross regional sts) is wrong in the sense that sts is fully regionalized as evidenced by the overwhelming number of aws services that leverage sts not having a global meltdown. but as others mentioned in a lot of ways it’s even worse because customers are opted into the centralized endpoint implicitly.
> If you're unable to be civil about this, maybe you should avoid the threads.
I didn't read my tone as uncivil, just harsh. I guess it came across harsher than intended. I'll try to cool it a bit more next time, but I have to say it's not like the rest of HN is taking this advice to heart when they're criticizing AWS. I realize that this isn't a defense (whataboutism), but I guess it's fine to "speak truth to power" or something? Anyway, point noted, and I'll try to keep my snark down.
> Amazon seeks to avoid common-mode failures between AZs (and thus regions). This doesn't mean that Amazon attains this goal. And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.
Right, so which common mode failures are occurring here? What I'm seeing in this thread and previous AWS threads is a lot of hearsay. Stuff like "the AWS console isn't loading" or "I don't have that problem on Linode!" or "the McDonalds app isn't working so everything is broken thanks to AWS!" I'd love to see a postmortem document, like this, actually uncover one of these common mode failures. Not because I doubt they exist (any real system has bugs and I have no doubt a real distributed system has real limitations); I just haven't seen it borne in real world experience at my current company and other companies I've worked at which used AWS pretty heavily.
While I don't work at AWS, my company also publishes an SLA and we refund our customers when we dip below that SLA. When an outage, SLA-impacting or not, occurs, we spend a _lot_ of time getting to the bottom of what happened and documenting what went wrong. Frequently it's multiple things that go wrong which cause a sort of cascading failure that we didn't catch or couldn't reproduce in chaos testing. Part of the process of architecting solutions for high scale (~ billions/trillions of weekly requests) is to work through the AWS docs and make sure we select the right architecture to get the guarantees we seek. I'd like to see evidence of common-mode failures and the defensive guarantees that failed in order show proof of them, or proof positive through a dashboard or something, before I'm willing to malign AWS so easily.
> And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.
Sure if you're not operating high reliability services at high scale, it's true, you don't need cross-AZ or cross-region failover. But if you chose, through balance sheet or ignorance, not to take advantage of AWS's reliability features then you shouldn't get to complain that AWS is unreliable. Their guarantees are written on their SLA pages.
> I realize that this isn't a defense (whataboutism), but I guess it's fine to "speak truth to power" or something?
... I still don't think your overall starting assertion, that the other people don't understand regions vs. AZs, is correct, and it triggered you to repeatedly assert that the people you were talking to are unskilled.
I could very easily use the same words as them, and I have decade-old spreadsheets where I was playing with different combinations of latencies for commits and correlation coefficients for failures to try and estimate availability.
> Right, so which common mode failures are occurring here? What I'm seeing in this thread and previous AWS threads is a lot of hearsay. Stuff like "the AWS console isn't loading" or "I don't have that problem on Linode!" or "the McDonalds app isn't working so everything is broken thanks to AWS!" I'd love to see a postmortem document, like this, actually uncover one of these common mode failures. Not because I doubt they exist (any real system has bugs and I have no doubt a real distributed system has real limitations); I just haven't seen it borne in real world experience at my current company and other companies I've worked at which used AWS pretty heavily.
I remember 2011, when EBS broke across all US-EAST AZs, lots of control plane services were impacted, and you couldn't launch instances across all AZs in all regions for 12 hours.
Now maybe you'll be like "pfft, a decade ago!". I do think Amazon has significantly improved its architecture. At the same time, AZs and regions being engineered to be independent doesn't mean they really are. We don't attain independent, uncorrelated failures on passenger aircraft, let alone in these more complicated, larger, and less-engineered systems.
Further, even if AWS gets it right, going multi-AZ introduces new failure modes. Depending on the complexity of data model and operations on it, this stuff can be really hard to get right. Building a geographically distributed system with current tools is very expensive and there's no guarantee that your actual operational experience will be better than in a single site for quite some time of climbing the maturity curve.
> Their guarantees are written on their SLA pages.
Yup, and it's interesting to note that their thresholds don't really assume independence of failures. E.g. .995/.990/.95 are the thresholds for instances and .999/.990/.950 are the uptime thresholds for regions.
If Amazon's internal costing/reliability engineering model assumed failures would be independent, they could offer much better SLAs for regions safely (e.g., back of the envelope: 1 - C(3,2) * .005^2 =~ .999925). Instead, they imply that they expect multi-AZ to have a failure distribution that's about 5x better for short outages and about the same for long outages.
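That back-of-the-envelope number checks out if you assume AZ failures are independent:

    from math import comb

    p_az_down = 1 - 0.995                    # per-AZ failure probability from the SLA
    p_pair_down = comb(3, 2) * p_az_down**2  # any two of three AZs down together
    print(1 - p_pair_down)                   # 0.999925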
And note there's really no SLA asserting independence of regions... You just have the instance level and region level guarantees.
Further, note that the SLA very clearly excludes some causes of multi-AZ failures within a region. Force majeure, and regional internet access issues beyond the "demarcation point" of the service.
Yes, but the underlying point you're willfully missing is:
You can't engineer around AWS AZ common-mode failures using AWS.
The moment that you have failures that are not independent and common mode, you can't just multiply together failure probabilities to know your outage times.
Yup, so true. People think redundant == 100% uptime, or that when they advertise 99.9% uptime, it's the same thing as 100% minus a tiny bit for "glitches".
It's not. .1% of 365*24 = 87.6 hours of downtime - that's over 3 days of complete downtime every year!
They only refund 100% when availability falls below 95%! Between 95% and 99%, the credit is only 30%. I believe the real target is above 99.9% though, as that results in zero refund to the customer. What that means is: 3 days of downtime is acceptable!
Alternatively, you can return to your own datacenter and find out firsthand that it's not as easy to deliver that as you may think. You too will have power outages, network provider disruptions, and the occasional "oh shit, did someone just kick that power cord out?" or complete disk array meltdown.
Anywho, they have a lot more room in their published SLAs than you think.
Edit: as someone correctly pointed out, I made a typo in my math; it is only ~9 hours of allotted downtime. Keep in mind that this is per service, though, meaning each service can have a different 9 hours of downtime before they need to pay out 10% on that one service. I still stand by my statement that their SLAs have a lot of wiggle room that people should take more seriously.
Your computation is incorrect; 3 days out of 365 is 1% downtime, not 0.1%. I believe your error stems from treating .1% as .01 rather than .001. Indeed:
0.001 (.1%) * 8760 (365d*24h) = 8.76h
Alternatively, the common industry standard in infrastructure (at the place I work, at least) is four nines, i.e. 99.99% availability, which is around 52 minutes a year or ~4 minutes a month IIRC. There's not as much room as you'd think! :)
> Yup, so true. People think redundant == 100% uptime, or that when they advertise 99.9% uptime, it's the same thing as 100% minus a tiny bit for "glitches".
Maybe this is the problem. 99.9% isn't being used by AWS the way people use it in conversation; it has a definite meaning, and they'll refund you based on that definite meaning.
>> you need to load balance across multiple independent availability zones
The only problem with that is, there are no independent availability zones.
What we do have, though, is an architecture where errors propagate cross-zone until they can't propagate any further, because services can't take any more requests, because they froze, because they weren't designed for a split-brain scenario - and then half the internet goes down.
> The only problem with that is, there are no independent availability zones.
There are - they can be as independent as you need them to be.
Errors won't necessarily propagate cross-zone. If they do, someone either screwed up, or they made a trade-off. Screwing up is easy, so you need to do chaos testing to make sure your system will survive as intended.
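A crude AZ-failure drill, as a sketch (it assumes instances opt in via a hypothetical chaos=optin tag; don't point this at anything you care about):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Find every opted-in instance in one AZ and terminate it, then watch
    # whether the service keeps serving from the surviving zones.
    resp = ec2.describe_instances(Filters=[
        {"Name": "availability-zone", "Values": ["us-east-1a"]},
        {"Name": "tag:chaos", "Values": ["optin"]},
    ])
    ids = [i["InstanceId"]
           for r in resp["Reservations"]
           for i in r["Instances"]]
    if ids:
        ec2.terminate_instances(InstanceIds=ids)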
I'm not talking about my global app. I'm talking about the system I deploy to, the actual plumbing, and how a huge turd in a western toilet causes the east's sewerage system to overflow.
That's not how they work.
They exist, and work extremely well within their defined engineering / design goals. It's much more nuanced than 'everything works independently'.
Wouldn't it be possible to create fully independent zones with multiple cloud providers, like AWS, GCP, Azure? This is assuming that your workloads don't rely on proprietary services from a given provider.
Yes, and would also protect you from administrative outages like, "AWS shut off our account because we missed the email about our credit card expiring."
(But wouldn't protect you from software/configuration issues if you're running the same stack in every zone.)
There have been multiple discussions on HN about cloud vs. not-cloud, and there are endless opinions of the "cloud is a waste, blah blah" sort.
This is exactly one of the reasons people go cloud. Introducing an additional AZ is a click of a button and some relatively trivial infrastructure-as-code scripting, even at this scale; see the sketch below.
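For instance, stretching an existing auto scaling group over a second AZ is a one-call change in boto3 (the group name and zones are hypothetical):

    import boto3

    asg = boto3.client("autoscaling", region_name="us-east-1")

    # Spread a hypothetical "web-tier" group across two AZs; the group
    # then rebalances instances between the zones on its own.
    asg.update_auto_scaling_group(
        AutoScalingGroupName="web-tier",
        AvailabilityZones=["us-east-1a", "us-east-1b"],
    )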
Running your own data center and AZs, on the other hand, requires a very tight relationship with your data center provider at global scale.
For a platform like Roblox where downtime equals money lost (i.e. every hour of the day people make purchases), there is a real tangible benefit to using something like AWS. 72 hours of downtime is A LOT; we're talking potentially millions of dollars of real value lost and millions in potential brand value lost. I'm not saying definitively they would save money (in this case, profit impact) by going to AWS, but there is definitely a story to be had here.
> Running all Roblox backend services on one Consul cluster left us exposed to an outage of this nature. We have already built out the servers and networking for an additional, geographically distinct data center that will host our backend services. We have efforts underway to move to multiple availability zones within these data centers; we have made major modifications to our engineering roadmap and our staffing plans in order to accelerate these efforts.
If they were in AWS, they could have run Consul across multiple AZs and rolled changes out incrementally.
So that next time they can spend 96 hours on recovery, this time adding a split-brain issue to the list of problems to deal with. Jokes aside, the write-up is quite good; after thinking about all the problems they had to deal with, I was quite humbled.
It doesn't really explain how they reached the conclusion that that would help. Like, yes, it's a problem that they had a giant Consul cluster that was a SPOF, but you can run multiple smaller Consul clusters within a single AZ if you want.
Honestly it reads to me like an internal movement for a multi-AZ deployment successfully used this crisis as an opportunity.
For example, parts of AWS itself: us-east-1 having issues? Looks like the AWS console all over the world has issues.
You constantly hear about multi-zone, multi-region, multi-cloud. But in practice, when things break, you hear all these stories of them running in a single region+zone.
Surprised it was a single availability zone without redundancy. Having multiple fully independent zones seems more reliable and fail-safe.