
From this operations engineer's perspective, there are only 3 main things that bring a site down: new code, disk space, and 'outages'. If you don't push new code, your apps will be pretty stable. If you don't run out of disk space, your apps will keep running. And if your network/power/etc doesn't mysteriously disappear, your apps will keep running. And running, and running, and running.

The biggest thing that brings down a site is changes. Typically code changes, but also schema/data changes, infra/network/config changes, etc. As long as nothing changes, and you don't run out of disk space (from logs for example), things stay working pretty much just fine. The trick is to design it to be as immutable and simple as possible.

There are other things that can bring a site down, like security issues, or bugs triggered by unusual states, too much traffic, etc. But generally speaking those things are rare and don't bring down an entire site.

The last thing off the top of my head that will absolutely bring a site down over time, is expired certs. If, for any reason at all, a cert fails to be regenerated (say, your etcd certs, or some weird one-off tool underpinning everything that somebody has to remember to regen every 360 days), they will expire, and it will be a very fun day at the office. Over a long enough period of time, your web server's TLS version will be obsoleted in new browser versions, and nobody will be able to load it.



It's crazy to think about, but many people who use and build software today, including HN readers/commenters, are young enough to have only been exposed to the SaaS, cloud-first era, where software built with microservices deployed from CI/CD systems multiple times per day is just the way things are done.

You're totally right; if you don't make changes to the software, it's unlikely to spontaneously stop working, especially after that first 6-12 months of "hardening" where bugs are found and patched.

Many people working in tech have never been exposed to a piece of software which isn't being constantly changed in small increments and forced upon end users. People are assuming that software is inherently unstable simply because they never use anything that isn't a "cloud service".

This probably comes off as "old man yells at cloud" but I'm not trying to bash cloud here. The cloud/SaaS approach has a ton of advantages for both consumers and businesses. But the average tech person in their 20s vastly underestimates how stable software can be when you aren't constantly pushing new features.


Absolutely. I remember we built an unimaginably brittle application many years ago. I think it was running on Windows XP, gluing together a complex system with COM calls into a single-page webapp, even before React was a thing. It was built on a very small budget, serving the core business of a very tiny company.

Like maybe 8 years later I found out it was still humming along happily, without really even a sysadmin attending to it, on a single workstation using consumer hardware, servicing the company that had grown tenfold in size.

It blew my mind it still just worked all these years.


Your broad point is obviously correct (most outages are caused by code or config changes) but there are still classes of failures that can happen without any real changes, like various performance degradations (maybe your table grows too large) or occasional catastrophic failures from things like disk space or id overflow or something.

There's also the stability of third-party systems: forced deprecations, security EOL, etc. The cert expiration stuff people have been mentioning is in this category too. I wouldn't be surprised if something does slip through the cracks at Twitter in the next 4-6 months.


A few of us remember the time Slashdot's 24-bit comment table ID keys overflowed; that was a fun couple of days.


also: regulatory changes. if you can't function within the law, that's tantamount to a critical bug.


I remember how it was over a decade ago and one hallmark of such systems was that they were easy to exploit.

My friend in college would just go into Wordpress admin panels and the like by using common exploits because nobody updated PHP on their VPSes back then.

As someone who spent most of their career to date as a front-end developer I learned that as long as they have the budget, stakeholders are insatiable. It's just that ten years ago most of their ideas were either technically not feasible or very expensive.

Nowadays browsers are much more capable, so the pressure to produce more features is much greater.

To our own peril, we can do much more now.


The other side of that is browsers. Even if you don’t change your code, the platform people are running your code in changes, automatically in many cases. New JS or CSS behavior in next safari or chrome? You need to patch/push to accommodate running environments that are outside your control.


The old space jam site worked for ages, and would still work if they hadn’t taken it down. The web is pretty good about keeping backwards compatibility.


Except the original hamster dance, which doesn't display correctly


JS in fact has very good backwards compatibility; that, by the way, is the reason why "old stuff" is not removed and is still in the language.


Sounds like a good argument for minimal Javascript.


The ecosystem changed since then. Now updates of your binary's environment are more frequent, and often enforced. How often did you update or patch your Windows 98 or NT?

Today, it's false to assume that fire and forget releasing will work even for standalone Windows binaries.


> The cloud/SaaS approach

You don't have to push constant updates for a cloud / SaaS product - many choose to - but ultimately you don't have to.

A year of 'no new features' should be something customers and vendors alike benefit from.


Depending on your stack, security patching can constitute a non-trivial amount of changes as well.


Another thing we noticed at Netflix was that after services didn’t get pushed for a while (weeks), performance started degrading because of things like undiscovered memory leaks, threads leaks, disks filling up. You wouldn’t notice during normal operations because of regular autoscaling and code pushes, but code freezes tended to reveal these issues.


We used to have a horribly written node process that was running in a Mesos cluster (using Marathon). It had a memory leak and would start to fill up memory after about a week of running, depending on what customers were doing and if they were hitting it enough.

The solution, rather than investing time in fixing the memory leak, was to add a cron job that would kill/reset the process every three days. This was easier and more foolproof than adding any sort of intelligent monitoring around it. I think an engineer added the cron job in the middle of the night after getting paged, and it stuck around forever... at least for the 6 years I was there, and it was still running when I left.

We couldn't fix the leak because the team that made it had been let go and we were understaffed, so nobody had the time to go and learn how it worked to fix it. It wasn't a critical enough piece of infrastructure to rewrite, but it was needed for a few features that we had.
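
Purely as an illustration, that kind of fix amounts to a one-liner like this (the schedule and process name are made up; the safety net is that Marathon relaunches whatever gets killed):

    # kill the leaky worker every third day at 04:00; the orchestrator brings it back with a fresh heap
    0 4 */3 * * root pkill -f leaky-worker.js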


The USA at some point had an anti-missile system that needed periodic reboots because it was originally designed for short deployments, so the floating-point variable for the clock would start to lose precision after a while.



Floating-point clocks do lose precision after a long enough time though; see https://randomascii.wordpress.com/2012/02/13/dont-store-that...

Storing floating-point coordinates is, for example, what causes the "farlands" world generation behavior in Minecraft.


Which, of course, led to people dying when the drift was too great.



I once managed a cluster of worker servers with an @reboot cronjob that scheduled another reboot after $(random /8..24/) hours. They took jobs from a rabbitmq queue and launched docker containers to run them, but had some kind of odd resource leak that would lead to the machines becoming unresponsive after a few days. The whole thing was cursed honestly but that random reboot script got us through for a few more years until it could be replaced with a more modern design.


This is a feature in many HTTPDs, WSGI/FastCGI apps, possibly even in K8s. After X requests/time, restart worker process. Old tricks are the best tricks ;)


"Have you tried turning it off and on again?"


You don't even need that; the kernel OOM killer would take care of this eventually. Unless it's something like Java, where the garbage collector would begin to burn CPU.


The OOM killer doesn't restart the processes it kills (which it picks more or less at random, unless configured); it just kills them.


Unless the OOM killer kills the wrong process. Ages ago we had a userspace filesystem (gpfs) that was of course one of the oldest processes around, and it consumed lots of RAM. When the OOM killer started looking for a target, of course one of the mmfsd processes was selected, and that resulted in an instantaneous machine lockup (any access to that filesystem would block forever in a system call that depended on the userspace daemon returning, which it never did). It was fun to debug.


You can prevent a process from being killed by OOM killer: https://backdrift.org/oom-killer-how-to-create-oom-exclusion...


If it's deployed in K8s, it would be restarted automatically after dying.


Then you have two problems.


Agreed. One of the craziest bugs I had to deal with: we had a distributed system using lots of infrastructure, and it started having trouble communicating with random nodes and sub-systems. I spent 3 hard days finding a Linux kernel bug where the ARP cache was not removing least-recently-accessed network addresses. Normally this wouldn't be a big deal, because few networks would fill up the default ARP cache size. That was even true for ours, except that we would slowly add and remove infrastructure over the course of a couple of months, until eventually the ARP cache would fill and random network devices would get dropped... It wasn't even our distributed application code... Some bugs take time to manifest themselves in very creative ways.


Yeah, network scaling bugs are the most fun. The one I liked the most was when after expanding a pool of servers, they started to lose connectivity for a few minutes and then come back a minute or so later as if nothing happened.

Turns out we accidentally stretched one server VLAN too wide, to roughly 600 devices within one VLAN within one switch. The servers had more-or-less all-to-all traffic, and that was enough to generate so many ARP requests and replies that the switch's supervisor policer started dropping them at random, and after ten failed retries for one server the switch just gave up and dropped it from the ARP table.

Of course the control plane policer is global for the switch, so every device connected to the switch was susceptible, not just the ones in the overextended VLAN.


VLANs are a convenience that is the enemy of performance and understandability.


They're a great alternative to having to go down the datacentre mines and replug a few thousand cables, though.


Goodness, what kind of process/tools did you use to track that problem down?


My team had a similar issue with the ARP cache on AWS when we used Amazon Linux as an OS for cluster nodes, and Debian for the database host. When new tasks were starting some had random timeouts when connecting to the database.

It turned out that the Debian database host had bad ARP entries (an IP address was pointing to a non-existing MAC Address) caused by frequent reuse of the same IP addresses.

Debian has a default ARP cache size that's larger than Amazon Linux's (I think it's entirely disabled on AL?).

As for the tooling we used to track it down, it was tcpdump. We saw SYNs getting sent, but no ACKs coming back. A few more tcpdump flags (-e shows the hardware addresses) and we discovered mismatched MAC addresses.
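
Roughly the kind of invocation that was involved (the interface and filter here are just an example):

    # -n: skip DNS lookups, -e: print link-level (MAC) headers, filter: SYN packets only
    tcpdump -ne -i eth0 'tcp[tcpflags] & tcp-syn != 0'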


We didn't have much tooling outside of the typical things you would find in a Linux distro. It started with trying to isolate a node having issues, then looking at application and kernel logs, then testing the connection to the other node via ping or telnetting to a port I knew should be open. Found out I couldn't route, and it was just process of elimination from that point until we worked our way to looking at a full ARP cache. Tested that we could increase the ARP cache size to fix the issue, then figured out why the kernel wasn't releasing entries correctly by looking at the source code for the release we were using. I'm simplifying some of the discovery, but there was no magic, unfortunately.


If resource leaks became a serious issue I imagine they could buy time by restarting. I'm curious what the causes were for code freezes. At Meta they would freeze around Thanksgiving and NYE because of unusually high traffic.


Same, code freezes were typically around holidays (when you know traffic will be elevated, engineers will be less available and you want increased stability)


I once debugged a kernel memory leak in an internal module that manifested after around 6 years of (physical) server uptime. There are surprises lurking very far down the road.


We joked about adding this to the NodeQuark platform:

    // Fix Slow Memory Leaks
    setTimeout(() => process.exit(1), 1000 * 60 * 60 * 24)


might want to add some random jitter in there :) Imagine your entire NodeQuark cluster decided to restart at the same time
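
Something like this spreads the restarts out (the one-hour spread is arbitrary):

    // same idea, but each instance picks its own moment within an extra hour,
    // so a whole cluster doesn't exit in the same minute
    const base = 1000 * 60 * 60 * 24;              // the original 24h
    const jitter = Math.random() * 1000 * 60 * 60; // up to 1h of spread
    setTimeout(() => process.exit(1), base + jitter)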


Back in the Pleistocene I worked in a ColdFusion shop (USG was all CF back then and we were contractors), and we had two guys whose job was to bounce stacks when performance fell below some defined level.


> you don't run out of disk space (from logs for example)

For a social media / user-generated content application, the macro storage concerns are a lot more important than the micro ones. By this I mean, care more about overall fleet-wide capacity for product DBs and media storage, instead of caring about a single server filling up its disk with logs.

With UGC applications, product data just grows and grows, forever, never shrinking. Even if the app becomes less popular over time, the data set will still keep growing -- just more slowly than before.

Even if your database infrastructure has fully automated sharding, with bare metal hosting you still need to keep doing capacity planning and acquiring new database hardware. If no one is doing this, it's game over, there's simply nowhere to store new tweets (or new photos, or whichever infra tier runs out of hardware first...)

Staffing problems in other eng areas can exacerbate this. For example, if automated bot detection becomes inadequate, bot posting volume goes way up and takes up an increasing amount of storage space.


> absolutely bring a site down over time, is expired certs

From today's Casey Newton's newsletter:

In early December, a number of Twitter’s security certificates are set to expire — particularly those that power various back-end functions of the site. (“Certs,” as they are usually called, serve to reassure users that the website they are visiting is authentic. Without proper certs, a modern web browser will refuse to establish the connection or warn users not to visit the site). Failure to renew these certs could make Twitter inaccessible for most users for some period of time.

We’re told by some members of Twitter’s engineering team that the people responsible for renewing these certs have largely resigned — raising concerns that Twitter’s site could go down without the people on hand to bring it back. Others have told us that the renewal process is largely automated, and such a failure is highly unlikely. But the issue keeps coming up in conversations we have with current and former employees.


I can imagine both cases being true, that the renewal process is automated and that certs won't get renewed because institutional knowledge has left the door. Where I'm at, service-to-service TLS certificates (the bulk of our certs) are automatically rotated by our deploy systems. But there are always the edge cases: the certificates manually created a long time ago (predating any standardized monitoring systems) with long expiry dates, and certificates for systems that simply can't run off the standard infrastructure. Sometimes, they'll bring down systems with low SLOs; other times, they'll block all internal development.


> the certificates manually created a long time ago (predating any standardized monitoring systems) with long expiry dates

Like the ever-popular "expires in 10 years" long-lived certificates. I've seen that happen: the VPN certificate, probably created by one of the founders 10 years ago when the company was tiny, expired one day without warning, breaking the VPN for all employees until it could be replaced (manually on every device).


> certs won't get renewed because institutional knowledge has left the door.

The parallel reality where you need to be a veteran SRE with an MIT degree to operate the arcane tool 'certbot'.


They aren't talking about the front-end certificates which expire in Feb 2023.

It's likely the ones to encrypt all of the traffic involving the Finagle micro-services, data sources, observability systems etc. And I suspect the issue there is that you are going to need to do a rolling restart.

Which I personally would not want to be doing if 90% of the company is no longer there.


The way TLS was integrated into Finagle, most services should not need to be restarted to pick up and use their new certs. That said, there are certain core services that will require manual intervention, and there will inevitably be some services that should auto-update but do not.


> There are other things that can bring a site down, like security issues, or bugs triggered by unusual states, too much traffic, etc.

In my experience as a data engineer, unusual states are one of the leading causes of issues, at least after something is built for the first time. You can spend half a year running into weird corner cases like "this thing we assumed had to always be a number apparently can arbitrarily get filled in with a string, now everything is broken."

Also, conditions changing causing code changes is the norm, not the exception, definitely in the beginning but also often later. Most services aren't written and done - they evolve as user needs evolve and the world evolves.


> As long as nothing changes, and you don't run out of disk space (from logs for example), things stay working pretty much just fine.

> ...

> There are other things that can bring a site down, like security issues, or bugs triggered by unusual states, too much traffic, etc. But generally speaking those things are rare and don't bring down an entire site.

Aren't these changes inevitable, though? There is no such thing as bug-free code.

Another thing that forces consistent code changes is compliance: any time a 0-day is discovered or some library we're using comes out with a critical fix, we would have to go update things that hadn't been touched, sometimes in years.

At my last job, I spent a significant amount of time just re-learning how to update and deploy services that somebody who left the company years ago wrote, usually with little-to-no documentation. And yes, things broke when we would deploy the service anew, but we were beholden to government agencies to make the changes or else lose our certifications to do business with them.

Eventually, Twitter will have to push code changes, if only to patch security vulnerabilities. Just waiting for another Heartbleed to come around...


Software never goes stale; it's the environment around it that does.

Something from the 70s works perfectly fine, except it can't run on anything bare-metal any longer, and the hard drives etc. have all long since failed or their PSU capacitors have blown... So Twitter will absolutely rot; how fast depends on several factors.

I personally suspect the infrastructure used to build Twitter will rot faster than Twitter itself, and of course the largest, most dramatic source of rot is the power required to run it - several large communities have abandoned it already, making it much less relevant, meaning the funding for it will also dry up, meaning more wasted CPU cycles and the like.

That's of course assuming it's left in some sort of limbo, and it doesn't sound like that's the case with the current management; it's only a matter of time before it topples over from shitty low-rate contractor code. Honestly, the app worked like so much hot garbage already; I could see it falling over itself and imploding with a couple of poorly placed loops...


This assumes security doesn’t matter. You can’t run on stale code and be secure for too long, at least for anything non trivial. I imagine even if Twitter doesn’t add any functionality at so, it will still take hundreds of patches per Yasser.


Something from the 70s wasn't connected to the internet, with millions of people using it every day and finding every single edge case, or trying to break into it to steal valuable data. It was definitely not beholden to the same government regulations as a social media site running in the 21st century.


But real-world conditions can force code changes. For example, a region abandoning daylight saving time, or a court order over copyright infringement. Someone unqualified working on a system they are unfamiliar with could blow it up. Losing that knowledge of how the system works is a risk.


> But real world conditions can force code changes

Security fixes.


Wasn't it last year that a bug in a well-used (and pretty easy to use) Java lib caused mayhem for SRE/SEC services in every bank? It was a fun two weeks. I worked like 50 hours in the first three days.


Log4j


An example where something that correlates with time can reveal pre-existing bugs long after the system was chugging along just fine: counter limits/overflows.

Simple example: you have a DB table with an auto-incrementing primary key. You chose a small integer type for it, and for years this just worked fine; the day you finally saturate that integer type, you can no longer insert rows into the table. Now imagine this has cascading effects in other systems that depend on this database indirectly, and you end up with an "outage".
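
A back-of-the-envelope sketch of how little warning you get, assuming MySQL-style signed integer columns (the current max id here is made up):

    // ceilings for signed auto-increment key types (MySQL-style)
    const limits = {
      SMALLINT: 32767,     // 2^15 - 1
      MEDIUMINT: 8388607,  // 2^23 - 1
      INT: 2147483647,     // 2^31 - 1
    };

    // fraction of the keyspace still unused; alert long before this reaches zero
    function headroom(currentMaxId, type) {
      return 1 - currentMaxId / limits[type];
    }

    console.log(headroom(2100000000, "INT")); // ~0.022 -> time to migrate to BIGINT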


> The biggest thing that brings down a site is changes

Absolutely agreed. In that vein, there is such a thing as too much automation. Sometimes build chains are set up to always pull in the newest and the freshest -- and given the staggering number of dependencies software generally has, this might mean small changes all the time. Even when your code does not change, it can eventually break.

It's been my experience that a notable part of software development (in the cloud age, anyway) is about keeping up with all the small incremental changes. It takes bodies to keep up with this churn, bodies which twitter now does not have.

It'll be interesting to keep observing this. So far it's been a testament to the teams that built it and set up the infra -- it keeps running, despite a monkey loose in a server room. It's very impressive.


"Outages": this is an enormous ellipsis.

* Power outages and general acts of God

* Resource utilization

How do your databases perform when their CPUs are near capacity? Or disks? Or I/O? I've seen Postgres do some "weird s%$#" where query times don't go exponential, but they do go hockey stick.

* Fan-out and fan-in

These can peg CPU, RAM, I/O. Peg any one of these and you're in trouble. Even run close to capacity for any one of these and you're liable to experience heisenbugs. Troublesome fan-out and fan-in can sometimes be a result of...

* Unintended consequences

The engineering decision made months or years ago may have been perfectly legitimate for the knowledge available at the time. However, we live in a volatile, uncertain, complex, and ambiguous (VUCA) world; conditions change. If your inputs deviate significantly, qualitatively or quantitatively, you risk resource utilization issues or, possibly, good ol' fashioned application errors.

"No battle plan survives contact with the enemy." -- Stormin' Norman

Same with software systems. They're living entities that can only maintain homeostasis so long as their environment remains predictable within norms. Deviate far enough from that and boom.


Any sort of cached object expiring might bring the servers down. Who knows when the Death TTL will come?


I worked as an engineer for a very large non tech company (but used a lot of tech, both bought and in-house). We had 100s of teams supporting services, internal apps (web and mobile), external apps (web and mobile), and connections to vendors plus a huge infrastructure in the real world that interconnected to all of this. One time someone changed something in a single data center (I vaguely remember some kind of DNS or routing update) and every single system worldwide failed in a short time. Even after the issue was resolved, it took most of a day and hundreds of people to successfully restart everything, all while our actual business had to continue without pissing off all of our customers. The triage was brutal as to what mattered most.

You can't do this without a lot of people. Sure you could pare it down, maybe improve some architecture, but without a ton of people involved who understand the systems and how they connect, when things might go south they may never return.


I have an old project I gave up on - I haven't touched it, done any code changes or maintenance in... almost a decade? Yet at least one stubborn client is still using it, successfully. And it's not an old guy in a living room, but an honest small-sized company that has this software as the core of its operations.

So yeah, I totally agree with you. No code changes = long life.


You should be proud. I hope that one day some software I write can serve people for that long.


You didn't mention data scale. Just because the disks have room doesn't mean the data access patterns in perfectly stable code will keep performing well as the data keeps multiplying, if old data isn't somehow moved to colder storage.


> There are other things that can bring a site down, like [...] too much traffic[.] But generally speaking those things are rare and don't bring down an entire site.

I agree with your assessment, but I do want to highlight that this condition is not rare for Twitter. Load is very spiky, sometimes during predictable periods (e.g., the World Cup, New Year's Eve) and sometimes during unpredictable periods (e.g., Queen Elizabeth II's death, the January 6th US Capitol attack). It isn't going to cause a total site failure (anymore), but it can degrade user experience in subtle or not-so-subtle ways.

An aside on the "anymore", there was a time when the entire site did go down due to high-traffic events. A lot of the complication in the infrastructure was built to add resiliency and scalability to the backend services to allow Twitter to handle these events more gracefully. That resiliency is going to help keep the services up even if maintenance is understaffed and behind a learning curve.


Sorry for hijacking your expertise, but why no mention of memory leaks? In my experience they can cause really weird bugs not obvious at first, and are difficult to reproduce, i.e. triggered by edge cases that happen infrequently. Or are you assuming services automatically restart when memory is depleted?


It depends how well the service was "operationalized":

1) Best case: Monitoring of the service checks for service degradation outside of a sliding window. In this case, more than X percent of responses are not 2xx or 3xx. After a given time period (say, 30 minutes of this) the service can be restarted automatically. This allows you to auto-heal the service for any given "degradation" coming from that service itself. (This does not detect upstream degradation, of course, so everything upstream needs its own monitoring and autohealing, which is difficult to figure out, because it might be specific to this one service. The development/product team needs to put more thought into this in order to properly detect it, or use something like chaos engineering to see the problem and design a solution)

2) If you have a health check on the service (that actually queries the service, not just hits a static /healthcheck endpoint that always returns 200 OK), and a memory leak has caused the service to stop responding (but not die), the failed health check can trigger an automatic service restart.

3) The memory leak makes the process run out of memory and die, and the service is automatically restarted.

4) Ghetto engineering: Restart the service every few days or N requests. This extremely dumb method works very well, until you get so much traffic that it starts dying well before the restart, and you notice that your service just happens to go down on regular intervals for no reason.

5) The failed health check (if it exists) is not set up to trigger a restart, so when the service stops responding due to memory leak (but doesn't exit) the service just sits there broken.

6) Worst case: Nothing is configured to restart the service at all, so it just sits there broken.

If you do the best practice and put dynamic monitoring, a health check, and automatic restart in place, the service will self-heal in the face of memory leaks.
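
A minimal sketch of case 2 for a plain Node service; checkDatabase() here is a hypothetical stand-in for whatever dependency probe makes sense:

    // A health endpoint that actually exercises a dependency, so a wedged-but-alive
    // process fails the check and gets restarted by whatever supervises it.
    const http = require("http");

    async function checkDatabase() {
      // stand-in: in reality, run something like "SELECT 1" with a short timeout
      return true;
    }

    http.createServer(async (req, res) => {
      if (req.url === "/healthcheck") {
        let ok = false;
        try { ok = await checkDatabase(); } catch (e) { ok = false; }
        res.writeHead(ok ? 200 : 503);
        res.end(ok ? "ok" : "degraded");
      } else {
        res.writeHead(404);
        res.end();
      }
    }).listen(8080);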


> If, for any reason at all, a cert fails to be regenerated (say, your etcd certs, or some weird one-off tool underpinning everything that somebody has to remember to regen every 360 days), they will expire, and it will be a very fun day at the office. Over a long enough period of time, your web server's TLS version will be obsoleted in new browser versions, and nobody will be able to load it.

At least for expired certs, most people have learned the hard way just how bad that is, and have either implemented automated renewal (thank heavens for cert-manager, LetsEncrypt, AWS ACM and friends) or, where that doesn't work (MS AD...), monitoring.
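
For the monitoring half, even a dumb watchdog goes a long way. A sketch of the idea in Node (the host and the 30-day threshold are just examples):

    // Connect, read the peer certificate, and warn when expiry is getting close.
    const tls = require("tls");

    function daysUntilExpiry(host, port = 443) {
      return new Promise((resolve, reject) => {
        const socket = tls.connect({ host, port, servername: host }, () => {
          const cert = socket.getPeerCertificate();
          socket.end();
          resolve((new Date(cert.valid_to) - Date.now()) / 86400000);
        });
        socket.on("error", reject);
      });
    }

    daysUntilExpiry("example.com").then((days) => {
      if (days < 30) console.warn(`cert expires in ${Math.floor(days)} days`);
    });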


I'll add one: when usage scales beyond anticipated levels. Then the code that was "good enough" no longer is, and serious intervention may be required - by senior engineers with history.


Takes me back to the first broken mess of an environment I worked in. Change freezes were a way of life there and, magically, nothing would break while they lasted.

Now, those change freezes even extended to preventative maintenance, one of the dual PSUs in a core switch went bad and we couldn't get an exception to replace it... for 6 months. We got an exception when the second one went down and we had to move a few connections to its still alive mate.


> The biggest thing that brings down a site is changes.

Well, Elon is talking about a massive amount of changes coming down the pipe, so I guess we'll see how that goes!


I think that without code pushes they won't be able to maintain compatibility with updated APIs from third parties, new hardware, new encryption requirements from clients or browsers, etc. It's a slow descent into chaos indeed.


A browser update is a form of "new code". It's rare, but having to work around newly introduced browser bugs does happen.


And vulnerabilities.



