I think you're misinterpreting the comment you're replying to. They would agree with you that the tiny SRE team described in the article sounds very effective, and likely has a lot to do with why the site is still up and running. Work like that should continue. But if 1-3 people can have that degree of impact, what are the other 8,000 doing? (Again, this is just me attempting to interpret the point made by the parent, not trying to make one myself.)
Once you get the automation going, the number itself doesn't matter that much.
You might have 200 different apps (hell, we have close to that with only 3 people in ops), but a competent team will make sure they all deploy the same way and are monitored the same way.
And once you go from "a server" to multiple servers, whether the final count is 20 or 200 isn't that important until you start hitting, say, switching capacity, and if you're in the cloud that's usually not your concern anyway.
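For what it's worth, "deploy the same way, monitor the same way" mostly means every app is just a row of parameters fed through one shared template. Here's a minimal sketch of that idea; the service list, template format, and alert rules are invented for illustration, not anyone's actual tooling:

```python
# Hypothetical sketch of "every app deploys and is monitored the same way":
# each service is just parameters run through one shared template.
# SERVICES, render_unit, and alert_rules are invented names, not real tooling.
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    port: int
    replicas: int = 2


SERVICES = [
    Service("billing-api", 8081),
    Service("image-resizer", 8082, replicas=4),
    # ...and a couple hundred more entries, all described the same way
]


def render_unit(svc: Service) -> str:
    """Every app gets the identical deployment template; only the parameters differ."""
    return (
        f"[service:{svc.name}]\n"
        f"port={svc.port}\n"
        f"replicas={svc.replicas}\n"
        f"healthcheck=http://localhost:{svc.port}/healthz\n"
    )


def alert_rules(svc: Service) -> dict:
    """Identical monitoring rules for every app: latency, errors, saturation."""
    return {
        "service": svc.name,
        "alerts": ["p99_latency > 500ms", "error_rate > 1%", "cpu > 80%"],
    }


if __name__ == "__main__":
    for svc in SERVICES:
        print(render_unit(svc))
        print(alert_rules(svc))
```

Once every app is just another entry in that list, adding app number 201 is a config change, not a project.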
Our biggest site (about a dozen million users, a bunch of services and caching underneath, a few gbits of traffic) took zero actual maintenance for 2022, "it just works"; the only work was implementing new stuff. It took some time to get to that state, but once you do, aside from hardware failures, it "runs itself".
> Our biggest site (about a dozen million users, a bunch of services and caching underneath, a few gbits of traffic) took zero actual maintenance for 2022, "it just works"; the only work was implementing new stuff. It took some time to get to that state, but once you do, aside from hardware failures, it "runs itself".
Nobody is adding changes that blow out the DB, or some inefficient code that burns through CPU much faster?
It's not 1-3 people. The entire SRE team globally - including the technicians and the engineers with server access - is easily going to be in the hundreds.
The SRE manager is in charge of keeping it all running. He isn't running around the world swapping out servers. He also isn't sitting back with his feet up thinking "All done - now how are my Pokemon doing?"
It's a dynamic process with quality monitoring, budgeting and reports, post-mortems, continual experiments to see if uptime can be improved, and redesigns as hardware and software change.
It's part of the backend, but is only loosely coupled to the content management and delivery system, the ad machine, moderation, marketing, and so on, all of which are going to have similarly complex structures.
It doesn't follow. The article posits that "many people think Twitter's headcount was bloated" and then proceeds to describe the (presumably) really efficient work of a small SRE team. These two parts seem completely disconnected from each other - neither one proves, disproves, or follows from the other - so it's unclear why the former was mentioned at all.
His personal experience was zero bloat. He was a team of one covering a critical function for the company. He isn't saying this proves without a doubt that there is no bloat, but he didn't see any in his time there. It seems like a reasonable addition to the conversation to me.
It's a shame that most of the conversation going on here consists of extrapolated arguments based on this article and other anecdotes. The problem starts when the ones making their points let beliefs about "how things should be" outweigh "how things really were".
No, I think the article makes it very clear what the value and function of SRE is. The point of the comment you're responding to is that the author was the only one doing this—not a team of ten, not even a team of two. This is Twitter's whole cache system! Probably the most important part of their hardware stack, in terms of "is the site performing well for users". There are other SRE needs at Twitter, but not that many. What were the other 9k people at the company doing? It raises the question.
Software doesn't break down from heat. An app I write today will run until the hardware dies. I have a PalmOS app I wrote in 1998 that still runs perfectly.
"software doesn't break down from heat. An app I write today will run until the hardware dies. I have a palm_os app I wrote in 1998 that still runs perfectly."
In an organization of any appreciable size, things change all the time, and I'm not just talking about code (for which you could have a code freeze in an emergency situation like this); the external systems you're connected to could change for reasons completely out of your control. Content changes can break stuff because of bugs in your code. Legacy systems can require all sorts of ongoing tweaking and maintenance. And, yes, heat can break your software if the server it's running on overheats.
Agreed... but let's say you fire 99% of your engineers and declare a code freeze (because there's no one left to write code).
Then in theory, if you own the hardware and you've locked down the libraries, that code could keep running for a long time. Agreed, it's not a Palm app, but with everything locked down, I'd argue it's safe.
But yes, I can see third-party stuff changing. Payment processors and such. Those changes don't happen fast though, and certainly not so fast that a company the size of Twitter can't work out a sunsetting plan.
As for heat breaking software if the server it's running on overheats: I have a feeling Twitter has a system in place to take a faulty server out of rotation.
My point was, comparing code to a car is silly. A car needs maintenance. Code in a code freeze does not.
Software bit-rots. That app from 1998 doesn't interact with today's world. As the world evolves around us, needs change and software has to change to keep up. That's not to say there aren't companies out there that rely on some ancient Windows 98 program running on similarly ancient hardware, because there are. But Twitter as a piece of software isn't some static thing. Its needs are constantly changing and the software has to keep up.
Your PalmOS app doesn't run on any modern hardware except under emulation. (Which is sad, I loved my Centro and held onto it for as long as I could.) The last release of PalmOS was in 2007, 15 years ago. Most hardware from that long ago is dead, and thus your software is dead too, broken down by the entropy of the hardware it depends on.
Agreed. But my comment was in comparison to a car needing maintenance. If nothing changes and I drive my car for 5 years without taking a look under the hood, it will be a mess. If not a stitch of work is done on it, I'm in trouble.
If however I have an app and I don't look under the hood for 5 years, it could still run as well as it did when I locked it down. As you said, some companies run on apps written for Windows 98. Those apps still work as they always did.
I don't think its needs are constantly changing. It could freeze for weeks or months: leave existing bugs alone and lock the versions in place.
I do agree that it will eventually need to change, but that's where selective hiring comes in. Oh, system X isn't great? Let's find a team for that; everything else remains black-boxed.
Even discounting external changes, any reasonably complex system needs maintenance because time moves on and new interactions happen.
How many SSL certificates (internal or external) need re-issuing per month? Some of that can be automated, but in an organization as large and complex as Twitter some will be bespoke and manual, and a code freeze won't stop the clock.
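To make the certificate point concrete, the automatable slice is roughly "walk the inventory and flag anything close to expiry". A rough sketch, with made-up endpoints and a made-up 30-day threshold; the bespoke, manually issued certs are exactly the ones a script like this can't cover:

```python
# Sketch of the automatable part of certificate hygiene: check every known
# endpoint and flag certs that expire soon. ENDPOINTS and the 30-day
# threshold are invented for illustration; a real fleet would feed this
# from an inventory system.
import socket
import ssl
import time

ENDPOINTS = ["api.example.internal", "cache.example.internal"]  # hypothetical


def days_until_expiry(host: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' looks like 'Jun  1 12:00:00 2025 GMT'
    expiry_ts = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expiry_ts - time.time()) // 86400)


if __name__ == "__main__":
    for host in ENDPOINTS:
        days = days_until_expiry(host)
        if days < 30:
            print(f"RENEW SOON: {host} expires in {days} days")
```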
How many new CVEs per month apply to Twitter's services and tooling? How many race conditions or other bugs are lurking, just waiting for the right time or traffic pattern to emerge? Twitter can't freeze inbound traffic without dire consequences.
Twitter is like your car, except that it's always running.
To be honest, I was very surprised to hear what a cache SRE was working on. It sounded like he had to build all the handling of hardware issues, rack awareness, and other basic datacenter plumbing himself. Does that mean every specialized team also had to do it? Why would a cache engineer need to know about hardware failures at all? It's the datacenter team's responsibility to detect and predict issues and shut servers down gracefully if possible. It should be completely abstracted away from the cache SRE, the way a cloud abstracts it from you. Yet he and his team spent years on automation around this stuff using a Mesos stack that they probably regret adopting by now.
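For readers wondering what "rack awareness" means for a cache team: the cache layer has to know which hosts share a failure domain so that no shard keeps all of its replicas in one rack. A toy sketch of the placement rule with an invented host inventory, nothing to do with their actual Mesos-based setup:

```python
# Toy illustration of "rack awareness" for a cache layer: place the replicas
# of each shard on hosts in different racks, so losing one rack never takes
# out every copy of a shard. The host inventory and shard count are made up.
from collections import defaultdict

HOSTS = {  # host -> rack (hypothetical inventory)
    "cache-01": "rack-a", "cache-02": "rack-a",
    "cache-03": "rack-b", "cache-04": "rack-b",
    "cache-05": "rack-c", "cache-06": "rack-c",
}


def place_replicas(shard: int, replicas: int = 2) -> list[str]:
    """Pick one host per rack, walking the racks round-robin per shard."""
    by_rack = defaultdict(list)
    for host, rack in sorted(HOSTS.items()):
        by_rack[rack].append(host)
    racks = sorted(by_rack)
    placement = []
    for i in range(replicas):
        rack = racks[(shard + i) % len(racks)]             # spread across racks
        host = by_rack[rack][shard % len(by_rack[rack])]   # spread within a rack
        placement.append(host)
    return placement


if __name__ == "__main__":
    for shard in range(4):
        print(f"shard {shard}: {place_replicas(shard)}")
```

In a cloud, the provider's placement groups or availability zones give you this for free; building and operating it yourself is exactly the kind of work the article describes.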
I feel like in this zoomed-in case of Twitter's caches, what they were working on is questionable, but the team size seems adequate to the task. So my takeaway is that, like any older, larger company, Twitter accumulated a fair amount of tech debt and there is no one to take a large-scale initiative to eliminate it.
It takes _effort_ to make it work this smoothly now, _and in the future_.
SRE is about _preventing_ issues. Not mopping up after them.
To me, the article read like every successful sysadmin story: there are no fires, so the sysadmin must be bloat.