I agree completely. The current FAA outage is a good example of this. What if the system responsible for NOTAMs had been the same system responsible for sending system outage messages? They weren’t, which meant status could be communicated while mitigation was done on a system with unrelated concerns.
At the same time, a few-person startup should probably focus on whatever lets them deliver the fastest. I’ve seen that work with relatively monolithic systems and with SOA; tooling choice makes a huge impact.
We had a major cockup at work a few years ago caused by bad organizational choices and Conway’s Law.
We had a disk array go sideways, which is when we learned that some dumb motherfucker had put our wiki on the same SAN as production traffic. You know, the wiki where you keep all your run books for solving production issues? Everyone was furious and that team lost some prestige that day. How dumb do you have to be?
Seems a little harsh. We all overlook things like this. Things like storage are so reliable we expect them to always be available. When you lay it out like you did, it does sound silly.
Two is one and one is none. This is not hard, but it's costly. The main problem with security and reliability is that they are expensive. Those operating in high-margin, highly specced spaces need to do it right or they lose. Everyone else is cargo-culting, box-checking for stakeholders, or selling snake oil. There is a fundamental tension between optimizing for cost and doing things right, and short-term gains will always come at the cost of mounting operational risks.
“mounting operational risks” is my main complaint with the “there is no maintenance” thread that was here the other day.
Maintenance is looking at all of the probability < 10^-4 issues that are just waiting for you to roll the dice enough times to eventually lose to the birthday problem. Every day you’re lowering the odds that tomorrow will be the day everything burns, because doing nothing is just a waiting game.
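To put rough numbers on it (mine, made up for illustration, not from the thread): with N independent sub-10^-4 issues in play, the chance of at least one blowing up over d days is 1 - (1 - p)^(N*d), and "rare" turns into "expected" surprisingly fast.

    # Rough sketch with assumed numbers: N independent latent issues,
    # each with daily failure probability p. Chance of at least one
    # incident over d days is 1 - (1 - p) ** (N * d).
    p = 1e-4   # per-issue, per-day probability (assumed)
    N = 50     # unaddressed latent issues (assumed)

    for days in (30, 365, 3 * 365):
        chance = 1 - (1 - p) ** (N * days)
        print(f"{days:5d} days: {chance:.1%} chance of at least one fire")

With those made-up numbers you're past coin-flip odds well within a year, which is exactly why ignoring maintenance is just a waiting game.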
Yeah, that was such a 'social media hot take' kind of thing that I didn't even feel like getting into that conversation. Misaligned interests make it attractive to get into a project, slash things to the bone, cut a fat bonus check for the savings achieved, and then cut and run. That's how we got the supply chain mess of 2021...
Our operations team didn’t have visibility because some other operations team told us not to worry our pretty little heads about it. You know how you can tell when someone is so mad that they stop talking? Two of our people hit that level. That was not a comfortable room to be in.
This experience ended up being the beginning of the end for the anti-cloud element at the company. Which is too bad because I like having people who understand the physics of our architecture. Saves me from doing all sorts of stupid things myself.
Yeah. For our ops documentation I also take more of a bottom-up approach to keeping it stable, which has resulted in the company having two documentation standards.
Pretty much all of our docs, and everything concerning more than one team, is stored in an instance of our own software running on the software platform. And it works well.
However, the core operational teams document their core operational knowledge in different git-based systems, all of which are backed up into the archive as well. This way, if we really lose access to all documentation, we've probably lost all of a team's workstations to some incident, plus a repository host, plus two archive hosts in different parts of Europe. At that point, the disaster recovery plan is a bar, to be honest.