I once wrote a new version of a config generator and pusher for a small part of a major service. I knew data pushes were the largest global outage vector at my company, so I wrote carefully conservative validation logic and unit tested it. But I never tested what the caller did when the validation failed, and I made a dumb mistake there: it pushed an empty file, which was worse than pushing the allegedly invalid config. Oops. That was a ~30-minute outage of the aspect of the service controlled by this config.
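To make the gap concrete, here is a minimal sketch of the kind of test that was missing. It is not the original code: `validate`, `generate_and_push`, and `ValidationError` are hypothetical names, and the validation rule is a stand-in. The point is only that the caller's failure path gets its own test, and that on failure nothing is pushed.

```python
class ValidationError(Exception):
    """Raised when a generated config fails validation (hypothetical)."""


def validate(config: str) -> None:
    # Stand-in for the real, more conservative checks:
    # reject empty or whitespace-only configs.
    if not config.strip():
        raise ValidationError("config is empty")


def generate_and_push(generate, push) -> bool:
    """Push the generated config only if it validates.

    Returns True if a push happened, False if it was skipped.
    """
    config = generate()
    try:
        validate(config)
    except ValidationError:
        # Fail closed: keep the last good config in place and push nothing,
        # in particular not an empty file.
        return False
    push(config)
    return True


# Exercise the caller's failure path, not just validate() itself.
pushed = []
assert generate_and_push(lambda: "   ", pushed.append) is False
assert pushed == []  # nothing was pushed on validation failure
assert generate_and_push(lambda: "key: value", pushed.append) is True
assert pushed == ["key: value"]
```

Passing `push` in as a parameter is just a convenience here so the test can record what would have been pushed without touching a real system.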
Of course an outage is never caused by one mistake. That mistake was mine, so I felt bad about it. There were also mistakes in code reviews, in validation in the part receiving the config, and in operational procedures. And then the big one: the company as a whole was in this awkward phase where everyone knew quick global pushes were bad, but there wasn't good common tooling to make staged pushes of config files easy. That was the worst mistake, behind dozens if not hundreds of major outages.