
Your fail-safes are themselves the source of faults and failures.

I've seen, just to list a few:

Load balancers which failed due to software faults (they'd hang and reboot, fortunately fairly quickly, but still causing ~40-second outages), backup batteries which failed, backup generators which failed, fire-detection systems which tripped, generator fuel supplies which clogged due to algae growth, power transfers which failed, failover systems which didn't, failover systems which did (when there wasn't a failure to fail over from), backups which weren't, password storage systems which were compromised, RAID systems which weren't redundant (typically critical drive failures during rebuild or in degraded mode), far too many false alerts from notification systems (a very common problem even outside IT: http://redd.it/1x0p1b on hospital alarms), and disaster recovery procedures which were incomplete, out of date, or otherwise in error.

That's all direct personal experience.


