I think you guys are considering this from the wrong angle... Rollbacks should *...

Silhouette · on July 13, 2019

This is one of those ideas that looks simple enough until you actually have to do it, and then you realise all the problems with it.

For example, in order to avoid any possibility of data loss at all using such a system, you need to continue running all of your transactions through the previous version of your system as well as the new version until you're happy that the performance of the new version is satisfactory. In the event of any divergence you probably need to keep the output of the previous version but also report the anomaly to whoever should investigate it.

But then if you're monitoring your production system, how do you make that decision about the performance of the new version being acceptable? If you're looking at metrics like conversion rates, you're going to need a certain amount of time to get a statistically significant result if anything has broken. Depending on your system and what constitutes a conversion, that might take seconds or it might take days. And you can only make a single change, which can therefore be rolled back to exactly the previous version without any confounding factors, during that whole time.

And even if you provide a doubled-up set of resources to run new versions in parallel and you insist on only rolling out a single change to your entire system during a period of time that might last for days in case extended use demonstrates a problem that should trigger an automatic rollback, you're still only protecting yourself against problems that would show up in whatever metric(s) you chose to monitor. The real horror stories are very often the result of failure modes that no-one anticipated or tried to guard against.