I think you guys are considering this from the wrong angle...
Rollbacks should always be safe. They should always be automatically tested. So a software release should do a gradual rollout (ie. 1, 10, 100, 1000 servers), but it should also restart a few servers with the old software version just to check a rollback still works.
The rollout should fail if health checks (including checking business metrics like conversion rates) on the new release or old release fails.
If only the new release fails, a rollback should be initiated automatically.
If only the old release fails, the system is in a fragile but still working state for a human to decide what to do.
This is one of those ideas that looks simple enough until you actually have to do it, and then you realise all the problems with it.
For example, in order to avoid any possibility of data loss at all using such a system, you need to continue running all of your transactions through the previous version of your system as well as the new version until you're happy that the performance of the new version is satisfactory. In the event of any divergence you probably need to keep the output of the previous version but also report the anomaly to whoever should investigate it.
But then if you're monitoring your production system, how do you make that decision about the performance of the new version being acceptable? If you're looking at metrics like conversion rates, you're going to need a certain amount of time to get a statistically significant result if anything has broken. Depending on your system and what constitutes a conversion, that might take seconds or it might take days. And you can only make a single change, which can therefore be rolled back to exactly the previous version without any confounding factors, during that whole time.
And even if you provide a doubled-up set of resources to run new versions in parallel and you insist on only rolling out a single change to your entire system during a period of time that might last for days in case extended use demonstrates a problem that should trigger an automatic rollback, you're still only protecting yourself against problems that would show up in whatever metric(s) you chose to monitor. The real horror stories are very often the result of failure modes that no-one anticipated or tried to guard against.
Rollbacks should always be safe. They should always be automatically tested. So a software release should do a gradual rollout (ie. 1, 10, 100, 1000 servers), but it should also restart a few servers with the old software version just to check a rollback still works.
The rollout should fail if health checks (including checking business metrics like conversion rates) on the new release or old release fails.
If only the new release fails, a rollback should be initiated automatically.
If only the old release fails, the system is in a fragile but still working state for a human to decide what to do.