In a major production launch, we moved traffic between two versions of a backend with a blue/green deploy. The new version was hosted on Kubernetes, and I was pretty new to using it in production. The changeover itself went well, pretty great, actually. The problem came up the first time we deployed to the new infrastructure: we saw a huge spike of dropped connections. We never got a good answer why at the time, beyond a vague sense that the deployment had gone a lot faster than we intended.

The second time we deployed, I happened to glance at the deployment size immediately after deploying. For about five seconds, our replica count went from 100 down to 2. The reason was simple: the replicas count was hardcoded in the deployment spec, set to the size we used in our staging infra. In prod that value was quickly overridden by our autoscaling configuration, which is why it had seemed fine, but applying it caused Kubernetes to take down every existing pod except two, then bring up a wave of new pods very quickly.
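For illustration, here's a minimal sketch of the foot-gun, with hypothetical names. The only detail that matters is a hardcoded spec.replicas in a Deployment that also has an autoscaler managing the same count:

    # Hypothetical manifest; names and image are placeholders.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: backend
    spec:
      replicas: 2   # sized for staging; applying this in prod scales 100 -> 2
      selector:
        matchLabels:
          app: backend
      template:
        metadata:
          labels:
            app: backend
        spec:
          containers:
          - name: backend
            image: example/backend:v2

The fix the Kubernetes docs recommend when an HPA owns the replica count is to delete spec.replicas from the manifest entirely. One caveat: if the field was present in a previous client-side kubectl apply, simply removing it makes the next apply drop the field and fall back to the default of 1, so the docs describe a migration step (e.g. kubectl apply edit-last-applied to remove it from the recorded state first).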