> The nodes in an EBS cluster are connected to each other via two networks. The primary network is a high bandwidth network... The secondary network, the replication network, is a lower capacity network used as a back-up network... This network is not designed to handle all traffic from the primary network but rather provide highly-reliable connectivity between EBS nodes inside of an EBS cluster.
During maintenance instead of shifting traffic off of one of the redundant routers the traffic was routed onto the lower capacity network. There was human error involved but the network issue only provoked latent bugs in the system that should have been picked out during disaster recovery testing.
Automatic recovery that isn't properly tested is a dangerous beast; it can cause problems faster and broader than any team of humans are capable of handling.
During maintenance instead of shifting traffic off of one of the redundant routers the traffic was routed onto the lower capacity network. There was human error involved but the network issue only provoked latent bugs in the system that should have been picked out during disaster recovery testing.
Automatic recovery that isn't properly tested is a dangerous beast; it can cause problems faster and broader than any team of humans are capable of handling.