> the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration. [...] We didn’t want to increase the operating system limit without further testing
Is it because operating system configuration is managed by a different team within the organization?
More likely they need to understand what effect changing the thread limit would have. For example, it could increase kernel memory usage or scheduler latency. It’s not something you want to mess with in an outage.
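To make "the thread limit" a bit more concrete: the quoted postmortem doesn't say which setting was hit, so this is only a sketch, assuming a Linux host. It reads the settings that commonly cap thread counts there (`kernel.threads-max`, `kernel.pid_max`, `vm.max_map_count`, and the per-user `RLIMIT_NPROC`). The helper names `read_int` and `thread_limits` are made up for illustration.

```python
import resource
from pathlib import Path


def read_int(path):
    """Return the integer stored in a /proc file, or None if it can't be read."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def thread_limits():
    """Collect Linux settings that commonly cap thread counts (assumed, not from the postmortem)."""
    limits = {
        "threads-max": read_int("/proc/sys/kernel/threads-max"),
        "pid_max": read_int("/proc/sys/kernel/pid_max"),
        "max_map_count": read_int("/proc/sys/vm/max_map_count"),
    }
    # RLIMIT_NPROC is per-user and counts threads as well as processes on Linux.
    rlimit = getattr(resource, "RLIMIT_NPROC", None)
    if rlimit is None:
        limits["rlimit_nproc_soft"] = None
    else:
        soft, _hard = resource.getrlimit(rlimit)
        # None here means "unlimited" rather than "unreadable".
        limits["rlimit_nproc_soft"] = None if soft == resource.RLIM_INFINITY else soft
    return limits


if __name__ == "__main__":
    for name, value in thread_limits().items():
        print(f"{name}: {value if value is not None else 'unavailable or unlimited'}")
```

Reading these is harmless. Raising them is the part you'd want to test first, since several of them interact and the lowest one wins.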
If you start haphazardly changing things while firefighting, without testing, you might make things even worse. And there are worse things than downtime: for instance, the system could appear to work while silently corrupting customer data.