To be honest I was very surprised to hear what a cache SRE was working on. It sounded like he had to build all of handling of hardware issues, rack awareness and other basic datacenter stuff himself. Does it mean that every specialized team also had to do it? Why would cache engineer need to know about hardware failures at all, its datacenter team's responsibility to detect and predict issues and shutdown servers gracefully if possible. It should be completely abstracted from cache SRE, like cloud abstracts you from it. Yet he and is team spends years on automation around this stuff using Mesos stack that they probably regret adopting by now.
I feel like in this zoomed in case of twitter caches what they were working on is questionable, but the team size seems to be adequate to the task, so my takeaway is that like any older, larger company Twitter accumulated fair amount of tech debt and there is no one to take large scale initiative to eliminate it.