"Additionally we had been running with far too much delay in vacuuming which meant lower load on the system, but more idle time in maintenance."
So vacuum wasn't given the resources to keep up with their load, and it's not clear whether they were supplementing with manual vacuums during quiet times. Nor is it clear when they started reacting (PostgreSQL, as outlined in their documentation link, will start squawking about wraparound well before it shuts down), or the monitoring they had in place for whether autovacuum/vacuum was keeping up with the workload, or the number of locks their application was taking (locks can block autovacuum).
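For reference, a minimal check along these lines (the xid_age alias is just illustrative) shows how close each database is getting to wraparound:

    -- Distance of each database from transaction ID wraparound; autovacuum
    -- starts forced anti-wraparound vacuums once this passes
    -- autovacuum_freeze_max_age (200 million transactions by default).
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;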
Adjusting the autovacuum settings per table also gives you finer control than the global postgresql.conf parameters, so you can better match the workload of specific tables.
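As a rough sketch (the table name and values are hypothetical, not Sentry's actual settings):

    -- Make autovacuum kick in earlier and run unthrottled on one hot table,
    -- overriding the global postgresql.conf defaults only for it.
    ALTER TABLE events SET (
        autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% of rows change
        autovacuum_vacuum_cost_delay = 0        -- no cost-based throttling here
    );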
Partitioning would also have helped keep the actively written part of the table smaller; older data could be vacuumed once with VACUUM FREEZE, or simply deleted later. There are extensions to help make that easier:
https://github.com/keithf4/pg_partman
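For example (partition names hypothetical), a partition that no longer receives writes only needs to be frozen once, and retiring old data becomes a quick DROP rather than a bloat-generating DELETE:

    -- Freeze everything in a partition that is no longer written to, so later
    -- anti-wraparound vacuums have little or nothing left to do on it.
    VACUUM FREEZE events_2015_06;
    -- Old data can be retired wholesale instead of row by row.
    DROP TABLE events_2015_05;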
Josh Berkus of PgExperts has some great advice here:
https://dba.stackexchange.com/questions/21068/aggressive-aut...
So does Jim Nasby, here:
http://bluetreble.com/2014/10/postgres-mvcc-and-a-look-at-va...
Regarding monitoring, check_postgres.pl would have given them an earlier indication that their vacuuming settings needed adjustment, or that their application locking needed to change so it wasn't blocking autovacuum:
https://bucardo.org/check_postgres/check_postgres.pl.html#tx...
https://bucardo.org/check_postgres/check_postgres.pl.html#la...
https://bucardo.org/check_postgres/check_postgres.pl.html#lo...
It's easy to mention things after the fact, though, and it's good they got things up and running again for their customers.
"PostgreSQL ... will start squawking about wraparound ... or the monitoring they had in place"
If only there were a tool or service they could use to monitor log files for signs of problems. (I couldn't resist being a little snarky; the situation seems so perfect.)
All kidding aside, kudos to Sentry for being so publicly candid about their problems. We've seen so many companies avoid providing any technical information; about the best we'll hear in some of those cases is that the problem was "not terrorism related".
We actually knew about the problem with the delay and had been working to improve it. We were a couple of days away from failing over to the new hardware (safely), and unfortunately we didn't have any early warnings in the logs. I haven't yet looked into why.
We had it at 50ms on the previous setup, though I wish I knew why that value was used. Likely it was a default in the Chef cookbook we forked off of, or we read something at the time that convinced us it was a good idea.
"Additionally we had been running with far too much delay in vacuuming which meant lower load on the system, but more idle time in maintenance."
So, vacuum wasn't given the resources to keep up with their load, and it's not clear if they were supplementing with manual vacuums during quiet times. Nor was it clear when they started reacting, as PostgreSQL (also outlined in their documentation link), will start squawking about wraparound well before it shuts down, or the monitoring they had in place for whether or not autovacuum/vacuum was keeping up with the workload, or the number of locks their application was taking (locks can block autovacuum).
Adjusting the autovacuum settings by table will give you finer control over the postgresql.conf parameters to better match the workload for specific tables, as well.
Partitioning would have also helped make the actively written part of the table smaller, and older data could be vacuumed with VACUUM FREEZE, or deleted later. There are extensions to help make that easier.
https://github.com/keithf4/pg_partman
Josh Berkus of PgExperts has some great advice here:
https://dba.stackexchange.com/questions/21068/aggressive-aut...
So does Jim Nasby, here:
http://bluetreble.com/2014/10/postgres-mvcc-and-a-look-at-va...
Regarding monitoring, check_postgres.pl would give them an idea that their vacuuming settings needed adjustment sooner, or their application locking needed to be adjusted to not block autovacuum.
https://bucardo.org/check_postgres/check_postgres.pl.html#tx...
https://bucardo.org/check_postgres/check_postgres.pl.html#la...
https://bucardo.org/check_postgres/check_postgres.pl.html#lo...
It's easy to mention things after the fact, though, and it's good they got things up and running again for their customers.