
Remember when GitLab had their famous DB incident? That spawned a sort of inside joke at my then-workplace: if you're gonna do something big and potentially prod-breaking, just "don't be _that_ guy" (said in the same spirit as "break a leg").

I became _that_ guy.

My then-workplace didn't always have enough funds, though as an employer they were generally generous, especially considering their actual finances. This is relevant to the story because this employer:

1. was very lenient when it came to office attendance. So we frequently worked remotely at odd hours; that was normal. But as a matter of professionalism, I always tried to be conscientious about the hours I put in. Most weeks I probably put in more than expected, the merits of which are another discussion entirely.

2. periodically organized events to promote the business. But being short on funds, they couldn't hire an actual photographer. So they'd ask me to shoot, because I was interested enough in photography to, at the very least, have the gear for it.

The day I became _that_ guy, they had an event I was supposed to shoot, but they communicated the timing to me badly. I expected to get in at least three, maybe four, hours of work before I was needed with my camera. This is what I communicated to my TL.

Turns out they needed me _earlier_, so I'd only gotten an hour of work done by then. Again, office culture was lenient about such things, so my TL didn't really mind that I left. The event was a pretty big deal besides.

I'd generally start my "hours" in the afternoon, way after lunch. So by the time this event was done, it was already pretty late in the evening. I had my dinner and received a message from my TL. Non-verbatim:

"Hey can you update PostgreSQL (9->10) tonight? It shouldn't take too long and here's the steps..."

It was still within my "usual" working hours, but a couple of things that night made this request end in disaster:

1. I was tired from the event. Honest to goodness tired. I should've called it off when I couldn't even entertain myself enough to stay awake waiting for one of the given steps to finish. But I didn't because...

2. I didn't have the heart to beg off the task when I'd only done one hour of technical/engineering work for the day. To be fair, my TL always abided by the rule "Don't touch prod when tired; you will make things worse." Pretty sure he would've understood if I'd explained the state I was in. We could've done it the next night. But when you're tired and embarrassed at having done only one hour of work so far, your decision making is exceptionally unsound, for lack of a stronger adjective.

Unfortunately the technical bits of this story get fuzzy; it's been two years. But back then we had just migrated to Kubernetes, and a couple of months in, the team was still adjusting their mental models from servers to containers/deployments/statefulsets/pods, from thinking about HDD vs. SSD tradeoffs to Persistent Volume architecture issues. This is also why upgrading Postgres was such an ad hoc process for us then. We simply didn't know better (if something not "ad hoc" even exists).

Part of the instructions was to "delete the old data directory of Postgres" (cue: I have read this in a postmortem before...). Because I was tired and lazy, I wrote a script so the upgrade could run without my (much needed!) supervision. The instructions were sound, and the deletion would've been safe, assuming all the steps prior to it finished successfully. They did not, and I did not use `set -e`. Which meant I just deleted all the prod data on the master. I was efficient. The realization woke me up harder than sugar ever did.
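To make the failure mode concrete, here's a hypothetical reconstruction (not the actual script; names and paths are made up) of how a shell script without `set -e` keeps going after a failed step:

    #!/usr/bin/env bash
    # Hypothetical reconstruction; the real script is long gone.
    # Note the crucial missing line:
    # set -e

    OLD_DATA=/var/lib/postgresql/9.6/data
    NEW_DATA=/var/lib/postgresql/10/data

    # Suppose this step fails partway through...
    /usr/lib/postgresql/10/bin/pg_upgrade \
      --old-datadir "$OLD_DATA" --new-datadir "$NEW_DATA" \
      --old-bindir  /usr/lib/postgresql/9.6/bin \
      --new-bindir  /usr/lib/postgresql/10/bin

    # ...without set -e, bash shrugs and keeps going, so this still runs
    # and deletes the only local copy of the data:
    rm -rf "$OLD_DATA"

A single `set -e` at the top (or an explicit `|| exit 1` after each step) would have stopped the script before it ever reached the `rm`.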

To cut this already long story short, I at least had the sense to concede at that point and wake up my TL with the bad news. Much like the rest of this story, what saved me that night came in twos:

1. I at least had the sense to put the site into maintenance mode.

2. I used `rm -rf` rather than issuing DROP statements through psql, which meant my fuck-up did not replicate. So we just promoted the replica to master, demoted the old master to a replica, and monitored replication (roughly as in the sketch below).
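I don't remember the exact commands we ran, so the hostnames, paths, and flags below are assumptions about how such a streaming-replication failover typically goes:

    # On the surviving replica: promote it to be the new primary.
    pg_ctl promote -D /var/lib/postgresql/data

    # On the wrecked old master: rebuild it from the new primary as a replica.
    # (-R writes the recovery settings, -X stream ships WAL during the base backup.)
    pg_basebackup -h new-primary.internal -U replicator \
                  -D /var/lib/postgresql/data -R -X stream

    # Then watch replication catch up from the new primary:
    psql -c 'SELECT * FROM pg_stat_replication;'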

These two together ensured no data loss. Apocalypse canceled. Everyone in the company went to work in the morning none the wiser.

This story actually had a less fortunate sequel but that story is not for me to tell. And besides, I've written long enough.


