I think you might miss the point here, in the same way as the KDE SAs did.
What folks tend to consider the meat of ops work often boils down to a big ol' boring checklist.
The problem is that you shouldn't just elect to skip a whole big section without some seriously good reasoning.
This isn't a slur on you or the KDE guys; hindsight is 20/20. I'm confident, though, that I'm not alone: there are plenty of other ops folks here who read the story and also felt that the described setup violated a deep principle, and it just made them feel ill at ease. These failure scenarios are not common, but they do happen often enough that we know to prepare for them.
As an example I'd point to how DBAs handle validation of replication - it's the same principle here.
Just for completeness, an example reason for not having proper restore procedures in place might be 'this is not the prime record copy of the data, and it takes less than 24h to regenerate it, therefore it will be out of scope during restore tests'.
> I think you might miss the point here, in the same way as the KDE SAs did.
Who're clearly not under the impression they can't make mistakes, considering TFA is a write-up of design flaws in a mirroring system :).
Look at it this way: If you think something obvious was overlooked, then it's good there's another report backing up your point. That's the value in everyone being open about their operations and experiences along the way - you only get better metrics for what works and what doesn't, in practice.
Yeah, it is good to see a writeup like this. I'm sure some of the servers I work with don't have proper backups, but I was cringing all the same, waiting for a discussion of why the central git server itself couldn't be restored from a backup.
They didn't have backups for the server, only mirrors of the content (including syncing project deletes to the mirrors). They had to rebuild the server and copy in the content from a mirror rather than being able to restore the server wholesale from a backup. Not having a backup is dangerous: you can't recover quickly, and you risk losing all your data.
There is no need for full system backups if you have backups of the relevant data and can rebuild the machine around the data. For instance, to restore a build server, you run a script that creates and provisions a new VM, clones a git repo containing the configuration of the CI server, and clones the repos it should build, and the server is ready to go. No manual actions and no full-system backups needed.
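Very roughly, something like this - script and repo names are made up, the point being that everything except the data itself is reproducible from code:

    #!/bin/sh
    # hypothetical rebuild script for a CI build server
    set -e
    ./provision-vm.sh ci-server                                      # stand-in for your cloud CLI / PXE provisioning
    git clone git://git.example.org/infra/ci-config.git /etc/ci     # the server's configuration lives in git
    git clone git://git.example.org/project/foo.git /srv/build/foo  # the repos it should build
    /etc/ci/apply.sh                                                 # apply the configuration
    # no full-system restore: the machine is rebuilt, the data is re-cloned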
The system is designed such that the master repositories should never actually lose objects (even with force pushes and branch deletions, the admins make backup copies of the HEAD branch before letting those run so that the blobs remain in the repo).
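In hook terms that idea boils down to something like the sketch below (simplified, not KDE's actual hook):

    #!/bin/sh
    # pre-receive: stdin supplies "<old-sha> <new-sha> <refname>" for each updated ref
    zero=0000000000000000000000000000000000000000
    while read old new ref; do
        [ "$old" = "$zero" ] && continue                    # brand-new ref, nothing to preserve
        if [ "$new" = "$zero" ] || ! git merge-base --is-ancestor "$old" "$new"; then
            # deletion or forced update: keep the old tip reachable under refs/backups/
            git update-ref "refs/backups/$(echo "$ref" | tr / _)-$(date +%s)" "$old"
        fi
    done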
As it turns out, though, there are repo tarballs generated periodically that would have served as a perfectly acceptable backup, along with some other things the sysadmins could have done. The bigger shock for them was that git clone --mirror wouldn't actually run the git integrity checks (which they had mistakenly assumed it would).
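In other words, the mirror takes whatever the source sends; if you want the check you have to ask for it explicitly - something like this (repo URL is a placeholder, and the config knob's behaviour depends on your git version):

    git clone --mirror git://anongit.example.org/foo.git foo.git  # copies refs and objects as-is, no fsck
    git --git-dir=foo.git fsck --full                             # the integrity check has to be run explicitly
    # or opt in to object checking during transfers:
    git config --global transfer.fsckObjects true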
All of the data is migrated over to alternate storage in a way which is easily retrievable, and in a form where "restoring from backup" is frequently tested.
The thing that's missing is retention of old data, but I can tell you that is fraught with its own complications. A week-old repository tarball is almost worse than useless in the context of the git repository; we'd sooner restore that data by having a developer re-run "git push" than to lose a week's worth of development.
And that's assuming that a daily or weekly tarball isn't itself corrupt, which would have been the case here unless we ran git-fsck before making the copy (which is what was thought to have been getting run in some fashion in the first place).
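For what it's worth, the "fsck before copying" step is basically a one-line gate, e.g. (sketch only, paths made up):

    # only roll the tarball if the repo passes an integrity check
    if git --git-dir=/srv/git/foo.git fsck --full; then
        tar czf "/backups/foo-$(date +%F).tar.gz" -C /srv/git foo.git
    else
        echo "fsck failed for foo.git; keeping the previous snapshot" >&2
    fi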
I do fully agree that there needs to be more intelligence on the anongit side of the servers if they're to be used as viable backups instead of just sync destinations, but everyone keeps mentioning solutions to problems we don't have or null solutions to problems we actually have.
Despite what everyone seems to think we have multiple other backups of the source data (including tarball-style), but they're all crap in comparison to being able to recover from anongit.
> A week-old repository tarball is almost worse than useless in the context of the git repository; we'd sooner restore that data by having a developer re-run "git push" than to lose a week's worth of development.
The risk of using mirroring rather than versioned backups is that you lose all the data when a deletion is mirrored.
Yes, which is why the mirrors were affected and not the thousands of individual developers' clones, nor the existing tarball snapshots.
And even that is because of a deliberate decision on the sysadmins' part, based on a misunderstanding of how git clone --mirror responds to a corrupt repo, not some simple oversight. Which is to say, countermeasures will be put in place for that as well.
I do wish people understood why relying only on even 2-week-old backups is unacceptable in the context of a large, active FOSS project's source code repository; it's not like it's OK to simply start over again from KDE 4.9.4.
>> I do wish people understood why relying only on even 2-week-old backups is unacceptable
Yes, but I'm not convinced you see that this is EXACTLY what you are exposing yourself to.
What if next time it's a (nasty) bug in git? A push causes corruption perhaps?
Drop the idea of using git itself to host the backup strategy. Switch to plain old backups; if space (or performance - I'd wager the KDE git repos must total a good few hundred GB, if not more, if there's artwork or other binaries in there too) is an issue, there would be nothing wrong with incrementals for the */30 min backups.
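Even something as crude as rsync hardlink snapshots would cover the */30 min case (paths are illustrative):

    NOW=$(date +%Y%m%d-%H%M)
    rsync -a --link-dest=/backups/git/latest /srv/git/ "/backups/git/$NOW/"
    ln -sfn "/backups/git/$NOW" /backups/git/latest
    # unchanged packfiles are hardlinked, so each half-hourly snapshot only costs the delta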