I think you might miss the point here, in the same way as the KDE SAs did.
What folks tend to consider the meat of ops work often boils down to a big ol' boring checklist.
The problem is that you shouldn't just elect to skip a whole big section without some seriously good reasoning.
This isn't a slur on you or the KDE guys; hindsight is 20/20. I'm confident, though, that I'm not alone: there are plenty of other ops folks here who read the story and also felt that the described setup violated a deep principle, and it just made them feel ill at ease. These failure scenarios are not common, but they do happen often enough that we know to prepare for them.
As an example I'd point to how DBAs handle validation of replication - it's the same principle here.
Just for completeness, an example reason for not having proper restore procedures in place might be 'this is not the prime record copy of the data, and it takes less than 24h to regenerate it, therefore it will be out of scope during restore tests'.
> I think you might miss the point here, in the same way as the KDE SAs did.
Who're clearly not under the impression they can't make mistakes, considering TFA is a write-up of design flaws in a mirroring system :).
Look at it this way: If you think something obvious was overlooked, then it's good there's another report backing up your point. That's the value in everyone being open about their operations and experiences along the way - you only get better metrics for what works and what doesn't, in practice.
Yeah, it is good to see a writeup like this. I'm sure some of the servers I work with don't have proper backups, but I was cringing all the same, waiting for a discussion of why the central git server itself couldn't be restored from a backup.
They didn't have backups for the server, only mirrors of the content (including syncing project deletes to the mirrors). They had to rebuild the server and copy in the content from a mirror rather than being able to restore the server wholesale from a backup. Not having a backup is dangerous: you can't recover quickly, and you risk losing all your data.
There is no need for full system backups if you have backups of the relevant data and can rebuild the machine around the data. For instance, to restore a build server, you run a script that creates and provisions a new VM, clones a git repo containing the configuration of the CI server, and clones the repos it should build, and the server is ready to go. No manual actions and no full-system backups needed.
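Very roughly, something like this - script and repo names are made up, the point being that everything except the data itself is reproducible from code:

    #!/bin/sh
    # hypothetical rebuild script for a CI build server
    set -e
    ./provision-vm.sh ci-server                                      # stand-in for your cloud CLI / PXE provisioning
    git clone git://git.example.org/infra/ci-config.git /etc/ci     # the server's configuration lives in git
    git clone git://git.example.org/project/foo.git /srv/build/foo  # the repos it should build
    /etc/ci/apply.sh                                                 # apply the configuration
    # no full-system restore: the machine is rebuilt, the data is re-cloned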
The system is designed such that the master repositories should never actually lose objects (even with force pushes and branch deletions, the admins make backup copies of the HEAD branch before letting those run so that the blobs remain in the repo).
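In hook terms that idea boils down to something like the sketch below (simplified, not KDE's actual hook):

    #!/bin/sh
    # pre-receive: stdin supplies "<old-sha> <new-sha> <refname>" for each updated ref
    zero=0000000000000000000000000000000000000000
    while read old new ref; do
        [ "$old" = "$zero" ] && continue                    # brand-new ref, nothing to preserve
        if [ "$new" = "$zero" ] || ! git merge-base --is-ancestor "$old" "$new"; then
            # deletion or forced update: keep the old tip reachable under refs/backups/
            git update-ref "refs/backups/$(echo "$ref" | tr / _)-$(date +%s)" "$old"
        fi
    done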
As it turns out, though, there are repo tarballs generated periodically that would have served as a perfectly acceptable backup, along with some other things the sysadmins could have done. The bigger shock for them was that git clone --mirror wouldn't actually run the git integrity checks (which they had mistakenly assumed it would).
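In other words, the mirror takes whatever the source sends; if you want the check you have to ask for it explicitly - something like this (repo URL is a placeholder, and the config knob's behaviour depends on your git version):

    git clone --mirror git://anongit.example.org/foo.git foo.git  # copies refs and objects as-is, no fsck
    git --git-dir=foo.git fsck --full                             # the integrity check has to be run explicitly
    # or opt in to object checking during transfers:
    git config --global transfer.fsckObjects true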
All of the data is migrated over to alternate storage in a way which is easily retrievable, and in a form where "restoring from backup" is frequently tested.
The thing that's missing is retention of old data, but I can tell you that is fraught with its own complications. A week-old repository tarball is almost worse than useless in the context of the git repository; we'd sooner restore that data by having a developer re-run "git push" than to lose a week's worth of development.
And that's assuming that a daily or weekly tarball isn't itself corrupt, which would have been the case here unless we ran git-fsck before making the copy (which is what was thought to have been getting run in some fashion in the first place).
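For what it's worth, the "fsck before copying" step is basically a one-line gate, e.g. (sketch only, paths made up):

    # only roll the tarball if the repo passes an integrity check
    if git --git-dir=/srv/git/foo.git fsck --full; then
        tar czf "/backups/foo-$(date +%F).tar.gz" -C /srv/git foo.git
    else
        echo "fsck failed for foo.git; keeping the previous snapshot" >&2
    fi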
I do fully agree that there needs to be more intelligence on the anongit side of the servers if they're to be used as viable backups instead of just sync destinations, but everyone keeps mentioning solutions to problems we don't have or null solutions to problems we actually have.
Despite what everyone seems to think we have multiple other backups of the source data (including tarball-style), but they're all crap in comparison to being able to recover from anongit.
> A week-old repository tarball is almost worse than useless in the context of the git repository; we'd sooner restore that data by having a developer re-run "git push" than to lose a week's worth of development.
The risk of using mirroring rather than versioned backups is that you lose all the data when a deletion is mirrored.
Yes, which is why the mirrors were affected and not the thousands of individual developers' clones, nor the existing tarball snapshots.
And even that is because of a deliberate decision on the sysadmins' part, based on a misunderstanding of how git clone --mirror responds to a corrupt repo, not some simple oversight. Which is to say, countermeasures will be put in place for that as well.
I do wish people understood why relying only on even 2-week-old backups is unacceptable in the context of a large, active FOSS project's source code repository; it's not like it's OK to simply start over again from KDE 4.9.4.
>> I do wish people understood why relying only on even 2-week-old backups is unacceptable
Yes, but I'm not convinced you see that this is EXACTLY what you are exposing yourself to.
What if next time it's a (nasty) bug in git? A push causes corruption perhaps?
Drop the idea of using git itself to host the backup strategy. Switch to plain old backups; if space (or performance - I'd wager the KDE git repos must total a good few hundred GB, if not more, if there's artwork or other binaries in there too) is an issue, there would be nothing wrong with incrementals for the */30 min backups.
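Even something as crude as rsync hardlink snapshots would cover the */30 min case (paths are illustrative):

    NOW=$(date +%Y%m%d-%H%M)
    rsync -a --link-dest=/backups/git/latest /srv/git/ "/backups/git/$NOW/"
    ln -sfn "/backups/git/$NOW" /backups/git/latest
    # unchanged packfiles are hardlinked, so each half-hourly snapshot only costs the delta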