The number of commenters that think the KDE sysadmins were stupid enough to not know that "raid isn't a backup strategy" is depressing. You either didn't read the article or you haven't understood it.
Git repositories contain redundant information to perform consistency checks. If a bit flips in one of your repos, git fsck will immediately catch it. The KDE sysadmins thought these consistency checks were triggered when keeping a repository clone in sync and thus any FS level corruption would be caught at the first subsequent attempt to sync.
If you think "Gee, if I know this, surely they do?!" then perhaps the answer is: "They probably do and this issue is more subtle than you initially thought. Reread carefully before implying how dumb others were."
Relying on consistency checks will not save you if someone does the git equivalent of rm -rf on the master repository. Which is why mirroring (of any description) is not a backup strategy.
Yes, git --mirror should probably automatically invoke the moral equivalent of git fsck by default in order to catch internal repository inconsistency like the other git commands do, and the KDE team have been caught out by this. But they still don't have any protection against user error leading to loss of data with this setup as far as I can see, and that seems like a huge oversight.
> Relying on consistency checks will not save you if someone does the git equivalent of rm -rf on the master repository.
The master repository is coded so that those types of actions are not permitted. Developers cannot force push or delete branches that aren't reintegrated into some other head, without assistance from the sysadmins.
After every routine push to a KDE git repository the new commit will be a descendant of the repo's original head.
With that in mind it's more understandable why they felt it was possible to take advantage of Git's design in the backup planning.
No "huge oversight" here, at least on this particular issue.
Ummm, still not a backup. What if there's an error in the code? What if an admin runs a script directly on the server and resets all repos by mistake? What if a bad release of git comes out and corrupts the master? What if a hacker comes in and wipes them all out?
Only a backup is a backup. Backup, like security, is somewhat onion like.
I never claimed it was a backup, only that it wasn't completely susceptible to user error as was surmised in the comment I replied to.
However, regarding the backup question, an rsync backup would have been just as damaging to the anongit mirrors as git --mirror was. The whole point of git for KDE is that the "distributed" part of the VCS would help handle backups. And it has; that part has not really been in question.
The 'luck' came in where there happened to be an anongit mirror that was fully synced up so that we didn't have to crowdsource repo restoration, which saved a lot of time and anguish.
Had we at KDE ensured that the git repository being synced to an anongit mirror was fully consistent we wouldn't even be speaking about this: the git.kde.org repos would be shut down until the box could be restored and we'd have any of 5 easy backups to choose from to restore the repositories (the rest of the files on the box would be restored from the normal backups used).
I want to stress that this is the larger point here: It's possible in some ways to corrupt a git repository and have its subcommands not notice. You must use the provided git-fsck (directly or indirectly) before backing up a git repository, especially if you don't use git for the backup, or use git-clone --mirror.
The error wasn't that we weren't doing backups, the error was that we were making corrupted backups. tar | /dev/tape will do this to you just as badly if you get the right FS corruption.
COW snapshotting filesystems can help (if they have no bugs) but the KDE sysadmins were working under the errant assumption that git would make the integrity check in situations where that wasn't true, not that backups are simply not required.
This is why, if your data is sufficiently important, you'll want to:
1) Test your backups, to detect when your backups are no longer backups.
2) Make geographically diverse backups, so a single tidal wave can't wipe out your data. For bonus points, have enough geographically diverse backups that the world is probably ending if they're all being wiped out -- at which point you have bigger problems to take care of.
3) Make backups with a diverse set of mechanisms, so the failure or compromise of one (or N-1) can't fail and compromise all backup copies. Making backups on write-once media and hiding them means a current failure or compromise can't also take out previous backups, and may help protect your data against theft, landlords, angry neighbors, spurned girlfriends, or even the occasional corrupt government official.
Mirroring (be it software or RAID) is not a backup system: It is far too dumb, far too happy to overwrite your old good data with new bad data. You want a history, where old good data is not replaced.
Git is not a backup system: it is a version control system. While it may have some of the properties of a backup system as goals, that is not its primary use case. As a result we get articles like this one, where we've seen how git can fail to achieve the goals of a backup system as a practical matter, even when intentionally used as a poor man's backup system in the form of mirrors.
Such problems are not unique to git, of course. On a personal note, I've managed to wipe data with both git and perforce in moments of weakness. If you want to treat me kindly about it, you could say I used both to the point where the statistics were against me not shooting myself in the foot. And, fortunately so far, the use of proper, separate backup mechanisms has always allowed me to restore the majority of my data and left me relatively unscathed.
That's kind of ridiculous. The point that most people are making is that if someone does something incredibly stupid, or there is corruption in the system that follows down the line (like what happened here), it doesn't matter whether you have a repository.
A backup of a corrupt repository would have been just as corrupt though.
This is the big thing I can't figure out why people are not understanding: git does consistency checking for you already, tar|rsync|etc. don't, so it makes sense to take advantage of that.
What we had was an instance of some of the underlying data becoming corrupt on the filesystem (with indications of that starting on Feb 22!). The big mistake was considering the source repositories as consistent and canonical at the remote anongit end, but the data would have been just as corrupt if we had scp'ed the repos from git.kde.org to the anongit mirrors around the world, since we would have bypassed git's internal checking in that way.
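The way to actually get git's checking into the pipeline is to put an explicit verification pass in front of the sync. A minimal sketch of what such a gate could look like (repository paths and the sync command are hypothetical, not the actual KDE tooling):

    #!/bin/sh
    # Hypothetical pre-sync gate: refuse to update mirrors if any repo fails fsck.
    set -e
    REPO_ROOT=/srv/git/repositories      # hypothetical path

    for repo in "$REPO_ROOT"/*.git; do
        if ! git --git-dir="$repo" fsck --full --strict >/dev/null 2>&1; then
            echo "fsck FAILED for $repo -- aborting sync" >&2
            exit 1
        fi
    done

    # Only reached if every repository verified cleanly.
    /usr/local/bin/push-to-anongit-mirrors   # hypothetical sync command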
Is it safe to rsync a running mysql database at random times, or are you supposed to use mysql-provided tools to perform a backup?
OK, but what stops them from daily performing a mirror clone, checking it for consistency, then backing that up? As mentioned in the linked update, 30 complete backups would consume only 900GB, so you could keep weeks of daily backups, plus weekly and/or monthlies going back much further, for a terabyte of space. That way, in the worst case, you could go back to a backup before the corruption began. Obviously you would want to have plenty of safeguards in place so that that never happened, but just in case, it's good to have an honest to goodness backup too.
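A minimal sketch of that kind of rotation (the paths and the 30-day retention are hypothetical, just to make the idea concrete):

    #!/bin/sh
    # Hypothetical daily backup: mirror-clone, verify, tar with a date stamp,
    # and expire copies older than ~30 days.
    set -e
    SRC=/srv/git/repositories
    DST=/backup/git
    DAY=$(date +%F)

    mkdir -p "$DST/$DAY"
    for repo in "$SRC"/*.git; do
        name=$(basename "$repo")
        # --no-hardlinks so the copy doesn't share object files with the source
        git clone --quiet --no-hardlinks --mirror "$repo" "$DST/$DAY/$name"
        git --git-dir="$DST/$DAY/$name" fsck --full    # abort loudly on corruption
    done

    tar -czf "$DST/git-backup-$DAY.tar.gz" -C "$DST" "$DAY"
    rm -rf "$DST/$DAY"
    find "$DST" -name 'git-backup-*.tar.gz' -mtime +30 -delete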
> Relying on consistency checks will not save you if someone does the git equivalent of rm -rf on the master repository.
There is no equivalent of rm -rf in git. If you remove files being versioned and commit them, the previous version still exists and the commit can be reverted. If you remove the metadata, no sync can happen, you are immediately warned (assuming good monitoring was set up) and can restore from one of the clones. If the entire machine crashes and burns, you whip out your git-server-create.sh, which provisions a new VM and restores from one of the clones. There is no additional data to be backed up with git. Any single developer with a recent clone of the repos can set up a new master git server.
It is impossible to blindly use the origin refs in that case. Git doesn't deal well with rewriting history. If someone force pushes to a centrally shared repo, all hell breaks loose.
> New refs are pulled down to the downstream repository; all updates, including forced updates, are mirrored to the downstream repository. As a result, making a mirror clone essentially bypasses the safety checks in the repository
And
> If someone force pushes to a centrally shared repo, all hell breaks loose.
This would happen to users pulling/pushing from the central repo, but not to the mirrors.
In that case another developer would notice, would get in touch with his fellow developers, they would agree on a specific repository state to force push, and all is well and back to normal. There is no need to restore anything from backups or mirrors in that scenario.
So that's an entirely different threat. On the one hand we have FS corruption, which is basically bound to happen and is likely to impact many repositories. On the other hand we have someone (maliciously) force pushing an update on a single repository that throws away all commits.
The first issue is one that needs a backup and restore procedure. The second is one that is at best an inconvenience and at worst a security problem.
Unless you want to include the worst case scenario of someone gaining access to the machine and force pushing such a destructive update on all repos. In that case restoring from developer copies may actually be safer anyway, because the machine and all its backups, be they mirrors or otherwise, may be considered compromised. That's a nightmare for which this discussion about backups is just quibbling.
Surely every backup system has some equivalent of an rm -rf? A disgruntled employee could phone the off-site tape archival company and tell them to toss all the tapes in the shredder.
In the specific case of offsite tape archival, let's say someone who has the authority to do so requests that they destroy all tapes.
Most service providers (who stay in business) have enough compliance features in place so that multiple authorized people have to be in collusion, a sufficiently senior executive must make the request, or there will be a "package" ready to be served to the client and relevant policing unit (police, FBI, whatever) so that charges can be brought against the malefactor efficiently.
While you may be limited in financial compensation by your contract with the service provider, it is absolutely in their best interest to avoid the situation with procedures (they do not want to be a party to a crime) and if those fail, to provide extensive records of the movement and disposition of those tapes.
(This is missing the wider point, but regardless: AFAIK (do correct me if I'm wrong) git clone --mirror will not copy dangling references & those refs will disappear from the original after a git gc, whenever that happens.)
Human error is human error: sometimes people will do apparently egregiously stupid things if it's possible for them to do so, including things like deleting branches from a repository & then doing a git gc to "save space" shortly before realising that they weren't issuing commands to the shell they thought they were in. A backup strategy should be robust against dumb human mistakes as well as mechanical hardware errors, otherwise it's no backup strategy at all.
They say in the article that one step of the anongit sync is to delete projects from the anongit servers which have been deleted on the central server. If a user accidentally deletes a project, they should be able to recover it from a backup, but it doesn't sound like there are any backups.
The same idea applies to the individual repos themselves. A git gc will delete objects unreferenced by the reflogs, e.g. deleted branches over a month old by default. They say that the total set of repos is about 25GB, so it would be reasonable to keep at least monthly backups for a year or more.
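For the gc side of that, the expiry windows are configurable. A minimal sketch, run inside each repository (or set globally on the server); the values here are purely illustrative, not what KDE actually uses:

    # Keep reflog entries (and the objects they reference) around much longer
    # before git gc is allowed to prune them. Values are illustrative only.
    git config gc.reflogExpire "1 year"
    git config gc.reflogExpireUnreachable "90 days"
    git config gc.pruneExpire "90 days"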
Yea, that sounds like a good idea, can you do that in something like a pre-push hook? That plus long reflogs seems like it would make recovery of rebased/deleted branches really easy for end users.
We don't allow force-pushes without admin involvement, so rebasing isn't so scary for us, but we do cover it that way. And we got the undo operation down to a command on the infra: http://community.kde.org/Sysadmin/GitKdeOrgManual#ohnoes
The rest of that Wiki page is also pretty interesting for a general look at what devs can do on git.kde.org - server-side personal clones, personal repositories, a personal trashcan for repositories, etc. We don't have a pretty web interface for it (yet?), but it has some nice features.
I think you might miss the point here, in the same way as the KDE SAs did.
What folks tend to consider the meat of ops work often boils down to a big ole boring checklist.
The problem is that you shouldn't just elect to skip a whole big section without some seriously good reasoning.
This isn't a slur on you or the KDE guys; hindsight is 20/20. I'm confident though that I'm not alone, that there are plenty of other Ops folks here who read the story and also felt the described setup violated a deep principle and just made them feel ill at ease. These failure scenarios are not common, but they do happen often enough that we know to prepare for them.
As an example I'd point to how DBAs handle validation of replication - it's the same principle here.
Just for completeness, an example reason for not having proper restore procedures in place might be 'this is not the prime record copy of the data and it takes less than 24h to regenerate this data therefore this will be out of scope during restore tests'.
> I think you might miss the point here, in the same way as the KDE SAs did.
Who're clearly not under the impression they can't make mistakes considering TFA is a write-up of design flaws in a mirroring system :).
Look at it this way: If you think something obvious was overlooked, then it's good there's another report backing up your point. That's the value in everyone being open about their operations and experiences along the way - you only get better metrics for what works and what doesn't, in practice.
Yea, it is good to see a writeup like this. I'm sure some of the servers I work with don't have proper backups, but I was cringing all the same waiting for there to be a discussion about why the central git server itself couldn't be restored from a backup.
They didn't have backups for the server, only mirrors of the content (including syncing project deletes to the mirrors). They had to re-build the server and copy in the content from a mirror rather than being able to restore the server wholesale from a backup. Not having a backup is dangerous, you can't recover quickly and risk losing all your data.
There is no need for full system backups if you have backups of the relevant data and can rebuild the machine surrounding the data. For instance to restore a buildserver, you run a script that creates and provisions a new VM, clones a git repo that contains the configuration of the CI server, clones the repos it should build and the server is ready to go. No manual actions and no backups needed.
The system is designed such that the master repositories should never actually lose objects (even with force pushes and branch deletions, the admins make backup copies of the HEAD branch before letting those run so that the blobs remain in the repo).
As it turns out though there are repo tarballs generated periodically which would have served as a perfectly acceptable backup, and some other things the sysadmins could have done. The bigger shock for them was that git clone --mirror wouldn't actually run the git integrity checks (which they had mistakenly assumed).
All of the data is migrated over to alternate storage in a way which is easily retrievable, and in a form where "restoring from backup" is frequently tested.
The thing that's missing is retention of old data, but I can tell you that is fraught with its own complications. A week-old repository tarball is almost worse than useless in the context of the git repository; we'd sooner restore that data by having a developer re-run "git push" than to lose a week's worth of development.
And that's assuming that a daily or weekly tarball isn't itself corrupt, which would have been the case here unless we ran git-fsck before making the copy (which is what was thought to have been getting run in some fashion in the first place).
I do fully agree that there needs to be more intelligence on the anongit side of the servers if they're to be used as viable backups instead of just sync destinations, but everyone keeps mentioning solutions to problems we don't have or null solutions to problems we actually have.
Despite what everyone seems to think we have multiple other backups of the source data (including tarball-style), but they're all crap in comparison to being able to recover from anongit.
> A week-old repository tarball is almost worse than useless in the context of the git repository; we'd sooner restore that data by having a developer re-run "git push" than to lose a week's worth of development.
The risk of using mirroring rather than versioned backups is that you lose all the data when a deletion is mirrored.
Yes, which is why the mirrors were affected and not the thousands of individual developers clones, nor the existing tarball snapshots.
And even that is because of a deliberate decision on the sysadmins' parts based on a misunderstanding of how git clone --mirror responds to a corrupt repo, not some simple oversight. Which is to say, countermeasures will be put in place for that as well.
I do wish people understood why relying only on even 2-week-old backups is unacceptable in the context of a large, active FOSS project's source code repository; it's not like it's OK to simply start over again from KDE 4.9.4.
>> I do wish people understood why relying only on even 2-week-old backups is unacceptable
Yes, but I'm not convinced you see that this is EXACTLY what you are exposing yourself to.
What if next time it's a (nasty) bug in git? A push causes corruption perhaps?
Drop the idea of using git itself to host the backup strategy. Switch to plain old backups; if space (or performance - I'd wager the KDE git repos must total a good few hundred GB, if not more, if there's artwork or other binaries in there too) is an issue, there would be nothing wrong with incrementals for the */30 min backups.
I think the lesson we should all reflect on here is that when designing very critical systems one should not try to be too clever or optimize to the point of the correct working of the whole system being dependent on a single assumption.
Old-school engineering used to have the concept of a safety factor, where you used a material twice as strong as the one that your equations have shown to be strong enough to sustain the conditions you expect it to operate under. Richard Hamming also put this very nicely: "Would you fly a plane if you knew it depended on some function being Lebesgue-integrable?".
Was the scenario that something unwanted might get synced one day really that improbable or so unobvious nobody considered it? Hard to say in retrospect, but I would say no. Anyway, this might happen to every one of us, so I will use the time to reflect on the systems _I_ built and whatever assumptions _I_ might have made mistakenly.
The problem is that they were not stupid, they were too clever for their own good. A stupid sysadmin would follow the 3-2-1 backup rule and have on hand at least 3 backups of the data from the past 24 hours (or sooner, if backup strategy necessitated it). The KDE sysadmins and devs thought they were super clever and could avoid this rule from operations 101, and then found out the hard way that they could not.
This is really 2 separate discussions. One is the intricacies of git. The other is ignoring operations 101 by not making proper backups. Having over a decade in ops experience, I am more focused on the latter discussion.
As someone else here said, if they had followed a braindead checklist of how to do proper backups they would not have had this problem. Standard operations procedure will save you, even if you had made faulty assumptions about your architecture.
The main point is this: they broke an ops 101 rule. They had clever reasons and assumptions for doing so. These assumptions turned out to be wrong. Now people are arguing the point. It does not change the main point, that if they had followed the ops 101 backup rule, possible disaster would not have been knocking on the door.
The fascinating thing to me about the ops 101 rule you describe: 1) it's so obviously correct, but 2) for some reason, almost nobody has a good backup plan that they actually follow, where they actually have a process to periodically validate the restore procedure works. So few people actually do this, that I've come to think of this as a basic rule to understand, but an apparently advanced one to actually follow.
You hit the nail on the head regarding overabundant cleverness. A lot of developers are too clever by half, too.
I worked for a company that actually did periodical restores of backups, and had a smaller scale version of the "Chaos Monkey" (per company procedures, someone would periodically delete on purpose a file at random - it could be part of an important product, documentation, accounts, an SQL database - and we were asked to get it back from backup).
After I got to the "real world", I'm shocked that some extremely large companies (financial institutions that should know better) don't have anything like it in place, and have far worse and untested but expensive disaster recovery plans.
How many times does it have to be said? Mirroring is NOT a backup strategy!
The number of times I've seen some sysadmin(s) base their entire organization on this faulty premise is absurd. Mostly it's because they have decided that RAID 1 or RAID 5 should be a decent "backup" strategy, but then there are those who believe mirroring systems is how to do backups.
They never, ever, take into consideration what happens when something corrupts/is deleted/is compromised. Without a way of going back in time (i.e. an actual backup) they are forever stuffed.
Sysadmins: MIRRORING IS NOT A BACKUP SOLUTION. STOP DOING THIS!!!
> Mostly it's because they have decided that RAID 1 or RAID 5 should be a decent "backup" strategy, but then there are those who believe mirroring systems is how to do backups.
I think he is implying that this is a case of the second. That is to say, I think he is saying that --mirror is not a backup strategy.
You didn't. Mirroring in this case refers to using git --mirror.
You're assuming it works like a traditional file system or block level mirror, but it doesn't. Corruption would in most cases have been caught. The weak (and accidental) link was relying on the server to give us a proper accounting of the current valid repositories.
Snapshots are stupid in the case of a content addressable, immutable data store.
You're better off asserting that your objects haven't changed (Which they weren't, and I agree that they should have been) and were valid in the first place (See above).
With snapshots, you'd inevitably want to dedupe them, which would be basically the same thing since it's append-only, but with the dedupe infrastructure as another failure point.
If you're using copy-on-write snapshots, then the total size of your snapshots should be small since most of the data in said immutable content store never changes. But the benefit is a bit-error between one mirroring operation and the next doesn't overwrite your unchanged, good data on the slaves.
The problem I think needs more attention here is ext4 silently corrupting data. ZFS has it exactly right with the built-in checksumming on write and read - it can't stop a disk going bad, but it can tell you exactly what's affected and _when_ - corruption would've been caught the moment the mirroring operation tried to read back bad data (and would've faulted the process, rather than happily returning bad data).
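Roughly what that looks like on the command line (a sketch; the pool/dataset names are hypothetical):

    # Hypothetical ZFS layout: the repositories live on tank/git.
    zfs snapshot tank/git@$(date +%F)   # cheap copy-on-write snapshot, kept as history
    zpool scrub tank                    # walk every block and verify its checksum
    zpool status -v tank                # reports exactly which files are affected, if any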
The point of a backup is to have redundancy, if backups are too integrated with their target, it becomes one complicated system together (that needs backup). This holds as a general rule.
Well, not necessarily. The issue is that filesystem corruption led to undetected Git repository corruption, which is what made it possible to push corrupted repos to the mirrors.
It would have been just as easy to push those corrupted repos to all of the backup tapes in the rotating snapshot set. A snapshotting filesystem could be a good backup (and seems to be what one of the sysadmins is pushing for).
But even more important is to fail fast and identify git repo corruption as soon as it can be detected so that further damage can be avoided.
The KDE sysadmins are well aware of that, at least. Mutable operations that would leave dangling blobs cause a backup copy of the appropriate ref to be generated before the force-push/branch-deletion/etc. are run so that there's nothing for git to garbage collect.
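In plain git terms the general idea looks something like this (a sketch of the concept, not KDE's actual hook; the repo and branch names are hypothetical):

    # Before honouring a force-push or branch deletion, park the old ref under
    # refs/backups/ so its objects stay reachable and gc can't collect them.
    repo=/srv/git/some-repo.git          # hypothetical
    branch=feature-x                     # hypothetical
    old=$(git --git-dir="$repo" rev-parse "refs/heads/$branch")
    git --git-dir="$repo" update-ref "refs/backups/$branch-$(date +%s)" "$old"
    # ...now the force-push/deletion can proceed without losing the old commits.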
Yeah, if you're incapable of accepting that a complicated scenario is complicated.
The next two paragraphs identified two things that they weren't doing that they should have been. Otherwise they'd just have lots of snapshots of bad data.
For everything with a sha1 hash, I see where you're coming from. And most of the data in a repo is covered by them. But things like tags, branches and reflogs don't themselves have hashes, they are just metadata referencing content in the append-only store. It sounds like they were backing up their reflogs, which is great, so they could recover if a user, say, accidentally deleted all the branches off the central server.
You really ought to reread my comment. I mentioned RAID because it's the most common form of this mistake.
Let me repeat what I said, italicizing the important parts:
Mostly it's because they have decided that RAID 1 or RAID 5 should be a decent "backup" strategy, *but then there are those who believe mirroring systems is how to do backups.*
> The root of both bugs was a design flaw: the decision that git.kde.org was always to be considered the trusted, canonical source.
It seems that an even bigger design flaw is that they (still) aren't doing regular backups. The mirroring of course provides some redundancy, similar to what raid does, but as they say: "raid is not a backup solution".
Backups address the problem only to the degree that they give you an older revision to revert back to. The interesting thing that happened here is that corrupted data was propagated through the mirror network, which syncs more often than backups get made, and how to prevent that. Because while having a safety net is nice, avoiding developers being inconvenienced by a failure is the real challenge.
Plus unchecked backups of corrupted data aren't worth a lot, and corruption-proof mirroring acts as further (and timely) backup.
Having a hot standby/failover is nice, but of course only the icing on top of your backup strategy.
I read the article same as the OP, it doesn't mention any backup system, only the mirroring. In fact, they did restore from a replica that was out of date in the end (from projects.kde.org). Had they had actual backups, the article would surely mention they only used projects.kde.org because it was somewhat more recent than their last backup?
The story about planning to do regular ZFS snapshots hints to the same, if they had a backup system they wouldn't need that.
edit: sorry, is that your post/do you have more insight? In that case I'm sure you know better than me speculating on what the author meant ;-)
> edit: sorry, is that your post/do you have more insight? In that case I'm sure you know better than me speculating on what the author meant ;-)
Nah, I'm not Jeff. I have some general insight because I was involved with setting up our git infrastructure in its early days, but I haven't worked on the mirroring code, and I've been out of the loop on day-to-day admin operations for a while, so I can't comment on the backup schemes that may or may not be in action on the servers right now.
They were autosyncing. They were prepared for not being able to sync (restore main system and sync back), but not for the sync to replicate a corrupt state.
Any backup strategy with only one backup (instead of preserving historical snapshots) is vulnerable this way.
Something I haven't seen anyone mention in these threads is that KDE is an open source project - therefore they have hundreds (if not thousands) of backups.
Even if all the official repos were destroyed, all they'd have to do is ask the last person who'd pulled to give them a copy of their clone.
No doubt it would be a pain to do, but no data should have been lost.
As Linus said: "Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it"
An amazing thing about git (and other DVCSs as well) is that even if a much more serious catastrophe had happened (e.g., if a nuclear bomb had struck the KDE datacenter), it would probably still be possible to reconstruct (an approximation of) the master repo, simply due to the fact that it was fully cloned on hundreds of developers' machines worldwide.
Linus Torvalds once coined an adage that "real men don't make backups. They upload it via ftp and let the world mirror it." Well, the FTP bit isn't true anymore, but otherwise DVCSs have enabled this for mere mortals.
In terms of how this would be done practically: We did have intact gitolite logs I believe, which record the credentials involved in pushing any ref and what they're getting updated to, so we'd have known what data we would have needed to locate and who we could contact to provide it. And since the commit hashes describe their content, there wouldn't have been a risk of manipulated data.
Presumably the mirrors also did not run an aggressive 'git gc' immediately after 'git remote update', so they would still have non-corrupt commits in the object store, in which case you could recover by "just" resetting any corrupt refs.
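A sketch of that recovery (the repo path is hypothetical and the known-good hash is a placeholder, recovered from a mirror, a developer clone, or the gitolite logs):

    # On a copy whose object store still has the pre-corruption commits,
    # point a damaged ref back at a known-good commit.
    cd /srv/anongit/some-repo.git
    git fsck --full                                  # shows which refs/objects are broken
    git update-ref refs/heads/master <known-good-sha1>   # placeholder: full commit hash
    git fsck --full                                  # verify the repo is consistent again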
There are security considerations with that approach. Someone could have edited their copy of the repo and put in any code they wanted. If you presume the person isn't lying, and trust their code, you've just been back doored.
Git identifies commits with SHA-1 hashes that cover both the content and the entire history leading up to them. As long as they had the hashes of the commits, they can be sure that anyone who has a chain of commits leading to that hash has the real content.
Yeah, but now you need to know the commit hash for all the branches. Basically, you need a form of backup. But this problem presumes you don't have much of a backup.
Oh sure, but at the end of the day some developer or group of developers is in charge of those repositories, and considered a trusted person (or effectively is).
Between them and the large number of developers who would have copies of the repo, reaching consensus on what the "true" repo was - while not easy - could be done in a secure fashion due to the hashes. You wouldn't have people declaring "no it's totally it" and not being able to verify.
The ftp mirror most likely has tarballs of the source of all recent versions. A VCS would not necessarily give you the same protection, unless you assume that there are a bunch of developers who keep checkouts of every major version (with a DVCS, that would certainly be the case. Otherwise? I don't think so.)
It would be interesting to hear what someone from GitHub has to say about this; they must have dealt with this issue at some point.
Also I would be interested to know why KDE don't use something like GitHub or BitBucket? It would be cheaper for the organisation, and they could still setup web hooks to get notifications of commits.
GitHub and BitBucket don't run open software, and KDE as a community doesn't feel comfortable with putting its name behind organizations that aren't playing by the rules we've set for ourselves and consider valuable for what we're doing. We also don't think it's what our donators expect us to do with their money, and we care about being responsible with their funds. Meanwhile, some parts of our server infrastructure are actually donated as well, so it's not like we're blowing the bank on hosting -- most of KDE's expenses are in putting together and attending developer conferences. Our financial reports are open too, if you're curious: http://ev.kde.org/reports/ :)
That said, we did originally try to work with Gitorious, but that didn't come to pass for a variety of reasons. Here's a small writeup of our journey back then, some of the parts discussed there have since been swapped out or fleshed out further: https://news.ycombinator.com/item?id=2972107
And I'm really glad they're like this.
Putting all your eggs in one closed-source service seems trendy these days, but in the long run it's always a huge issue.
And it's not just that you're using closed software -- you're using a closed service. Even when we were considering going with Gitorious.org, which is open source, that still left service-related concerns to address: Can you get your data out conveniently?
Because no, the data isn't all in git. For example, which set of credentials was used to write which piece of data to a repo isn't stored in git, it's logged by the authentication system in front of git.
No. Signed commits sign the commits, and because git is a DSCM, the person who created a commit doesn't need to be the person that put the commit into the repository (in fact, git further differentiates between author and committer, which is useful for "who wrote the patch vs. who put it in a commit object", but that's in the repo data).
This is in fact the main selling point of DSCMs - everything's a repo, repositories can (provided access) exchange data bidirectionally, and data can take all sorts of interesting routes from A to B involving as many repositories inbetween as you'd like.
Negotiating the actual write access to a repo happens entirely outside of git. That's really a good chunk of what software like Gitorious, GitHub or gitolite do, and then on top of that you get into the collaboration features and stuff.
Hmm. Going from how commits are formatted to include signing, I think it should be possible to strip signatures and re-add them. This would create a new commit object of course, so if you had multiple commits that needed to be pushed you would actually need to rewrite the rest of the commits to have different parents...
It may not be kosher, and I don't think the porcelain supports it, but I think you could write a remote that only accepts commits if they are all signed by someone on a trusted list. This would probably mess up a lot of workflows though, since if person A writes some commits, person B rewrites them to sign them and then pushes them to the repo, then when person A pulls the commits "they" created they would actually be different commits.
If we were willing to take that hit though, I think it should also be possible to modify git to permit multiple signatures on a single commit. Users could take a signed commit, append their signature to it (well, technically not append. Signatures are in the middle it seems) and then push it or pass it on to others to sign. This would only allow a single commit to go through the system at once (since commits after the first would need to be rewritten, invalidating previous signatures) but on the other hand it would let you create an integration server that only accepts commits that have been signed by some number of trusted code-reviewers, while still keeping all of that information in git. So long as trusted code-reviewers all signed off on it, who actually sent it to the integration server probably would not be particularly important. Such a modification to git would probably be extremely un-kosher of course...
Aye. That's probably not a massive ask. I'm wondering more right now if current versions of git with gpg would gracefully handle multiple signatures being present. If it couldn't be done in a backwards compatible fashion, then it's probably not worth much further consideration.
There are open alternatives to GitHub - we managed to put one together we're reasonably happy with, after all - but unfortunately the buck right now tends to stop at the hardware.
As for Hetzner: The git.kde.org master server isn't at Hetzner, nor are a bunch of the mirrors. Our infra is pretty distributed and eclectic as far as hosting locations go, partly because a lot of the resources are donated from all over. We don't "support" any hoster in particular.
It feels strange to me that so far only one commenter here (ok, he's at the top now) has identified where the main mistake seems to be:
A SYNCHED COPY IS NOT A BACKUP.
This includes:
- git
- svn
- Dropbox
- RAID (of any kind)
If you don't believe this, please reconsider everything you've learned about backups. More specifically, if you treat any VCS or Dropbox as your only backup system, STOP RIGHT NOW and at least get something that's intended to be a backup, such as SpiderOak in backup mode.
To me, it sounds like the mirroring system is circumventing Git and is syncing the underlying directory structure, in which case, Git is absolutely not to blame. It's not a Git reliability issue. Had they been using "git fetch" on the mirror servers to clone from the backup servers, checking SHA1s while doing so, the issue would not have happened and the corrupt files would not have gotten silently replicated across servers.
The mirroring they were using is explicitly meant to be a fast local ad-hoc clone that doesn't do integrity checks.
They used the safe version before, because they were running into problems with the integrity checks, i.e. ref deletions and non-fast-forwards.
What they should have done was to write a hook or a script that did those non-safe updates manually (maybe only for some repositories, and some refs, don't want to rewind e.g. master).
But instead they completely bypassed the safety mechanisms and got screwed by corruption.
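A sketch of what that "safe by default" sync could look like (hypothetical paths, not the actual KDE scripts): refspecs without a leading '+' make git refuse non-fast-forward updates, and not pruning means ref deletions don't propagate automatically either, so both require a deliberate per-ref decision by an operator.

    #!/bin/sh
    cd /srv/anongit/some-repo.git    # hypothetical mirror path
    if ! git fetch origin "refs/heads/*:refs/heads/*" "refs/tags/*:refs/tags/*"; then
        echo "some refs were rejected (forced update or deletion?) -- needs review" >&2
    fi

    # An operator can then deliberately replay a legitimate rewind for one ref:
    #   git fetch origin "+refs/heads/some-branch:refs/heads/some-branch"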
> Originally, mirrored clones were in fact not used, but non-mirrored clones on the anongits come with their own set of issues, and are more prone to getting stopped up by legitimate, authenticated force pushes, ref deletions, and so on – and if we set the refspec such that those are allowed through silently, we don’t gain much. A hybrid approach of a non-mirror initial clone followed by a shift to mirror mode could force the server to validate the entire repository as it packs it, so that is something worth investigating.
Non-checking backups are not a perfect solution here. Running on non-ZFS filesystems, you can get slowly building corruption in files. When you take a backup, you copy that same corruption over to your backup as well.
Going back through years of backups to find a non-corrupt copy can take a lot of time during which your service is down. Not a perfect solution by a long shot. Discovering which files have been updated and which are corrupt is also non-trivial.
Do your daily backups using rsync+hardlinks (rsnapshot, dirvish or something similar) and keep a long history. This is slower than copy-on-write ZFS (obviously), but works reliably on any Linux/Unix file system and the storage cost is roughly the same as for ZFS.
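A minimal sketch of the rsync+hardlink approach those tools automate (paths are hypothetical): unchanged files are hardlinked against yesterday's copy, so each daily snapshot only costs the space of what actually changed.

    #!/bin/sh
    SRC=/srv/git/repositories/
    DST=/backup/git
    TODAY=$(date +%F)
    YESTERDAY=$(date -d yesterday +%F)

    # On the very first run --link-dest just warns that yesterday's dir is missing
    # and falls back to a full copy.
    rsync -a --delete \
        --link-dest="$DST/$YESTERDAY" \
        "$SRC" "$DST/$TODAY"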
This is kind of what I'm doing as backups, but I still don't feel safe (I'm kind of paranoid about my backups): what if an attacker gets in your server and wipes out all your data and backups? And you know Murphy is always ready to strike... I'm currently looking at making regular backups offline, on DVD or Blu-ray discs, and automating the process. I wonder if this might be a service people are interested in. Let me know what you think... (I put a landing page at http://www.offlinebackups to test reactions)
It is never a good idea to keep backup copies at the same place as the source data, so normally it should be not that common for an attacker to be able to wipe both original and backup. Regarding the offline optical disc backups, they are still ridiculously expensive compared to magnetic spinning drives or tapes. Backup, especially an automated one, is always an extra security risk to consider, but apparently there are no other good ways...
A lot of people put their backups on S3, with a script running on the server. Even if you limit the rights with IAM to only put files, the attacker can overwrite existing files on S3. The only way I thought to prevent that is to give only write access with no listing access, and append a random number to the file name. But, who does that? I'm sure 90%+ of the servers backing up on S3 are not safe for this scenario.
If you turn on file versioning in S3, then you'll be able to get to the data that was "overwritten". I don't think there's a way for someone with only PUT access to work around this.
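For reference, enabling that is a one-liner with the aws CLI (the bucket name is made up):

    # Sketch: turn on bucket versioning so an attacker with PUT-only credentials
    # can only add new versions, not destroy the old ones.
    aws s3api put-bucket-versioning \
        --bucket my-backup-bucket \
        --versioning-configuration Status=Enabled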
My backup strategy is as follows: my most important files are in my Dropbox folder. So they're both on my computer and on Dropbox' servers.
But what if my drive goes bad and pushes corrupted files to Dropbox?
That's why I have another client with Dropbox that I only turn on every week or so. I hope that if something goes wrong (including Dropbox itself wiping all my files both remotely and locally), I can still get the older versions. That, and time machine backups (they include Dropbox folder).
It's funny how even in their postmortem they don't seem to understand the obvious: live mirrors are not a backup strategy.
Git mirroring is great, and periodic checks for consistency would help, but snapshots taken and stored (offline) for reasonable periods of time are the only reasonable backup model. There are corruption issues, availability issues, etc. where offline backups are far more reasonable. Ideally you would separately cryptographically sign your backups (which is easier than just keeping track of hashes), too.
(and obviously a backup system is meaningless if you don't also check for restores periodically, and monitor the success of the whole process)
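Signing the backup artifacts is cheap to do. A minimal sketch with GPG (file names and paths are hypothetical, and it assumes a signing key is already set up):

    # Detached signature alongside the backup, so a restored tarball can be
    # verified before anyone trusts its contents.
    f="kde-git-backup-$(date +%F).tar.gz"
    tar -czf "$f" -C /backup/git .
    gpg --detach-sign --armor "$f"     # produces "$f.asc"
    # later: gpg --verify "$f.asc" "$f"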
I had a few questions.
Q1 Why not run a git fsck on the canonical server before allowing mirror servers to sync?
Q2 Could it be possible to optimize git fsck to only do incremental checks, on the diffs sent to the mirrors?
Q3 If a canonical git server is used, why not ensure this one is very safe against data corruption?
Q4 What about the ext4 corruption in the VMs? Is the cause identified?
His last sentence about ZFS is impossible to parse. Why aren't they using it?
"I’d love to see this in use, but, after having had excellent experiences with it on Linux for a couple of years, I’m a ZFS fanboy at this point; and, I don’t know how well it’s supported on SUSE, which is the distribution git.kde.org is running (although I’ve run it on Gentoo, Debian, and Ubuntu without any problems)."
He'd love to use it: he's had excellent experiences running it on Gentoo/Debian/Ubuntu for a couple of years and is a ZFS fanboy at this point, but since he's unsure how well it's supported on SUSE, which is what git.kde.org runs, he hasn't.
Ok, the elephant error is still unseen in the room - so I give you some more hints: nobody has written about this error until now - it is still there and danger of total destruction also is still there, because the single root of evil was not destroyed. You still do not see it?
This was not meant to look smart; I refuse to compete and don't feel the need to position myself, I speak freely here - please do not apply a ratrace-like competitive mindset, that would be misunderstanding me. I was just really interested in the question of whether the SPOF is not seen by anybody.
Of course I am willing to help - the problem is described clearly in the first point of the author: they are generating one "projectfile" - whatever this looks like, it is a reduction of many to one. The distribution of 1500 git repos with n-thousands of files relies on one single file - there is no technical need for that; in fact it eliminates the power of distributed repos by concentrating reliance on the presence and integrity of one single text file.
The author writes about the process of this file being corrupted and triggering a random repo-killing process - the incident is a textbook bad example of what can happen if you adopt the antipattern of making one out of many.
Building redundant systems you always try to achieve the opposite - make many out of one, to eliminate the spof. You can not scale infinitely with this, because in the end we are living on just one planet.
However, unnecessarily making one out of many is the worst thing you could do when building a backup or code distribution system. This antipattern still exists in many places and should be eliminated.
This is not about filesystem corruption etc. - the reason for the destruction was one single project file. Do not do this. It is not critical for a backup system if it takes a long time to scan a filesystem for existing folders over and over again. A backup system is not a web app, where it might be a good thing to make one out of many (a.k.a. caching in this case); a backup system does not need this reduction.