On undoing, fixing, or removing commits in git

sisk · on Dec 15, 2013

Regarding losing data: it's as simple as diving into the reflog. In order to remove something from your history, you must do so very explicitly by walking your commit history, editing each one. There is an automated workflow to accomplish that (`filter-branch`) but it's definitely not a command anyone I know has committed to memory.

Accidental mutations can be undone either by `--abort`ing (if the command supports it) or by checking out an earlier revision from the reflog.
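For anyone who has not tried the reflog route, here is a minimal sketch in a throwaway repository. The file name, commit messages and identity are made up for illustration, and the "accident" is simulated with a hard reset.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two >> file.txt
git commit -q -am "second"
before=$(git rev-parse HEAD)

# The "accident": throw away the latest commit.
git reset -q --hard HEAD~1

# The reflog still lists every position HEAD has had.
git reflog

# HEAD@{1} is where HEAD pointed just before the reset.
git reset -q --hard 'HEAD@{1}'
after=$(git rev-parse HEAD)
```

After the second reset, `HEAD` is back on the commit that looked lost.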

The GC in git is pretty conservative and, while it can be triggered manually, still makes you jump through some hoops to actually get rid of something. Steve Klabnik wrote about it[1] a little while back.

In certain cases, you don't have access to the reflog because a change wasn't made locally. Perhaps someone screwed up a remote you pull from and it destroyed your history. You can, even still, find, view, and re-associate orphaned objects. Yeah, it's not terribly intuitive and, again, not a workflow anyone has probably committed to memory, but the fact that you can recover from a disaster of that magnitude is pretty amazing.
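A rough sketch of that recovery, assuming a throwaway repository where the reflog is emptied on purpose to imitate the "no reflog" situation. The branch names `feature` and `recovered` are hypothetical.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo base > file.txt
git add file.txt
git commit -q -m "base"
main_branch=$(git symbolic-ref --short HEAD)

git checkout -q -b feature
echo work >> file.txt
git commit -q -am "feature work"
lost=$(git rev-parse HEAD)

# Delete the branch and empty the reflog, so nothing names the commit any more.
git checkout -q "$main_branch"
git branch -q -D feature
git reflog expire --expire=now --all

# fsck reports objects that no ref or reflog entry can reach.
found=$(git fsck --lost-found 2>/dev/null |
  awk '$1 == "dangling" && $2 == "commit" { print $3 }')

# Re-associate the orphaned commit by pointing a new branch at it.
git branch recovered "$found"
git log --oneline recovered
```

This only works until garbage collection prunes the orphaned objects, so it is worth doing sooner rather than later.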

git provides us developers with a set of powerful tools, and that comes with a level of responsibility. I'd rather have the ability to responsibly clean my history than the alternative.

[1] - http://words.steveklabnik.com/git-history-modification-and-l...

mikeash · on Dec 15, 2013

"Strongly consider taking a backup of your current working directory and .git to avoid any possibility of losing data as a result of the use or misuse of these instructions."

WTF?

What is the point of a version control system if you have to take backups of it to avoid losing data when performing certain operations?

I use git, I like git, but certain aspects of it are fundamentally broken.

gemma · on Dec 15, 2013

No, that advice from the article is fundamentally broken. Outside of the garbage collection system (which runs by default after what, 30 days? 90?), Git doesn't delete committed content. Any commit you "lose" through rebasing, amending, resetting, etc. can always be recovered. It's a little more complicated than renaming a directory, sure, but it's important, and it's not something a Git tutorial should ignore.

Git IS safe, and ANYTHING involving changes to history can be undone without resorting to backups. Data loss can occur when you're mucking about with uncommitted changes, but that's a risk in most other version control systems as well.

jordanscales · on Dec 15, 2013

Surprised to see no one in the comments has mentioned the reflog [0]. It's really very easy.

[0]: http://jscal.es/2013/08/05/seriously-the-reflog-isnt-that-sc...

crystaln · on Dec 15, 2013

I'm not 100% sure this is true, however it is also a fundamental flaw of git. There should be a way to remove commits permanently in order to remove mistakenly checked in large files or private content.

It's also definitely not true with uncommitted changes, including gitignored files.

gemma · on Dec 17, 2013

I still don't see the "fundamental flaw". Non-reachable commits are automatically deleted by the garbage collection system, which can also be run manually. Accidental commits with large files or private content can be "modified" (technically copied and rewritten, since individual commits are immutable) with rebase, amend, filter-branch, etc. Those operations make the original commits unreachable, so garbage collection takes care of deleting them.

And like I already said above, data loss can occur when you're working with uncommitted changes, just like in most other version control systems. If the content is not under version control (in this case, not in a git commit), it's not safe.

Honestly, you guys should go watch Linus Torvalds' presentation at Google about Git. The entire point, the massive problem he was trying to solve, was preservation and verification of data integrity.

lomnakkus · on Dec 15, 2013

git filter-branch will let you remove content permanently and irrecoverably if you really need to.

Regarding uncommitted changes: This is in the same category as forgetting to do your backup before starting to mess around, IMO. I would encourage anyone to simply get used to committing extremely often and just using a quick interactive rebase before pushing.

mtdewcmu · on Dec 16, 2013

At the last job where I used git, I'd work in a separate branch, and I started using `git merge --squash` to merge into the main branch to keep the history from getting too difficult to follow. When git merges a bunch of different histories into one, it becomes almost impossible to make sense of if people make lots of small commits. I shy away from `git rebase`, because it seems dangerous.
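For reference, a small sketch of that squash workflow in a scratch repository. The branch name `topic` and the commit messages are invented for the example.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo base > notes.txt
git add notes.txt
git commit -q -m "base"
main_branch=$(git symbolic-ref --short HEAD)

# Lots of small work-in-progress commits on a separate branch.
git checkout -q -b topic
for i in 1 2 3; do
  echo "step $i" >> notes.txt
  git commit -q -am "wip $i"
done

# --squash stages the combined change without creating a merge commit,
# so the main branch gains a single new commit.
git checkout -q "$main_branch"
git merge -q --squash topic >/dev/null
git commit -q -m "Add notes (squashed from 3 commits)"

main_count=$(git rev-list --count HEAD)
topic_count=$(git rev-list --count topic)
```

The small commits stay on `topic` for as long as that branch exists; only the main branch history is condensed.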

lomnakkus · on Dec 17, 2013

Never fear! "git rebase" isn't nearly as dangerous as many have been led to believe... unless you start rebasing things you've already published/pushed elsewhere. In that case you need to be very proactive about notifying everyone who could possibly have checked out your branch, etc. Otherwise: it certainly takes a little getting used to, but I find a little one-on-one "mini-mentoring" others with the first few rebases helps them immensely, so if you have someone who can help you in person it might be a lot easier to get comfortable with the process that way.

crazygringo · on Dec 15, 2013

I understand your puzzlement, I found this confusing too at first. But then I realized it makes sense -- one of git's strengths is that you can rewrite the history. The "point" of a version control system, at least with git, is not backup which retains all history, but rather versioning which retains the history you want to retain.

Obviously, if you choose not to edit the history, then you never need to back up in this sense, and you're free to do that. But then you can't ever go back and change things (like remove accidentally committed passwords, etc.)

But if you choose to rewrite the history, and mess up, then you'll be glad you had a backup. And (in response to other comments), even if there are ways of still retrieving/fixing data, it's often easier to just restore from your backup, especially when you're trying out git commands for the first time, and you're not entirely sure if they'll work exactly how you expect. None of us are git experts from the beginning, and I've resorted to git backups numerous times when trying out a command for the first time, and then discovering it wasn't the right way.

simcop2387 · on Dec 15, 2013

An easy way to do that, is the way I tend to do it; Create a new branch based off the one you're rewriting history in, and that will actually keep all of that for you even after you rewrite it all. Makes it really easy to restore later with git reset if you need it.

Myrmornis · on Dec 15, 2013

Yes, this is how beginners should be taught to "back up" in git.

simcop2387 · on Dec 17, 2013

In the interest of making this easier:

    git branch branchname-backup
    # do dangerous stuff ...
    # whoops, I just broke the branch really badly, I'll start over
    # (--hard also restores the index and working tree)
    git reset --hard branchname-backup

zimbatm · on Dec 15, 2013

Actually `git reflog` contains the HEAD history. Even after a rebase it's possible to check out an old commit (unless git has garbage-collected it).

_ikke_ · on Dec 15, 2013

Git is quite safe, and most operations that involve doing things to history can be undone. Unsafe operations happen when the working tree and uncommitted changes are involved.

Also, sometimes it's easier for a user to roll back to an older backup than to untangle the mess they have created.

Third, git itself is not a backup. When your repository gets corrupted, you're out-of-luck when you don't have backups for those files. So it's still good to take backups of your repositories.

mikeash · on Dec 15, 2013

First, your use of the word "most" is inherently incompatible with the phrase "quite safe".

Second, why would a version control system make it so difficult to roll back to an old version that it's easier to restore from backup? This is insane.

Third, I'm well aware of this, and of course you should be making backups of your git repositories (and everything else). But those backups should be there to protect against hardware failure and other external data-loss events, not protect against git itself.

pyre · on Dec 15, 2013

You're discounting the idea that someone might want to destructively rewrite their history. Here's an example: What if you want to retain history, but remove a password that was hardcoded into a source file?

The simple options are:

- Remove the hard-coded password, and create a new repository with the current state of the code as a starting point.

- Start a new repository with the current code state, but keep the old repository around under lock-and-key, then perform 'complex' patch operations to move changes between the two repositories (e.g. roll back to a previous version of a file before the cut-off).

- Go back through your history, and manually create a new repository from each patch, but removing the password when you get to that commit.

If git always preserves all history, no matter what, then these are your only options.

While operations like `git filter-branch` sound scary, they don't delete the commit objects from your .git folder. If you created a new branch called (e.g.) master-old before running `git filter-branch` on your repository, then you can always 'roll back' to master-old if something goes wrong. Or, slightly more complex, you could use the reference listing in the reflog to 'roll back' the changes.
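A hedged sketch of the password scenario, using the `--index-filter` form from the filter-branch manual page. The file names, the `-old` suffix and the password are made up, and `FILTER_BRANCH_SQUELCH_WARNING` only silences the warning that newer git versions print.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "code" > app.txt
echo "hunter2" > password.txt
git add app.txt password.txt
git commit -q -m "initial commit, password included by mistake"
echo "more code" >> app.txt
git commit -q -am "more work"

# Keep a pointer to the untouched history before rewriting anything.
branch=$(git symbolic-ref --short HEAD)
git branch "$branch-old"

# Rewrite every commit reachable from HEAD, dropping password.txt from each.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --index-filter 'git rm -q --cached --ignore-unmatch password.txt' \
  HEAD >/dev/null 2>&1

new_hits=$(git log --oneline -- password.txt)
old_hits=$(git log --oneline "$branch-old" -- password.txt)
```

If the rewrite goes wrong, `git reset --hard "$branch-old"` puts the branch back. Once you are happy with the result, the backup branch (and `refs/original/`) has to go too, otherwise the password is still in the repository.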

mikeash · on Dec 15, 2013

I'm not discounting it, I simply don't agree with how git implements it.

IMO the correct option is to create a new repository that has the same history as the old repository minus the offending commit (or possibly with an edited version of that commit that leaves out the offending string).

Because it creates a new repository, there's no risk of data loss in your old repository. Once you're confident that the operation succeeded, you can swap them.

I haven't had to do this for a long time, but as I recall, this is basically how svn does it. It works fine.

The problem with git is that it makes this far too easy and it works by editing existing repositories rather than creating new ones. So instead of once-in-a-blue-moon repository hacking to get rid of that password you accidentally committed, you get people rewriting history because they think the real history isn't "clean". I know a lot of people who routinely edit their local history before pushing changes to a shared repository because they don't want other people to see their true "dirty" history. This is insane.

Finally, I'm confused about something, so maybe you could clear this up for me. I keep seeing assurances that 1) git does not actually destroy any data, and you can always recover if you screw up and 2) editing history is sometimes a vital necessity for cases like when you commit passwords. You yourself made these assurances in this comment. However, 1 and 2 are obviously mutually exclusive. If you can always recover then you can't actually scrub the repository of accidentally committed passwords and the like. Which one is actually true?

rajivm · on Dec 15, 2013

Re: 1 and 2

1) This is almost true. Anything that is committed to Git is recoverable. When you "re-write" history, Git is creating a new set of commits in the history, an "alternate history path." It does not destroy the original commits, but there is no named reference to them (unless you created a branch/tag pointing to this line of commits).

2) In this case, if you want to actually destroy these unreferenced commits, you must run "git gc". This IS a destructive command. It will remove any unreferenced commits from the repository. (gc = garbage collect). If you never garbage collect, you will always have access to anything that was ever committed. It just might be hard to find since the only reference is the reflog (if it was recent) or the commit hash.
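To make the two halves concrete, here is a sketch in a scratch repository. One detail it assumes beyond the description above: a plain `git gc` leaves the old commit alone while the reflog still mentions it, so the reflog has to be expired first. File names and contents are invented.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "password=hunter2" > config.txt
git add config.txt
git commit -q -m "add config"
old=$(git rev-parse HEAD)

# Amending creates a replacement commit; the original becomes unreferenced.
echo "password=" > config.txt
git commit -q -a --amend -m "add config"

# Ordinary gc: the reflog still points at the old commit, so it survives.
git gc -q
if git cat-file -e "$old" 2>/dev/null; then
  after_plain_gc=present
else
  after_plain_gc=gone
fi

# Drop the reflog entries, then prune immediately: now it is really deleted.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$old" 2>/dev/null; then
  after_forced_gc=present
else
  after_forced_gc=gone
fi
echo "plain gc: $after_plain_gc, forced gc: $after_forced_gc"
```

So both statements hold, just at different times: recoverable by default, gone for good once you explicitly expire and prune.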

mikeash · on Dec 15, 2013

Since garbage collection does happen automatically after a while, it seems that the "doesn't destroy data" bit isn't completely true. But I understand that it's a fairly rare case where you're going to screw something up and then not bother to get it back until after garbage collection cleans it up.

Thanks for clarifying that.

sofal · on Dec 15, 2013

I know a lot of people who routinely edit their local history before pushing changes to a shared repository because they don't want other people to see their true "dirty" history. This is insane.

This is no more insane than editing a source code file before you save it to the file system. Git is used as a development tool as well as version control, and developers are therefore encouraged to commit often, even if the code does not actually compile yet. There is no more need to fill the published history with all of these WIP commits than there is for me to know about every goddamn keystroke you made while you were dicking around with that config file.

mtdewcmu · on Dec 16, 2013

Is the history stored as a text file somewhere that you can just edit? I sometimes wish git were a bit more transparent and less of a black box.

pyre · on Dec 16, 2013

I suggest that you pick up any git tutorial out there. It will soon become less of a black box.

mtdewcmu · on Dec 16, 2013

I've read a lot about git. The docs generally don't pick apart what's inside the .git directory.

pyre · on Dec 16, 2013

- The history items are stored as commit objects that are identified as a SHA-1 sum of the contents (including meta-data like Authored By, Committed By, etc).

- One of those meta-data items is "Parent Commit," so if you change one item in history, it changes the SHA-1 sum of all subsequent items (because at the very least they all need to be re-parented).

- All of the commit objects are stored under .git/objects.

- Branches are just files under .git/refs/ that contain the SHA-1 sum of the most recent commit on that branch. This is why they are called 'branch pointers.' That's basically all they are.

- If you have a history of 5 commits, and make a change to the initial commit, you now have 10 commits in your .git/ directory. Your (e.g.) 'master' branch will point to the most recent 'tree' of 5 commits. The other commits will still exist in .git/objects, but there will be no branches pointing to them. You can use 'git reflog' to find them, or access them by their SHA-1 sum.

- Eventually 'git gc' (gc = garbage collect) will clean out the unreferenced commits, but this happens rarely if you don't explicitly run the command.

- When you 'git push,' you are only pushing branches to the remote repo, so commits that are stored locally, which are not referenced by one of those branches you are pushing, will not be pushed out. If you have commits that you don't want to end up in limbo like this, you should 'git tag' them or create a branch (e.g. 'archive/master-2013-12' that points to them).
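You can poke at most of the points above in a scratch repository. This sketch assumes a freshly created repo using the default on-disk layout, where refs and objects are still loose files (after packing they move into pack files and `.git/packed-refs`).

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo hello > file.txt
git add file.txt
git commit -q -m "first"

branch=$(git symbolic-ref --short HEAD)
head_id=$(git rev-parse HEAD)

# A branch is a small file holding the id of its latest commit.
ref_file_id=$(cat ".git/refs/heads/$branch")

# The commit object lives under .git/objects, split as <2 chars>/<rest of id>.
prefix=$(echo "$head_id" | cut -c1-2)
rest=$(echo "$head_id" | cut -c3-)
object_path=".git/objects/$prefix/$rest"

object_type=$(git cat-file -t "$head_id")
echo "$branch -> $ref_file_id ($object_type at $object_path)"
```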

mtdewcmu · on Dec 16, 2013

It looks like .git/logs contains the history. It looks like the file format is a space-separated list, with the format "$parentcommitsha1 $newcommitsha1 ... $commitmessage". That's fairly comprehensible. What are the SHA-1 sums of? Are they of the entire snapshot, or the delta? I went into objects/ and ran `sha1sum $objfile`, and the sum did not match the file name. So that remains obscure. `file $objfile` could not identify the format; it gave nonsense.

Thanks for the help.

>One of those meta-data items is "Parent Commit," so if you change one item in history, it changes the SHA-1 sum of all subsequent items (because at the very least they all need to be re-parented).

What sequence of operations would change a history item in that way?

pyre · on Dec 17, 2013

> It looks like .git/logs contains the history. It looks like the file format is a space-separated list, with the format "$parentcommitsha1 $newcommitsha1 ... $commitmessage". That's fairly comprehensible.

I've never looked at .git/logs, but it looks like that is used by the `git reflog` command. It's basically a history (or log) of every commit that a particular reference has pointed to[1]. For example, I cloned the git source code:

  user@host ~/src/git % cat .git/logs/HEAD
  0000000000000000000000000000000000000000 d7aced95cd681b761468635f8d2a8b82d7ed26fd First Last <user@example.com> 1387237920 -0500	clone: from https://github.com/git/git.git

  user@host ~/src/git % git reflog
  d7aced9 HEAD@{0}: clone: from https://github.com/git/git.git

Note: `HEAD` is a reference to the current branch. E.g.:

  ~/src/git $ cat .git/HEAD
  ref: refs/heads/master

  ~/src/git $ cat .git/refs/heads/master
  d7aced95cd681b761468635f8d2a8b82d7ed26fd

It's also of note that branches are referred to as 'references' too, hence storing them under `.git/refs/`.

> What are the SHA-1 sums of? Are they of the entire snapshot, or the delta? I went into objects/ and ran `sha1sum $objfile`, and the sum did not match the file name. So that remains obscure.

See: http://stackoverflow.com/questions/5290444/why-does-git-hash...

[1]: Since the local repository was created. This information does not sync between local and remote.
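The short version of that answer: the id is the SHA-1 of a small header (`<type> <size>` plus a NUL byte) followed by the uncompressed content, while the file under `.git/objects` is zlib-compressed. That is why `sha1sum` and `file` on the raw object file give nothing useful. A quick check for a blob, assuming `sha1sum` is available and the default SHA-1 object format:

```shell
set -e
# What git computes for a blob containing "hello\n" (6 bytes).
git_id=$(printf 'hello\n' | git hash-object --stdin)

# The same id by hand: header "blob 6", a NUL byte, then the content.
manual_id=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

echo "git:    $git_id"
echo "manual: $manual_id"
```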

mtdewcmu · on Dec 17, 2013

>I've never looked at .git/logs, but it looks like that is used by the `git reflog` command. It's basically a history (or log) of every commit that a particular reference has pointed to[1]. For example, I cloned the git source code:

I think it's more or less the DAG represented as an adjacency list. I'd have to think a bit about why there is a separate log file for each branch. It seems that there's some redundancy in doing that, and I'm wondering what the advantages and disadvantages are of splitting the history up in that way.

>It's also of note that branches are referred to as 'references' too, hence storing them under `.git/refs/`.

I've developed a loathing of excessive hierarchies/trees, so I'd rather see them flattened in a single directory. But that makes sense.

>See: http://stackoverflow.com/questions/5290444/why-does-git-hash....

That's a good link. What's in an object? If an object corresponds to a commit, then it must aggregate data about changes to multiple files.

pyre · on Dec 17, 2013

> I think it's more or less the DAG represented as an adjacency list. I'd have to think a bit about why there is a separate log file for each branch. It seems that there's some redundancy in doing that, and I'm wondering what the advantages and disadvantages are of splitting the history up in that way.

Think of each branch as a pointer. Then realize that you can make that pointer point anywhere on the DAG, even to parts of the DAG that have no connection to each other. The `reflog` is a (local, non-comprehensive) history of where that pointer has pointed. That's why there is a separate log for each branch. I guess that technically they could have a single log file and add another field to specify the branch, but using the same directory tree structure as under .git/refs/ makes the mental model simpler (and probably a performance improvement not to have to parse the reflog for every branch just to see the reflog for one branch).

> I've developed a loathing of excessive hierarchies/trees, so I'd rather see them flattened in a single directory. But that makes sense.

I'm not sure what branches living under .git/refs has to do with excessive hierarchies/trees. There are enough things stored in the .git directory, that if you mashed them all together it wouldn't make any sense.

> What's in an object?

If you really care to dive deeper, you can check objects here: https://github.com/git/git/blob/master/object.h

You can get a shorter version towards the bottom of the git manpage (e.g. `man git`):

  IDENTIFIER TERMINOLOGY
         <object>
             Indicates the object name for any
             type of object.
  
         <blob>
             Indicates a blob object name.
  
         <tree>
             Indicates a tree object name.
  
         <commit>
             Indicates a commit object name.
  
         <tree-ish>
             Indicates a tree, commit or tag
             object name. A command that takes a
             <tree-ish> argument ultimately wants
             to operate on a <tree> object but
             automatically dereferences <commit>
             and <tag> objects that point at a
             <tree>.
  
         <commit-ish>
             Indicates a commit or tag object
             name. A command that takes a
             <commit-ish> argument ultimately
             wants to operate on a <commit> object
             but automatically dereferences <tag>
             objects that point at a <commit>.
  
         <type>
             Indicates that an object type is
             required. Currently one of: blob,
             tree, commit, or tag.
  
         <file>
             Indicates a filename - almost always
             relative to the root of the tree
             structure GIT_INDEX_FILE describes.

mtdewcmu · on Dec 17, 2013

I noticed that there is no delta compression until objects get incorporated into a pack.

>Think of each branch as a pointer. Then realize that you can make that pointer point anywhere on the DAG, even to parts of the DAG that have no connection to each other. The `reflog` is a (local, non-comprehensive) history of where that pointer has pointed.

I got that branches were pointers. Now that I'm aware that the DAG is fully represented inside objects, I can see that what's inside logs/ is actually just logs. Each log corresponds to a subgraph of the full DAG. Getting history from a log would be more efficient than from the objects themselves, because to get it from objects, you'd have to dereference a lot of object references.

>I'm not sure what branches living under .git/refs has to do with excessive hierarchies/trees. There are enough things stored in the .git directory, that if you mashed them all together it wouldn't make any sense.

Having to descend through layers of subdirectories makes things harder. I'd reduce the depth of the directory tree to the absolute minimum. It's hard to tell if this is the minimum without knowing exactly what all the implementation constraints might have been.

I can see that the real meat of this system is the object store. It's useful to know about `git cat-file` for inspecting it.

pyre · on Dec 18, 2013

> Each log corresponds to a subgraph of the full DAG

I don't have the time to keep up this conversation, but this assertion is wrong. It is not a subgraph. It is a history of the values that the pointer was pointing to (e.g. "Pointer <branch_name> changed from pointing to value AAA to value BBB due to action XXX"). That is basically what all of those entries are. 'AAA' and 'BBB' may be in completely unconnected sections of the DAG.

If you create a new repository and add a couple of commits, then yes the reflog files will look like a history, but only because the branch pointer has traversed the DAG from start to end with no deviations.

For example you can have a DAG like this:

   A - B - C - D - E

   X - Y - Z

If you change the branch pointer to move from B to Z, this is not a subgraph. Well, I guess technically you could call it a graph of the history of the branch pointer, but it in no way corresponds to the DAG other than that all of the pointer values exist within the DAG. For example the following operations:

  git clone
  git reset --hard Z
  git reset --hard X

Would create a graph like this (assuming that master pointed to E when you cloned):

  E - Z - X

Notice that this really doesn't correspond to the DAG other than the fact that those objects exist in the DAG.

Note:

- All of this information is only contained within the .git/logs files. None of it is stored in the objects themselves.

mtdewcmu · on Dec 17, 2013

I'm guessing that the reason each branch has its own history is probably related to the goal of only appending new entries at the end of things. Since any branch can be under development, they need their own files. It sort of makes sense.

pyre · on Dec 17, 2013

I still think that you're a little confused. The reflog is a "history of where this branch has pointed since the repository was created/cloned." If I clone a repository with a history of 100 commits on the 'master' branch, the reflog for the 'master' branch will only have one entry. You can completely delete the `.git/logs` and still run `git log` successfully.

Here's an example:

  $ git clone blah
  
  DAG:
  
    A - B - C - D - E
        \
         Z - X - Y
  
  
  Branches:
  
   master => E
   topic/new-feature => Y
  
  
  reflog:
  
    master
      E - clone from blah
  
    topic/new-feature
      Y - clone from blah

Notice how cloning a repository with an existing DAG doesn't populate the reflog. It just gives it a single entry saying that the branch was updated from 'nothing' to whatever commit it was pointing to remotely.

Now let's change where 'master' is pointing:

  $ git reset --hard C    # with master checked out
  
  
  DAG:
  
    A - B - C - D - E
        \
         Z - X - Y
  
  
  Branches:
  
   master => C
   topic/new-feature => Y
  
  
  reflog:
  
    master
      E - clone from blah
      C - reset to C
  
    topic/new-feature
      Y - clone from blah

Notice how the reflog is a history of the values that the branch was referencing, but is not the history as what you get when you run 'git log'. After the reset, 'git log master' would show you commits A, B and C, but A and B are nowhere in the reflog.
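If you want to see this for yourself, here is a sketch that builds a five-commit repository, clones it, and compares what `git log` and the branch reflog know. The paths and commit messages are made up.

```shell
set -e
origin=$(mktemp -d)
cd "$origin"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
for i in 1 2 3 4 5; do
  echo "$i" > file.txt
  git add file.txt
  git commit -q -m "commit $i"
done

work=$(mktemp -d)
git clone -q "$origin" "$work/clone"
cd "$work/clone"
branch=$(git symbolic-ref --short HEAD)

# Right after cloning: full history in the DAG, one line in the reflog.
log_before=$(git rev-list --count "$branch")
reflog_before=$(git reflog show "$branch" | wc -l | tr -d ' ')

# Move the branch pointer back two commits.
git reset -q --hard HEAD~2
log_after=$(git rev-list --count "$branch")
reflog_after=$(git reflog show "$branch" | wc -l | tr -d ' ')

# The reflog is not needed to read history at all.
rm -rf .git/logs
log_without_reflog=$(git rev-list --count "$branch")
```

The commit counts come from walking parent links in the object store; the reflog only gained one line for the one pointer move.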

mtdewcmu · on Dec 17, 2013

I see. I started reading the internals chapter at[1]. This free book seems better than the O'Reilly book, which I bought.

So the DAG is actually stored inside objects. The contents of the objects directory could be described by a relational schema, and I think that would make it easier for a lot of people to understand (myself included):

  Blob
  - sha1hash (primary key)
  - contents (blob)

  Tree
  - sha1hash (primary key)

  TreeEntry
  - treeid (foreign key into Tree)
  - mode (mode of blob/subtree)
  - type ("blob" or "tree")
  - objectid (foreign key into Tree or Blob)
  - name

  Commit
  - sha1hash (primary key)
  - tree (foreign key into Tree)
  - parent (foreign key into Commit)
  - author
  - committer
  - comment

The tree entries are actually denormalized and stored as a list inside the tree. You could represent this more accurately with XML. But who likes XML?

[1] http://git-scm.com/book
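The "foreign keys" in that schema can be followed by hand with `git cat-file -p`. A small sketch in a scratch repository, with invented file names; the `awk` parsing assumes the usual `tree` / `parent` header lines at the top of a commit object.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two >> file.txt
git commit -q -am "second"

# Commit row: tree, parent, author, committer, then the message.
git cat-file -p HEAD
tree_id=$(git cat-file -p HEAD | awk '$1 == "tree" { print $2; exit }')
parent_id=$(git cat-file -p HEAD | awk '$1 == "parent" { print $2; exit }')

# Tree row: one "mode type id name" entry per file or subdirectory.
git cat-file -p "$tree_id"
entry_type=$(git cat-file -p "$tree_id" | awk '{ print $2; exit }')
blob_id=$(git cat-file -p "$tree_id" | awk '{ print $3; exit }')

# Blob row: just the file contents.
git cat-file -p "$blob_id"
```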

phaemon · on Dec 15, 2013

You don't actually know how git implements it, so how can you disagree with it?

There is no such thing as "an edited version" of a commit. A commit is identified by a SHA1 hash of its index of contents. If you change one bit you get a new commit.

You're a C programmer, right? If someone gave you a specification for writing a program to implement git, without telling you what it was, you'd tell them it would take 2 weeks. And that's because you'd reckon it would take 2 hours to knock out a rough version and a couple of days to clean it up.

Seriously, it's that simple. Just go learn how it works.

mikeash · on Dec 15, 2013

I understand how it works. Of course there's such thing as "an edited version" of a commit: it's a new commit that you create by taking an existing one and altering it. If you want to argue about terminology, please be my guest, but that's all your dispute is.

phaemon · on Dec 16, 2013

If you know how it works then where did your last question come from? The bit you're "confused" about?

It's obvious what the answer is if you know how it works, so what was your point exactly?

mikeash · on Dec 16, 2013

I know how git works in general. I wasn't 100% clear on the whole garbage collection aspect of it, which is hardly a central feature.

There's a difference between "has no idea how it works" and "understands the overall structure but doesn't know every single detail".

mtdewcmu · on Dec 16, 2013

It might be a better idea to just change the password.

pyre · on Dec 16, 2013

There is a point past which you can be too pedantic and be confused with a troll. I think that you may be straddling that line.

Next time, rather than just assume that the poster isn't smart enough to realize that a compromised password should be changed, maybe you could take in the fact that it's probably just an example of data that you might want to extract from your history if it's automatically there. I can think of numerous scenarios where someone might want to remove a password from the history even if it's not compromised (e.g. want to publish a private repo).

mtdewcmu · on Dec 16, 2013

I meant to raise the question of whether it's worth trying to expunge something sensitive from git. You'd have to track down all the clones of that repository. Even if you thought it was gone, it would be prudent to change the password anyway.

pseut · on Dec 16, 2013

s/password/large binary/

nknighthb · on Dec 15, 2013

1) Why are you taking a be-careful-don't-blame-me passage from a random article written by some guy as gospel?

2) All version control systems are vulnerable to data loss if you mess around with them in unusual ways. Would you say svn was fundamentally broken if somebody told you to take a backup before you screwed around with the repo?

mikeash · on Dec 15, 2013

1) It's the sort of thing I've heard many times from many people over the years.

2) The difference is that svn does not build this functionality into the main command line tool, and there is no culture of doing terrible things with svnadmin to edit svn repositories the way there is of doing terrible things with git to rewrite git history.

nknighthb · on Dec 15, 2013

I can't agree that UI quibbles and your perception of the "culture" (a perception I don't share at all) are "fundamental flaws" in git.

perlgeek · on Dec 15, 2013

One piece of data that is easy to lose, with any version control system I've worked with so far, is uncommitted data. And that's also the only data I've lost with git so far, after using it for several years. (And yes, it was my own stupidity, saying 'git checkout .' and only noticing later that there was something I wanted to keep).

The advice to take a backup doesn't hurt, and might be helpful if restoring the original state is more effort than doing it with git operations.

oneandoneis2 · on Dec 15, 2013

you don't need to make backups - git won't lose your data. I lost all hope that the article might be worth reading at that line.

mateuszf · on Dec 15, 2013

Yep, that's true. Restoring data is as simple as checking latest changes using "git reflog"

mtdewcmu · on Dec 15, 2013

I agree that git is both a great advance and seems fundamentally broken at the same time. One of git's advances is that it treats commits as snapshots of the entire tree rather than diffs[1]. A snapshot might as well be a tarball of the whole directory, except that git uses references to previous snapshots to store it efficiently. So in this aspect, git is like a backup tool plus compression. It's not quite a useful tool just for making compressed backups of source code, though, because data is buried in opaque internal files in the .git directory and can't be untangled from the commit history. You can't get at your data without going through git's tools, which means you might need to make your own backups in case git goes insane, and you can't use the backup functionality without creating indelible history.

I'm thinking that the repository could be moved out of the working directory and placed in its own file that's not invisible. If the repo was reified into a visible file, then repos would be portable and you could ftp them. The backup functionality could be separated from the history-tracking functionality, so you could make backups freely without adding noise to the commit history. A backup would basically be a tarball that you could append to a repo file, taking advantage of previous entries for compression. Commits, however they were implemented, could reference snapshots, but they needn't be 1:1.

[1] http://git-scm.com/book/ch1-3.html

eru · on Dec 17, 2013

> I'm thinking that the repository could be moved out of the working directory and placed in its own file that's not invisible.

Symlinks are your friends.

> then repos would be portable and you could ftp them.

tar might come in handy.

> The backup functionality could be separated from the history-tracking functionality, so you could make backups freely without adding noise to the commit history. A backup would basically be a tarball that you could append to a repo file, taking advantage of previous entries for compression.

You can already do this. You can have commits without ancestors or descendants in your repository, and they will still benefit from delta compression.
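One way this might look, as a rough sketch using git's plumbing commands. The `refs/backups/` namespace and the file name are assumptions made up for this example, not a git convention. Note that `git add -A` does touch the index, so this is not completely side-effect free:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

echo "base" > notes.txt
git add notes.txt
git commit -q -m "first"

# Some work we want to snapshot without adding to the branch history.
echo "work in progress" > notes.txt
git add -A

# Write the index out as a tree, then wrap it in a commit with no parent.
tree=$(git write-tree)
snap=$(git commit-tree "$tree" -m "backup snapshot")

# Keep the snapshot reachable under a made-up ref namespace.
git update-ref refs/backups/snap1 "$snap"

git rev-list --count HEAD      # commits on the branch
git rev-list --count "$snap"   # commits reachable from the snapshot
git show refs/backups/snap1:notes.txt
```

The branch history is unchanged, and the snapshot is an ordinary commit object stored alongside everything else in the repository.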

mtdewcmu · on Dec 17, 2013

The idea is to make it less monolithic. I'm digging more into the inner workings of git now. The functionality might already be in there, just not obvious.

mcv · on Dec 16, 2013

Backups are of course always a good idea, but you don't need them specifically to work with git. Git is its own backup system. If you think you might do something potentially harmful, do it in a new branch. If something goes wrong, you can always throw it away.

If something has already gone wrong, and you didn't do it in a separate branch, you can still go back to a previous situation.

Rewriting history in any serious sense (beyond a local reset or rebase for stuff that hasn't been pushed to anyone else yet) is always a bad idea. History is history for a good reason.

Of course any existing commit can always be reverted; that's not rewriting history. A revert is simply a new commit.
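To illustrate that last point, here is a small sketch in a throwaway repository (file name and messages are invented for the example). The history grows by one commit when you revert; nothing is removed:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

echo "good" > file.txt
git add file.txt
git commit -q -m "good state"
echo "bad" > file.txt
git commit -q -am "bad change"

# Revert the most recent commit; --no-edit accepts the default message.
git revert --no-edit HEAD

# The bad commit is still in the history, followed by its revert.
git --no-pager log --oneline
cat file.txt
```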

Estragon · on Dec 15, 2013

It makes it fast and easy to back out if you screw anything up. Even if the data is still there, it can be complex to pull it back out and configure it the way it was when you started (as the commands in this tutorial demonstrate.) So a fast, easy snapshot before executing complex commands is a smart move.

zimbatm · on Dec 15, 2013

git is safe but you have to know all the fancy commands like `git reflog`. I remember being puzzled by a merge conflict when I started learning git. I didn't know what it was and `git reset` or `git revert` weren't doing what I expected. All I wanted was to go back to the previous state. In the end it was easier to clone the repo and start over again.

mtdewcmu · on Dec 16, 2013

You probably wanted `git merge --abort`. It's not very clear what the various states are that git can be in. There seems to be a 'fixing merge conflict' state, and it's hard to find documentation that warns you about this state and what your options are once you're in it.
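A rough sketch of what getting into and out of that state can look like, assuming a throwaway repository with a made-up `feature` branch. Both branches change the same line, so the merge stops with a conflict:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

echo "base" > file.txt
git add file.txt
git commit -q -m "base"
main_branch=$(git symbolic-ref --short HEAD)   # whatever the default branch is called

git checkout -q -b feature
echo "feature change" > file.txt
git commit -q -am "feature"

git checkout -q "$main_branch"
echo "mainline change" > file.txt
git commit -q -am "mainline"

# The merge conflicts and exits non-zero, leaving the repo mid-merge.
git merge feature || true

# While the merge is in progress, status lists the conflicted file.
git status --short

# Back out completely, returning to the state before the merge started.
git merge --abort
git status --short
cat file.txt
```

While the merge is unfinished, git keeps a `MERGE_HEAD` file inside `.git`; its presence is one way to tell that you are in this state.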

rebelidealist · on Dec 15, 2013

Sigh, it seems to me that Git is unnecessarily complicated. I wonder what would have happened if "github" had started with HG.

tytso · on Dec 15, 2013

There are two ways things can be simple or complicated. One is to have a big button labelled "DWIM", which always does the right thing --- until it doesn't, and then you have to go out of your way to work around its assumption of what you want to do.

The other way is to have a number of simple concepts which can be combined in various powerful ways. Once you understand these simple concepts, you can compose them to do whatever you need. Git is simple the same way that RISC is simple, and having a manual transmission is simple. You can do a lot more with a manual transmission car than you can with an automatic --- but if you're not careful you can strip the gears. Yet a manual transmission is simpler to maintain, and more efficient (in the hands of someone who knows how to use it) than an automatic transmission. If you take a look at the post, you'll see that the various recipes only use a handful of git commands. Once you've mastered those commands, things are indeed quite simple.

crystaln · on Dec 15, 2013

That would be true if git's command line interface were not so inconsistent and obtuse. I agree the underlying concepts are simple, which is why the command line interface is so baffling.

Crito · on Dec 15, 2013

The standard git porcelain has its problems, but they are largely irrelevant to the question of git's simplicity. Issues like --all/-A, or the -b flag of git-commit are unfortunate, but they do not affect the underlying simplicity that tytso is talking about. That underlying simplicity is what makes git a pleasure to work with despite weird porcelain because it allows you to reason about operations in git without reasoning about what different commands are for or can do.

If you want to know if some operation can be done, you don't reason about git-reset, git-checkout, git-branch, etc.; you reason about the DAG. After you have a solid mental image of what you are attempting to do to the DAG, it is a simple matter to decompose that action into a few weird but ultimately simple incantations with the porcelain. If you are interested in optimizing how many steps you decompose operations into, then you can learn the esoterica of a few git operations, but all of the hard thinking, the real problem-solving, was done in the context of a different abstraction.

mtdewcmu · on Dec 16, 2013

It's hard to know how elegant and simple something can be made. git is a nice tool, but we shouldn't assume that it's the last word and can't be improved upon.

jordigh · on Dec 15, 2013

Hg is working on a feature that is betaish right now:

http://mercurial.selenic.com/wiki/ChangesetEvolution

It's been brewing for some time. Basically, the idea is to be able to make it easy to safely edit history collaboratively, with a consistent UI. Facebook is pumping a lot of money into hg right now, and seems particularly interested in getting this feature off the ground.

A number of pieces have been falling into place for this to occur. The first was to have phases, indicators of which commits are safe to edit collaboratively or not, a feature that some git users have wanted:

https://github.com/peff/git/wiki/SoC-2012-Ideas#published-an...

Mercurial now has this feature and uses it as part of the logic for the evolve extension. With this in place, hg is able to transmit metadata that indicates automatically which commits need to be fixed up if you want to edit a commit that someone else has also edited, or if someone edited a commit on top of which you've based other commits.

The idea is to make something like "git push --force" obsolete. History is safe to edit, and commits can't get lost, not even by accident:

http://www.infoq.com/news/2013/11/use-the-force

By the way, an epilogue to that Jenkins story is that it wasn't completely trivial to recover all lost history, and at least for some of the smaller repos, they never managed to figure out exactly which version was the canonical one.

RyanZAG · on Dec 15, 2013

I love this kind of attitude: something seems complex? Throw it out and start again!

Unfortunately, it's usually the problem domain that is complex and starting over just means you have to rediscover all of that complexity all over again. HG has more than its fair share of complicated tasks.

pseut · on Dec 15, 2013

Git is designed for project maintainers, and a lot of the complication is necessary for them (that view helps me, at least)

skylan_q · on Dec 15, 2013

Anyone unfamiliar with the most basic of workflows would find this needlessly complex. Just a couple of months ago, I would have.

Now that I have familiarity with local branching, remote branches, how the 3-way merge works (conceptually) and rebasing, this article comes off as a guide on how to do things that you wouldn't have to do too often anyway.

lukasm · on Dec 15, 2013

In many ways hg is superior, but you can't win against Linus's blessing.

rspeer · on Dec 15, 2013

Thanks, this is a useful reference.

I am sad about some of these other comments, which I might paraphrase as "This doesn't help me, and it might help people who are less skilled than me who don't deserve to be helped, therefore it's worthless". It's apparently a common sentiment on this site, but it shouldn't be.

caipre · on Dec 15, 2013

Usability note: after a few clicks through this (so my path had a few entries) I instinctively clicked up a few levels in the path expecting to be taken to that point. Instead, that entry was appended as another child.

crystaln · on Dec 15, 2013

The inability to, in any remotely easy way, remove mistakenly checked in large files and private data has always seemed like a major flaw with git.

pyre · on Dec 15, 2013

Well, the solution to other systems seems to be "it's checked in, therefore it can never be un-checked-in, so deal with it!" (or at least this is the attitude of some vocal proponents of them).

ams6110 · on Dec 15, 2013

The flaw is in having this "private" data in a public repo to begin with. If your data are private, don't put your project on github.

crystaln · on Dec 16, 2013

While I'm certain you and your organization have a perfect record of never checking inappropriate things into your git repository, mine does not. Even if all the employees at your company were perfect, there is still a chance of inappropriate information getting into the repository.

mcv · on Dec 16, 2013

Rule number one: if you're not sure what you're doing, do it in a new branch. If things go wrong, you can always delete that branch.

And you can always make a branch out of a previous situation. Gitk/gitx make this particularly easy.
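On the command line, both halves of that rule might look like the sketch below. It assumes a throwaway repository, and the branch names `experiment` and `before-second` are invented for the example:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

echo "first state" > file.txt
git add file.txt
git commit -q -m "first"
echo "second state, later" > file.txt
git commit -q -am "second"
main_branch=$(git symbolic-ref --short HEAD)   # whatever the default branch is called

# Do the risky work on its own branch.
git checkout -q -b experiment
echo "risky edit" > file.txt
git commit -q -am "risky change"

# It went wrong: switch back and throw the branch away.
git checkout -q "$main_branch"
git branch -q -D experiment

# Make a branch out of a previous situation (here, one commit back).
git branch before-second HEAD~1
git show before-second:file.txt
```

The deleted branch's commits are not erased immediately; until garbage collection removes them they can still be found through the reflog.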

elwell · on Dec 16, 2013

Sentence 2 has typo "or" -> "of"