Blog Archives

Git, Dotfiles, and Hardlinks

One handy use for Git is keeping track of your dotfiles – all of those configuration files that live inside your home directory like .screenrc, .gitconfig, .vimrc, et cetera.

A typical first approach often winds up looking something like this:

~ $ git init dotfiles
Initialized empty Git repository in /home/aiiane/dotfiles/.git/
~ $ cd dotfiles
~/dotfiles $ ln -s ~/.vimrc .
~/dotfiles $ git add .vimrc
~/dotfiles $ git commit -m "Track my vim config file"

All seems well until you look at the diff for the commit you just made and see something like this:

diff --git a/.vimrc b/.vimrc
new file mode 120000
index 0000000..6ba8edc
--- /dev/null
+++ b/.vimrc
@@ -0,0 +1 @@
+/home/aiiane/.vimrc
\ No newline at end of file

As it turns out, Git knows about symlinks and thus faithfully records the symlink as just that – a symlink – instead of recording the contents of the symlinked file. Oftentimes this leads to the thought of “well, if Git knows symlinks, then I can probably use a hardlink instead.” And thus the second attempt generally continues something like this:

~/dotfiles $ rm .vimrc
~/dotfiles $ ln ~/.vimrc .
~/dotfiles $ git add .vimrc
~/dotfiles $ git commit -m "Hardlink in my config file"

And this seems at first to work – looking at the diff, you see the contents of your config file being added; you push your commit to GitHub or whatever and your config file shows up properly there…

…until you try to have Git update the file. Perhaps you made a change somewhere else and now want to git pull, or perhaps you made a change locally that you decided you didn’t want and so you use git checkout .vimrc in your repo to change it back to your committed version. At this point you discover that while it seems to change the file in your repo, it doesn’t update the file in your home directory – the hardlink has been broken.

The reason for this is that Git never modifies files in the working tree – instead, it unlinks them and then recreates them from scratch. This inherently breaks any hardlinks that might have been present.
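If you want to watch the hardlink break, here is a minimal sketch, assuming a POSIX shell with Git installed. The scratch directory, the file contents, and the commit identity are placeholders made up for the demo. Two hardlinks to the same file share an inode number, so comparing inode numbers before and after the checkout shows the split.

```shell
# Demo: `git checkout` replaces the file, so a hardlink stops following it.
# Everything happens in a throwaway directory; all names are illustrative.
scratch=$(mktemp -d)
mkdir "$scratch/home"
git init -q "$scratch/home/dotfiles"
cd "$scratch/home/dotfiles" || exit 1
git config user.name "Example User"          # placeholder identity for the demo
git config user.email "user@example.invalid"

echo "set number" > ../.vimrc                # the "real" config file
ln ../.vimrc .vimrc                          # hardlink it into the repo
git add .vimrc
git commit -q -m "Hardlink in my config file"

# Helper: print a file's inode number.
inode() { ls -i "$1" | awk '{ print $1 }'; }

before_repo=$(inode .vimrc)
before_home=$(inode ../.vimrc)

echo "set nonsense" >> .vimrc                # an edit we then decide to discard
git checkout -- .vimrc                       # Git recreates the file from the index

after_repo=$(inode .vimrc)
after_home=$(inode ../.vimrc)
echo "before: repo=$before_repo home=$before_home"
echo "after:  repo=$after_repo home=$after_home"
cat ../.vimrc                                # the home copy still has the discarded edit
```

The two "before" numbers match and the two "after" numbers don't: the repo copy went back to the committed version, while the file in the stand-in home directory kept the edit.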

The third attempt is generally what winds up working: you realize that you can use symlinks after all, just in the other direction. Rather than symlinking files outside of the repository into it, you symlink files inside the repository out of it, into your home directory:

~/dotfiles $ rm ~/.vimrc
~/dotfiles $ cd ~
~ $ ln -s dotfiles/.vimrc .

Since most programs tend to handle symlinks transparently (unlike Git), this lets you use Git to update the actual copy of the file in the Git working tree, and have those changes reflected in the path where your programs expect to find it.
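Once you track more than one file, the linking step is worth scripting. The following is only a sketch of one way to do it: link_dotfiles is a made-up helper name, the file list is an example, and the demo runs against a throwaway directory rather than a real home directory. Unlike the commands above, it links to whatever repository path you pass in, so pass an absolute path if you want absolute links.

```shell
# link_dotfiles REPO HOME_DIR FILE...
# Symlink each named file from the repository into the home directory.
# A regular file already sitting at the destination is left alone.
link_dotfiles() {
    repo=$1; home_dir=$2; shift 2
    for name in "$@"; do
        target="$home_dir/$name"
        if [ -e "$target" ] && [ ! -L "$target" ]; then
            echo "skipping $name: a real file already exists at $target" >&2
            continue
        fi
        ln -sf "$repo/$name" "$target"       # create or refresh the symlink
    done
}

# Demo in a throwaway directory standing in for a home directory.
demo_home=$(mktemp -d)
mkdir "$demo_home/dotfiles"
echo "set number" > "$demo_home/dotfiles/.vimrc"
echo "startup_message off" > "$demo_home/dotfiles/.screenrc"
link_dotfiles "$demo_home/dotfiles" "$demo_home" .vimrc .screenrc
ls -l "$demo_home/.vimrc" "$demo_home/.screenrc"
```

Skipping existing regular files is a deliberate choice here: it means running the script on a new machine never silently destroys a config file you haven't committed yet.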

An alternative approach

Astute readers may notice that there is another possibility: why not just make your home directory the Git working tree?

~ $ git init
Initialized empty Git repository in /home/aiiane/.git/
~ $ git add .vimrc
~ $ git commit -m "Track vimrc"

While this does work, it has its own drawbacks. The most significant is probably the large number of things in your home directory that you typically don’t want to track which clutter up git status. For those who take the home-as-worktree path, a logical solution to this problem is to just ignore everything by default:

~ $ echo "*" > .gitignore
~ $ git add -f .gitignore
~ $ git commit -m "Ignore all by default"

Then only things you’ve explicitly added (and you’ll need to use git add -f the first time you add each file) will be tracked. If you do this, however, it’s harder to tell at a glance whether or not a given file is being tracked, and new files that you might want to add won’t stand out. There is a way to check, though:

~ $ git ls-files
.gitignore
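A variation that some people may find easier to live with is to re-include specific files with negated ignore patterns, and to ask Git about one path at a time with git ls-files --error-unmatch. The sketch below assumes a scratch directory standing in for your home directory; the file names, the is_tracked helper, and the commit identity are invented for the demo.

```shell
scratch=$(mktemp -d)                         # stands in for a home directory
cd "$scratch" || exit 1
git init -q
git config user.name "Example User"          # placeholder identity for the demo
git config user.email "user@example.invalid"

# Ignore everything, then re-include specific top-level files by name.
printf '%s\n' '*' '!.gitignore' '!.vimrc' > .gitignore
echo "set number" > .vimrc
echo "ls -la" > .bash_history

git add .                                    # ignored files are skipped; no -f needed
git commit -q -m "Track whitelisted dotfiles"
tracked=$(git ls-files)
echo "$tracked"

# is_tracked PATH: exit status 0 if Git tracks PATH, non-zero otherwise.
is_tracked() { git ls-files --error-unmatch -- "$1" >/dev/null 2>&1; }

if is_tracked .vimrc; then vimrc_tracked=yes; else vimrc_tracked=no; fi
if is_tracked .bash_history; then history_tracked=yes; else history_tracked=no; fi
echo ".vimrc tracked: $vimrc_tracked, .bash_history tracked: $history_tracked"
```

One caveat: Git cannot re-include a file whose parent directory is itself ignored, so files in subdirectories need additional patterns beyond what is shown here.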

The other main potential drawback of using your home directory as a working tree is that it effectively requires you to version the same files on every machine – since Git doesn’t really do partial checkouts gracefully, once a particular path in your home directory is added, it’ll be tracked anywhere that pulls the commit which added it. Most of the time this probably won’t be an issue – generally if you want to track something, you want it the same everywhere – but it’s something to bear in mind if you choose this approach.

Multiplexing Git Hooks

Git hooks are great, especially in larger companies – they allow you to do everything from pre-commit syntax checking to post-receive IRC reporting and more. The way Git implements its hooks (as scripts with specific names), however, has a drawback: it’s nontrivial to have a given hook event call multiple different scripts.

For small-time operations, this often isn’t much of a problem – everything can just get put into the same hook script. As my employer’s usage of Git has expanded along with its engineering team, however, we find ourselves wanting to be able to do more mix-and-matching of various hook functionality across different repositories. The ideal situation would be to have one script per functional item, and simply tell Git which ones to call for a given repository + event combination.

I created githook-proxy.sh for this purpose. It’s not all that complex (fewer than 20 lines of bash when you omit the comments), but it provides very handy functionality. Here’s how it works:

First, symlink the proxy script into the repository’s hooks directory, named like the hook event you want to proxy…

cd /path/to/repo/.git/hooks
ln -s /path/to/githook-proxy.sh pre-commit

Second, in the same directory symlink in each of the actual hook scripts you want to run, named like the hook event plus a suffix:

ln -s ~/check-syntax.sh pre-commit-01-check-syntax
ln -s ~/fix-style.sh pre-commit-02-fix-style

And you’re done! The proxy script will automatically find the suffixed hooks and run them in sorted order (hence the usage of -01-… and -02-…) when that hook event occurs. You can repeat the process for each other hook event you want to multiplex. For hooks where the exit code matters, the proxy still runs all the matching scripts, but aggregates the exit codes – if one or more scripts exited nonzero, the proxy will also exit nonzero.
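To give a feel for the idea, here is a rough sketch of how such a proxy could work. This is not the actual githook-proxy.sh; run_suffixed_hooks and the demo scripts are names invented for illustration. It relies on the shell expanding globs in sorted order, runs every matching executable, and remembers whether any of them failed.

```shell
# run_suffixed_hooks HOOK_DIR HOOK_NAME [ARGS...]
# Run every executable file HOOK_DIR/HOOK_NAME-* in glob (sorted) order,
# passing ARGS through. All scripts run; return non-zero if any one failed.
run_suffixed_hooks() {
    hook_dir=$1; hook_name=$2; shift 2
    status=0
    for script in "$hook_dir/$hook_name"-*; do
        # Skip non-files, non-executables, and the literal unmatched glob.
        [ -f "$script" ] && [ -x "$script" ] || continue
        "$script" "$@" || status=1
    done
    return "$status"
}

# A proxy installed (or symlinked) as .git/hooks/pre-commit could end with:
#   run_suffixed_hooks "$(dirname "$0")" "$(basename "$0")" "$@"

# Demo with two throwaway hook scripts; the first one fails on purpose.
demo_hooks=$(mktemp -d)
printf '#!/bin/sh\necho "check-syntax ran"\nexit 1\n' > "$demo_hooks/pre-commit-01-check-syntax"
printf '#!/bin/sh\necho "fix-style ran"\n' > "$demo_hooks/pre-commit-02-fix-style"
chmod +x "$demo_hooks"/pre-commit-*

if hook_output=$(run_suffixed_hooks "$demo_hooks" pre-commit); then
    hook_status=0
else
    hook_status=$?
fi
echo "$hook_output"
echo "aggregate exit status: $hook_status"
```

Note that this sketch does not replay standard input to each script, which hooks such as pre-receive would need.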

Why your company shouldn’t use Git submodules

A programmer had a version control problem and said, “I know, I’ll use submodules.” Now they have two problems.

It is not uncommon at all when working on any kind of larger-scale project with Git to find yourself wanting to share code between multiple different repositories – whether it be some core system among multiple different products built on top of that system, or perhaps a shared utility library between projects.

At first glance, Git submodules seem to be the perfect answer for this: they come built-in with Git, they act like miniature repositories (so people are already familiar with how to change them), et cetera. They even support pointing at specific versions of the shared code, so if one project doesn’t want to deal with integrating the “latest and greatest” version, it doesn’t have to.

It’s after you’ve actually worked with submodules for a while that you start to notice just how half-baked Git’s submodules system really is.

Don’t blink…

Submodules are effectively separate repositories within the directory tree of their parent repository. The only linkage between the parent and the submodule is the recorded value of the submodule’s checked-out SHA, which is stored in the parent’s commits, and changes in that recorded SHA are not automatically reflected in the submodules.

This means that if someone else updates the recorded version of a submodule and you pull their latest changes in the parent repository, your submodule repository will still be pointing to the old version of the submodule. (To update it, you’d need to run git submodule update.)

Of course, if you forget to update your submodule to the new version, it’s then quite easy to commit the old submodule version in your next parent repository commit – thus effectively reverting the submodule bump by the other developer. Given that submodule changes only show up as 2 commit lines in a diff, it’s not hard for such a change to slip by (especially if you’re a developer that tends to use git add . or git commit -a most of the time).

Many code review tools (such as Review Board) don’t support showing submodule changes in code reviews, so an accidental submodule revert isn’t likely to get noticed in review, either.
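One cheap safeguard is to check for stale submodules before committing. git submodule status prefixes a line with "+" when the checked-out commit differs from the SHA recorded in the parent, so a small filter can list the offenders. The helper name stale_submodules is made up, and the SHAs and paths in the sample input are invented; this sketch also assumes submodule paths without spaces.

```shell
# stale_submodules: read `git submodule status` output on stdin and print the
# paths whose checked-out commit differs from the SHA recorded in the parent
# (Git marks those lines with a leading "+").
stale_submodules() {
    awk '/^\+/ { print $2 }'
}

# Sample status output (made-up SHAs and paths) to show the filtering.
sample_status=' 1111111111111111111111111111111111111111 libs/core (v1.2)
+2222222222222222222222222222222222222222 libs/util (v0.9-3-g2222222)
-3333333333333333333333333333333333333333 libs/extra'

stale=$(printf '%s\n' "$sample_status" | stale_submodules)
echo "stale submodules: $stale"
```

In a real parent repository you would feed it live output instead: git submodule status --recursive | stale_submodules. Anything it prints either needs a git submodule update or is a change you actually intend to commit.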

Merging? Ha!

When Git drops into conflict resolution mode, it still doesn’t update the submodule pointers – which means that when you commit the merge after resolving conflicts, you run into the same problem as in the previous section: if you forgot to run git submodule update, you’ve just reverted any submodule commits the branch you merged in might have made.

Furthermore, Git doesn’t really handle submodule merging at all. It detects when two changes to the submodule’s SHA conflict… but that’s it. Since there’s no way to have two versions of a submodule checked out at once, it simply doesn’t try, effectively treating the entire submodule like a single binary file. It’s left to the developer to try to sort out what should be done to get a working submodule out of whatever the branch they’re merging in wanted and what their own changes required.

(If you’ve ever tried to have two people working on a binary file that’s tracked in Git, you’ll have an idea of how much of a pain it is to resolve such conflicts.)

You typically wind up settling for one of two equally distasteful options: either you have individual branches for submodule changes that mirror the parent repository’s branches (so that you can merge the submodule branches when merging the parent’s branches), or you force everyone into an effectively Subversion-style linear history of submodule updates with everyone being required to merge in previously added submodule changes before they can make their own.

There’s a reason why I know a lot of people who have nicknamed these things “sobmodules” in their frustration.

Oh, were you using that?

When you invoke git submodule update it looks in the parent repository for a SHA for each submodule, goes into those submodules, and checks out the corresponding SHAs. As would be the case if you checked out a SHA in a regular repository, this puts the submodule into a detached HEAD state.

If you then make changes in the submodule and commit them, Git will happily create the commit… and leave you still with a detached HEAD. See where this is going yet?

Say you merge in some more changes which happen to include another submodule update. If you haven’t committed your own submodule change into the parent project yet, Git won’t consider your new commit in the submodule as a conflict, and if you run git submodule update it will happily wipe out your commit without warning, replacing it with that from the branch you just merged in.

I hope you had your submodule’s reflog enabled or still have the old commit in your terminal scrollback, because otherwise, you just lost all that work you did.
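For what it's worth, the rescue usually looks something like the sketch below. It uses a plain scratch repository as a stand-in for the submodule (the mechanics of a detached HEAD are the same), and the file name, branch name, and commit identity are placeholders. It also assumes the reflog is on, which is Git's default for repositories with a working tree.

```shell
scratch=$(mktemp -d)                         # stands in for a submodule checkout
cd "$scratch" || exit 1
git init -q
git config user.name "Example User"          # placeholder identity for the demo
git config user.email "user@example.invalid"

echo one > file.txt
git add file.txt
git commit -q -m "first"
first=$(git rev-parse HEAD)

git checkout -q --detach                     # the state `git submodule update` leaves you in
echo two >> file.txt
git commit -q -am "work done on a detached HEAD"
lost=$(git rev-parse HEAD)

git checkout -q "$first"                     # a later update moves HEAD; no branch names $lost

git reflog -n 3                              # the reflog still remembers where HEAD has been
git branch rescued 'HEAD@{1}'                # HEAD@{1}: where HEAD pointed before the last move
rescued=$(git rev-parse rescued)
echo "lost commit:    $lost"
echo "rescued branch: $rescued"
```

The preventive version is simpler: run git checkout -b some-branch-name inside the submodule before committing there, so the commit is never reachable only through a detached HEAD.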

What am I supposed to do with this?

Submodules acting as almost completely independent repositories has another catch, too – you have to push changes from both the submodule and the parent repository to share with others.

Push changes from the submodule and not the parent repository? No one knows to use your new submodule changes.

Push changes from the parent repository and not the submodule? Congratulations, no one can use your new commits because they don’t have the right submodule commit available to check out.
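Git can at least guard against the second failure. The push.recurseSubmodules setting (and the matching --recurse-submodules option to git push) makes a parent push either refuse to proceed or push the submodules first. A small sketch follows, using a scratch repository as a stand-in for the parent; it does nothing about the first failure, where the parent never records the new submodule commit.

```shell
scratch=$(mktemp -d)                         # stands in for the parent repository
cd "$scratch" || exit 1
git init -q

# Refuse to push the parent if it references submodule commits that have not
# been pushed to the submodule's own remote.
git config push.recurseSubmodules check
push_mode=$(git config push.recurseSubmodules)
echo "push.recurseSubmodules = $push_mode"

# One-off equivalents on the command line:
#   git push --recurse-submodules=check       # abort if submodule commits are missing
#   git push --recurse-submodules=on-demand   # push the submodules first, then the parent
```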

Well, what else could we do?

So if submodules are such a pain, what are the alternatives? Here’s an overview of some of the most popular. Which one is best for you depends on your priorities.

Repo

Repo is a tool created by Google to manage the rather large Android project, which is spread across multiple different Git project repositories. It essentially works by providing a way to check out multiple projects (Git repositories) in parallel based on a manifest file (which basically serves the purpose that a parent repository does for Git submodules – tracking which submodule commits go together). It also provides a way to submit an atomic changeset that includes changes to multiple different projects.

The downside is that Repo doesn’t handle merging very well: it essentially expects you to rebase your changes when you want to bring in outside updates, effectively bringing things back to the equivalent of svn update. If you’re a fan of many small commits over a few large ones, this can get onerous.

Gitslave

Gitslave is a wrapper around Git that multiplexes git commits into multiple repositories. It effectively implements the “have parallel branches for each of your projects” solution to the merging problem by doing that for you – if you create a branch, it gets created everywhere. If you commit, all of your repositories create a commit, and so on.

Of course, this can get rather hectic if you have a large number of projects and start running into things like merge conflicts in 5 different repositories. It also means you potentially wind up making a lot of pointless extra branches in projects that you didn’t happen to touch while touching another project.

Git Subtree

Git Subtree is a tool that uses Git’s “subtree merge” functionality to get a similar result to submodules, but via actually storing the files in the main repository and merging in changes directly to that repository.

The upside is that you avoid all the issues with submodule merging because the contents of your subprojects are stored directly in the parent repository and thus are treated like any other tracked files when pulling and merging.

The downside is that all of your subproject files are present in the parent repository, which means you’re giving up some of the reason for originally splitting up your project repositories: having one canonical repository for a given set of shared code. If someone makes a change to a subproject, they can merge it with other changes locally, but they’d have to explicitly split that change back out of their project if they wanted to share it with other projects.

Others

A couple of other potential options are Braid and giternal, both of which offer a more svn-externals kind of external dependency linking (in the sense that you can ask it to grab the latest version of a given repository’s contents and place it in your tree).

‘git stash pop’ considered harmful

Git has a number of features designed to ease development hassle. One oft-mentioned example is git stash, which allows you to take any uncommitted changes and “stash them away.”

After changes have been stashed, there are a few options of how to get them back:

  • git stash pop takes a stashed change, removes it from the “stash stack”, and applies it to your current working tree.
  • git stash apply takes a stashed change and applies it to your current working tree (also leaving it on the “stash stack”).
  • git stash branch creates a new branch from the same commit you were on when you stashed the changes, and applies the stashed changes to that new branch.

Those who begin using stashing tend to just use the first option, pop – after all, stashing is designed to reduce development hassle and pop, which cleans up the stash you probably don’t care about anymore, has the least hassle involved. Right?

Wrong.

Sure, pop saves you a git stash drop after you’ve re-applied your stashed changes… some of the time.

For instance, say your stashed changes conflict with other changes that you’ve made since you first created the stash. Both pop and apply will helpfully trigger merge conflict resolution mode, allowing you to nicely resolve such conflicts… and neither will get rid of the stash, even though perhaps you’re expecting pop to. Since a lot of people expect stashes to just be a simple stack, this often leads to them popping the same stash accidentally later because they thought it was gone.
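You can reproduce this in a scratch repository. The sketch below stashes one edit, commits a conflicting one, and then pops; the file name, contents, and commit identity are made up for the demo.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
git init -q
git config user.name "Example User"          # placeholder identity for the demo
git config user.email "user@example.invalid"

echo "base" > notes.txt
git add notes.txt
git commit -q -m "base"

echo "stashed edit" > notes.txt
git stash > /dev/null                        # stash the change away

echo "committed edit" > notes.txt            # a change that conflicts with the stash
git commit -q -am "conflicting change"

if git stash pop > /dev/null 2>&1; then
    pop_result=clean
else
    pop_result=conflict
fi
stash_count=$(git stash list | wc -l | tr -d ' ')
echo "pop result: $pop_result, stashes remaining: $stash_count"
```

The pop reports a conflict and git stash list still shows the entry, so the next pop would apply the same change again unless you remember to git stash drop it.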

Luckily…

It’s quite possible to avoid this by simply using git stash apply consistently instead of git stash pop.

(A side note: if you apply a stash and it conflicts with an already-staged change, you can get the originally staged version of the file back via git checkout --ours <path>. This can be handy if you forget to commit before failing to apply a stash.)

However…

I’d go one step further and suggest another option for your consideration: don’t use stashing at all. One of Git’s biggest strengths is that commits and branches are cheap. Instead of creating stashes, why not just create a new branch and commit your changes on it? There are many reasons to use real branches instead of stashes:

  • You can always get rid of the branch later after merging or cherry-picking your changes off it.
  • Your changes are always in the context in which they were created (since they have branch history).
  • It’s harder to forget about things that show up in the branch list (I know lots of people who forget about stashes).
  • You can easily swap to that branch if you think of some more things you want to add to your saved changes, and then swap back.
  • Creating commits gives you more impetus to actually associate a message with them, making it easier to remember what you were doing.

Plus, you can get near stash-like functionality via commits and branches with a Git alias:

[alias]
    save = !sh -c 'export PREV=$(git symbolic-ref HEAD|cut -d/ -f3-) && git checkout -b "$1" && git commit -am "$1" && git checkout "$PREV"' -

With this git alias, you can do git save foobar and it will:

  1. Create a branch named “foobar”
  2. Commit any changes on that branch
  3. Swap you back to the branch you started on

All with a single command, just like git stash does, but with none of the drawbacks of the stash system.
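Getting the saved work back is not covered by the alias. One possible counterpart, sketched here as an assumption rather than as part of the original alias, is to cherry-pick the saved commit without committing, unstage it, and delete the branch. The demo performs the same steps as git save foobar by hand in a scratch repository so that it runs without the alias installed; the file name and commit identity are placeholders.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
git init -q
git config user.name "Example User"          # placeholder identity for the demo
git config user.email "user@example.invalid"

echo "base" > notes.txt
git add notes.txt
git commit -q -m "base"
start=$(git symbolic-ref --short HEAD)       # the branch we started on

# The same steps `git save foobar` performs:
echo "work in progress" >> notes.txt
git checkout -q -b foobar
git commit -q -am "foobar"
git checkout -q "$start"

# A possible "restore" counterpart:
git cherry-pick --no-commit foobar           # re-apply the saved change without committing
git reset -q                                 # unstage it, leaving it as a working tree edit
git branch -q -D foobar                      # the saved branch is no longer needed

restored=$(cat notes.txt)
branch_left=$(git branch --list foobar)
work_state=$(git status --porcelain)
echo "$work_state"
```

Afterwards the edit is back in the working tree as an uncommitted change, much like after a successful stash apply followed by a drop.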