Git annex

GIT-Annex is an extension/alternative to GIT that allows checking big files into a GIT repository without actually tracking their content. This sounds counter-intuitive at first, but it makes sense once you think of GIT as a way to distribute files across repositories, where it matters more that a file is there than what is in it. Think of media and binary files that don’t change often. Still not convinced? No worries. I spent a whole day playing with it without having a use-case. One night of sleep later, I finally realized how I could benefit from it.

What it does

Files added to a GIT-Annex repository are immediately moved into .git/annex/objects, a sub-directory of .git, and renamed after a hash value of their content. The old location is replaced with a symbolic link pointing to that renamed file.

All files are locked read-only by GIT-Annex and must be explicitly unlocked before any modification.

We don’t care what the content of a file is, as long as the hash value (and therefore the content) does not change.
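To make the mechanism more tangible, here is a rough imitation of it in plain shell. This is only a sketch and not how GIT-Annex is implemented: the function name annex_like_add is made up, the link is absolute where the real tool uses relative links, the real object directory is nested more deeply, and sha256sum from GNU coreutils is assumed to be installed. The key merely follows the BACKEND-sSIZE--HASH pattern that shows up in the command output further down.

```shell
#!/bin/sh
# Hypothetical helper imitating "git annex add" for a single file.
# Simplified on purpose: real git-annex nests objects in extra hash
# directories, uses relative links and records keys in a git branch.
annex_like_add() {
    file=$1
    store=${2:-.git/annex/objects}

    # Build a key from backend name, size in bytes and content hash.
    size=$(wc -c < "$file" | tr -d ' ')
    hash=$(sha256sum "$file" | cut -d ' ' -f 1)
    key="SHA256-s${size}--${hash}"

    # Move the content into the object store and lock it read-only.
    mkdir -p "$store"
    abs_store=$(cd "$store" && pwd)
    mv "$file" "$abs_store/$key"
    chmod a-w "$abs_store/$key"

    # Leave a symbolic link behind at the old location.
    ln -s "$abs_store/$key" "$file"
    printf '%s\n' "$key"
}

# Try it in a throw-away directory.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'hello annex\n' > movie.bin
key=$(annex_like_add movie.bin)

ls -l movie.bin   # now a symbolic link into .git/annex/objects
cat movie.bin     # the content is still readable through the link
echo "$key"       # the key starts with SHA256-s12-- (12 bytes of content)
```

Because the name of the stored file is derived from its content, two identical files end up as the same object, and a changed file would need a new key. That is why the content has to be unlocked before editing.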

GIT-Annex changes the usual GIT repository and commands a little. The commands to add and remove (drop) files all take the additional word _annex_, e.g.

$ git annex add <file>

The current version of GIT-Annex (2014-01) is 5.20140117.

$ git annex version
git-annex version: 5.20140117
build flags: Assistant Webapp Pairing Testsuite S3 WebDAV Inotify DBus XMPP Feeds Quvi TDFA CryptoHash
key/value backends: SHA256E SHA1E SHA512E SHA224E SHA384E SKEIN256E SKEIN512E SHA256 SHA1 SHA512 SHA224 SHA384 SKEIN256 SKEIN512 WORM URL
remote types: git gcrypt S3 bup directory rsync web webdav tahoe glacier hook external
local repository version: 5
supported repository version: 5
upgrade supported from repository versions: 0 1 2 4

Usage

  • Creating a repository:

    $ git init
    $ git config user.name <name>
    $ git config user.email <email>
    $ git annex init "repository-name"
    
  • Adding files to the repository:

    This goes significantly faster than adding the files directly to GIT, depending on the amount of data, of course. I added 25GB within 13 minutes on a system that was copying files in the background.

    $ git annex add _folder_
    $ git commit -m "Added _folder_"
    
  • Removing files permanently:

    Removing files permanently takes a bit more than just deleting them; you’re dealing with a version control system here. You delete the file like you would in GIT and then let GIT-Annex clean up the content as well. The better alternative, however, might be to simply move the data into an archive repository.

    $ git rm --cached <file>      # General Syntax
    $ git commit -m "Dropped file"
    
    $ git annex unused            # Show unused file(s)
    unused . (checking for unused data...)
      Some annexed data is no longer used by any files in the repository.
        NUMBER  KEY
        1       SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
        2       SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
      (To see where data was previously used, try: git log --stat -S'KEY')
      (To remove unwanted data: git-annex dropunused NUMBER)
    ok
    
    $ git annex dropunused 1      # Drop file 1
    dropunused 1 ok
    
    $ git annex dropunused 1-1000 # Drop all files from 1 to 1000
    
    
    $ git annex copy --unused --to backup   # Alternative
    $ git annex move --unused --to archive  # Alternative
    
  • Removing files locally:

    This removes the file content from the local repository, but not the metadata stored about it. When running _drop_, GIT-Annex tries to contact the other remotes to verify that the content is still available there. If no repository that still has the file content can be reached, the _drop_ fails. This can be overridden with --force.

    $ git annex drop <file>
    $ git annex drop --force <file>
    
  • Removing a file

    $ rm FILE
    $ git commit -a
    
  • Removing the content of removed files

    $ git annex unused
    $ git annex dropunused ...
    
  • Getting file content:

    This command contacts the other remotes and transfers the file content into the local repository. If no remote with the file content can be reached, GIT-Annex lets you know which remotes are unavailable and where they should be.

    $ git annex get [.|<file>]
    
  • Export data:

    There’s no export of data. All data is still there behind the symbolic links, so copy the files with the links dereferenced instead:

    $ cp -L <file> <destination>
    
  • Setup a remote

    A remote can be a directory or a folder on a different server that is accessible via SSH. Choose the value for <remotename> with some thought: remote names are basically labels that also describe where the data is stored. Something like blabla doesn’t really help when a repository isn’t available and you need to figure out which one and why.

    $ git remote add "<remotename>" path/to/remote
    
  • Moving/Copy files to a different location:

    Here the (clever) choice of your remote name comes back into play. You address your remotes by name, not by address, just like remotes in GIT.

    $ git annex move --to <remotename> <file>
    $ git annex copy --to <remotename> <file>
    
  • Sync repositories (Meta-data)

    This sends and receives updates about the meta-data files (and the content, depending on your settings) in your local and remote repositories.

    $ git annex sync
    
  • Setup synchronization rules

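  • Verify the data

    git fsck only checks the GIT objects; the annexed content itself can be checked against its hash with git annex fsck.

    $ git fsck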
    Checking object directories: 100% (256/256), done.
    Checking connectivity: 2249, done.
    
  • Find location of a file

    $ git annex whereis <file>
    
  • Increase number of copies

    $ echo "* annex.numcopies=3" >> .gitattributes
    
  • History

    $ git log --stat --decorate
    
  • Edit a file

    $ git annex unlock FILE
    $ git annex add FILE
    
  • Tips

    GIT-Annex already comes with [dozens of tips][3] and use-cases where people simply took it and threw it at some other problem to get it solved.
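Coming back to the permanent removal steps in the list above: the keys printed by git annex unused carry the file size in bytes in their -s field, which could be used to pick only the big entries for dropunused. The following is a small sketch of that idea, not a GIT-Annex feature. The function name unused_larger_than is made up for illustration, and the sample listing is simply the output quoted above, so the snippet runs without GIT-Annex installed.

```shell
#!/bin/sh
# Hypothetical helper: read "git annex unused" output on stdin and print
# the NUMBER of every key whose size field (-s<bytes>) is at least $1.
unused_larger_than() {
    min_bytes=$1
    awk -v min="$min_bytes" '
        # Data rows start with a number, followed by the key.
        $1 ~ /^[0-9]+$/ && match($2, /-s[0-9]+/) {
            size = substr($2, RSTART + 2, RLENGTH - 2)
            if (size + 0 >= min + 0) print $1
        }
    '
}

# Sample listing, copied from the output shown earlier in this post.
sample='unused . (checking for unused data...)
  Some annexed data is no longer used by any files in the repository.
    NUMBER  KEY
    1       SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
    2       SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
ok'

# Which unused entries are 1 MB or larger?
big=$(printf '%s\n' "$sample" | unused_larger_than 1000000)
echo "$big"
```

In a real repository the here-document would be replaced by a pipe, along the lines of git annex unused | unused_larger_than 1000000, and the printed numbers would then be handed to git annex dropunused. Since the numbers are only valid for the most recent unused run, both steps should happen right after each other.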


Installation

In the Ubuntu repositories GIT-Annex is still at version 3.2, which is quite old. It worked for me on one machine; on another I couldn’t add a single file to the repository, and I couldn’t figure out why.

Compiling GIT-Annex also seems to be kind of a hassle. There is, however, an Ubuntu PPA that delivers binaries for Ubuntu 12 and 13:

$ sudo apt-add-repository 'deb http://ppa.launchpad.net/fmarier/git-annex/ubuntu precise main'
$ sudo apt-get update && sudo apt-get install git-annex

Use-case

As mentioned earlier, I struggled to find a meaningful use-case for GIT-Annex in my setup. Until it hit me. Whenever I finish watching a series on my media-pc, I either delete it or copy it to a backup disk before deleting it. Over the years this has become a bit of a mess, and episodes have spread over several disks and folders. I’m not even sure I have a backup of everything (not good, I know, but hey: we’re talking about series here, nothing important).

GIT-Annex can jump in here for me.

  1. I can now transfer files from one storage to another much more easily, simply by naming the location where the data shall go.

  2. In a local repository that uses almost no disk space I can manage new incoming files more easily and send data from and to other locations.

  3. I have an inventory telling me where I find something when I’m looking for it. In seconds.
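The second point works because a file whose content lives elsewhere is nothing more than a symbolic link with a missing target. As a rough sketch of how such an inventory could be inspected with standard tools only, the following lists all links in a directory tree that currently have no content behind them. The function name list_missing_content and the file names in the demo are made up; GIT-Annex itself answers the same question with more detail through git annex whereis.

```shell
#!/bin/sh
# Hypothetical helper: print every symbolic link below a directory whose
# target does not exist, i.e. annexed files without local content.
list_missing_content() {
    dir=${1:-.}
    # Skip .git itself; "test -e" follows the link and fails if it dangles.
    find "$dir" -name .git -prune -o -type l ! -exec test -e {} \; -print
}

# Build a tiny fake repository: one episode is present, one is not.
demo=$(mktemp -d)
mkdir -p "$demo/series" "$demo/.git/annex/objects"
printf 'episode one\n' > "$demo/.git/annex/objects/KEY-present"
ln -s ../.git/annex/objects/KEY-present "$demo/series/ep01.mkv"
ln -s ../.git/annex/objects/KEY-absent  "$demo/series/ep02.mkv"

missing=$(list_missing_content "$demo")
printf '%s\n' "$missing"   # only series/ep02.mkv is reported
```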


Alternatives


Caveats

  • There’s only a limited number of valid GIT commands that do _not_ mess up the repository. :(

  • GIT-annex invokes a separate rsync process per file.

Conclusion

GIT-Annex seems to be useful. It took me a while to understand why and how I could benefit from it. In the end it seems quite good and reliable. It might not be as powerful as GIT, but it certainly takes more than one day to learn and master, especially with all the possibilities you have for distributing your data over the globe. The list of special remotes alone is impressive. Even standard features like encryption are available.