The git-annex assistant is being crowd funded on Kickstarter. I'll be blogging about my progress here on a semi-daily basis.

Since last post, I've worked on speeding up git annex watch's startup time in a large repository.

The problem was that its initial scan was naively staging every symlink in the repository, even though most of them are, presumably, staged correctly already. This was done in case the user copied or moved some symlinks around while git annex watch was not running -- we want to notice and commit such changes at startup.

Since I already have the stat info for each symlink, the scan can look at the ctime to see if the symlink was made recently, and only stage it if so. This sped up startup in my big repo from longer than I cared to wait (10+ minutes, or half an hour while profiling) to a minute or so. Of course, inotify events are already serviced during startup, so making the scan fast is really only important so people don't think it's a resource hog. First impressions are important. :)

But what does "made recently" mean exactly? Well, my answer is possibly overengineered, but most of it is really groundwork for things I'll need later anyway. I added a new data structure for tracking the status of the daemon, which is periodically written to disk by another thread (thread #6!) to .git/annex/daemon.status. Currently it looks like this; I anticipate adding lots more info as I move into the syncing stage:

lastRunning:1339610482.47928s
scanComplete:True

So, only symlinks made after the daemon was last running need to be expensively staged on startup. Although, as RichiH pointed out, this fails if the clock is changed. But I have been planning to have a cleanup thread anyway, that will handle this, and other potential problems, so I think that's ok.
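The decision itself is tiny. Here's a minimal sketch of the startup-scan test, using hypothetical names rather than the actual git-annex code:

```haskell
import Data.Time.Clock.POSIX (POSIXTime)

-- Hypothetical sketch: a symlink needs expensive staging only if its
-- ctime is newer than the daemon's lastRunning timestamp. With no
-- daemon.status file at all, everything must be staged.
shouldStage :: POSIXTime -> Maybe POSIXTime -> Bool
shouldStage _     Nothing            = True
shouldStage ctime (Just lastRunning) = ctime > lastRunning

main :: IO ()
main = do
  print (shouldStage 1339610500 (Just 1339610482.47928)) -- made after last run
  print (shouldStage 1339610400 (Just 1339610482.47928)) -- already handled
  print (shouldStage 1339610400 Nothing)                 -- first run: stage it
```
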

Stracing its startup scan, it's fairly tight now. There are some repeated getcwd syscalls that could be optimised out for a minor speedup.


Added the sanity check thread. Thread #7! It currently only does one sanity check per day, but the sanity check is a fairly lightweight job, so I may make it run more frequently. OTOH, it may never ever find a problem, so once per day seems a good compromise.

Currently it's only checking that all files in the tree are properly staged in git. I might make it git annex fsck later, but fscking the whole tree once per day is a bit much. Perhaps it should only fsck a few files per day? TBD

Currently any problems found in the sanity check are just fixed and logged. It would be good to do something about getting problems that might indicate bugs fed back to me, in a privacy-respecting way. TBD


I also refactored the code, which was getting far too large to all be in one module.

I have been thinking about renaming git annex watch to git annex assistant, but I think I'll leave the command name as-is. Some users might want a simple watcher and stager, without the assistant's other features like syncing and the webapp. So the next stage of the roadmap will be a different command that also runs watch.

At this point, I feel I'm done with the first phase of inotify. It has a couple known bugs, but it's ready for brave beta testers to try. I trust it enough to be running it on my live data.

Posted Fri Jun 15 11:16:42 2012

First day of Kickstarter funded work!

Worked on inotify today. The watch branch in git now does a pretty good job of following changes made to the directory, annexing files as they're added and staging other changes into git. Here's a quick transcript of it in action:

joey@gnu:~/tmp>mkdir demo
joey@gnu:~/tmp>cd demo
joey@gnu:~/tmp/demo>git init
Initialized empty Git repository in /home/joey/tmp/demo/.git/
joey@gnu:~/tmp/demo>git annex init demo
init demo ok
(Recording state in git...)
joey@gnu:~/tmp/demo>git annex watch &
[1] 3284
watch . (scanning...) (started)
joey@gnu:~/tmp/demo>dd if=/dev/urandom of=bigfile bs=1M count=2
add ./bigfile 2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.835976 s, 2.5 MB/s
(checksum...) ok
(Recording state in git...)
joey@gnu:~/tmp/demo>ls -la bigfile
lrwxrwxrwx 1 joey joey 188 Jun  4 15:36 bigfile -> .git/annex/objects/Wx/KQ/SHA256-s2097152--e5ced5836a3f9be782e6da14446794a1d22d9694f5c85f3ad7220b035a4b82ee/SHA256-s2097152--e5ced5836a3f9be782e6da14446794a1d22d9694f5c85f3ad7220b035a4b82ee
joey@gnu:~/tmp/demo>git status -s
A  bigfile
joey@gnu:~/tmp/demo>mkdir foo
joey@gnu:~/tmp/demo>mv bigfile foo
"del ./bigfile"
joey@gnu:~/tmp/demo>git status -s
AD bigfile
A  foo/bigfile

Due to Linux's inotify interface, this is surely some of the most subtle, race-heavy code that I'll need to deal with while developing the git annex assistant. But I can't start wading, need to jump off the deep end to make progress!

The hardest problem today involved the case where a directory is moved outside of the tree that's being watched. Inotify will still send events for such directories, but it doesn't make sense to continue to handle them.

Ideally I'd stop inotify watching such directories, but a lot of state would need to be maintained to know which inotify handle to stop watching. (Seems like Haskell's inotify API makes this harder than it needs to be...)

Instead, I put in a hack that will make it detect inotify events from directories moved away, and ignore them. This is probably acceptable, since this is an unusual edge case.


The notable omission in the inotify code, which I'll work on next, is staging the deletion of files. This is tricky because adding a file to the annex happens to cause a deletion event. I need to make sure there are no races where that deletion event causes data loss.

Posted Fri Jun 15 11:16:42 2012

Since my last blog, I've been polishing the git annex watch command.

First, I fixed the double commits problem. There's still some extra committing going on in the git-annex branch that I don't understand. It seems like a shutdown event is somehow being triggered whenever a git command is run by the commit thread.

I also made git annex watch run as a proper daemon, with locking to prevent multiple copies running, and a pid file, and everything. I made git annex watch --stop stop it.


Then I managed to greatly increase its startup speed. At startup, it generates "add" events for every symlink in the tree. This is necessary because it doesn't really know if a symlink is already added, or was manually added before it started, or indeed was added while it started up. Problem was that these events were causing a lot of work staging the symlinks -- most of which were already correctly staged.

You'd think it could just check if the same symlink was in the index. But it can't, because the index is in a constant state of flux. The symlinks might have just been deleted and re-added, or changed, and the index still have the old value.

Instead, I got creative. :) We can't trust what the index says about the symlink, but if the index happens to contain a symlink that looks right, we can trust that the SHA1 of its blob is the right SHA1, and reuse it when re-staging the symlink. Wham! Massive speedup!
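The trick boils down to a small decision. A sketch with hypothetical names (the real code works against git's index, not a tuple):

```haskell
-- Sketch of the sha-reuse trick: we can't trust that the index entry
-- for a file is current, but if its recorded symlink target matches
-- what we're about to stage, its blob sha must be the right sha for
-- that content, so re-hashing can be skipped.
reuseSha :: Maybe (String, String) -> String -> Maybe String
reuseSha (Just (sha, staged)) target
  | staged == target = Just sha  -- index entry looks right; trust its sha
reuseSha _ _ = Nothing           -- absent or stale; hash the blob anew

main :: IO ()
main = do
  print (reuseSha (Just ("abc123", "objects/x")) "objects/x")
  print (reuseSha (Just ("abc123", "objects/x")) "objects/y")
  print (reuseSha Nothing "objects/x")
```
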


Then I started running git annex watch on my own real git annex repos, and noticed some problems.. Like it turns normal files already checked into git into symlinks. And it leaks memory scanning a big tree. Oops..


I put together a quick screencast demoing git annex watch.

While making the screencast, I noticed that git-annex watch was spinning in strace, which is bad news for powertop and battery usage. This seems to be a GHC bug also affecting Xmonad. I tried switching to GHC's threaded runtime, which solves that problem, but causes git-annex to hang under heavy load. Tried to debug that for quite a while, but didn't get far. Will need to investigate this further.. Am seeing indications that this problem only affects ghc 7.4.1; in particular 7.4.2 does not seem to have the problem.

Posted Fri Jun 15 11:16:42 2012

After a few days otherwise engaged, back to work today.

My focus was on adding the committing thread mentioned in day 4 speed. I got rather further than expected!

First, I implemented a really dumb thread, that woke up once per second, checked if any changes had been made, and committed them. Of course, this rather sucked. In the middle of a large operation like untarring a tarball, or rm -r of a large directory tree, it made lots of commits and made things slow and ugly. This was not unexpected.

So next, I added some smarts to it. First, I wanted to stop it waking up every second when there was nothing to do, and instead block waiting on a change occurring. Second, I wanted it to know when past changes happened, so it could detect batch mode scenarios, and avoid committing too frequently.

I played around with combinations of various Haskell thread communications tools to get that information to the committer thread: MVar, Chan, QSem, QSemN. Eventually, I realized all I needed was a simple channel through which the timestamps of changes could be sent. However, Chan wasn't quite suitable, and I had to add a dependency on Software Transactional Memory, and use a TChan. Now I'm cooking with gas!
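Roughly, the shape of that channel looks like this. A sketch, not the actual module; the drain helper is my own illustrative name:

```haskell
import Control.Concurrent.STM
import Data.Time.Clock (UTCTime, getCurrentTime)

-- Sketch: watcher threads write a timestamp per change; the committer
-- thread blocks on the first one, then drains whatever else has
-- accumulated, giving it the change history it needs for batch detection.
drain :: TChan a -> STM [a]
drain chan = do
  done <- isEmptyTChan chan
  if done
    then return []
    else (:) <$> readTChan chan <*> drain chan

main :: IO ()
main = do
  chan <- newTChanIO :: IO (TChan UTCTime)
  now <- getCurrentTime
  -- a watcher thread records two changes:
  atomically $ writeTChan chan now
  atomically $ writeTChan chan now
  -- the committer thread picks them all up in one go:
  first <- atomically $ readTChan chan
  rest  <- atomically $ drain chan
  print (length (first : rest))
```
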

With that data channel available to the committer thread, it quickly got some very nice smart behavior. Playing around with it, I find it commits instantly when I'm making some random change that I'd want the git-annex assistant to sync out instantly; and that its batch job detection works pretty well too.

There's surely room for improvement, and I made this part of the code be an entirely pure function, so it's really easy to change the strategy. This part of the committer thread is so nice and clean, that here's the current code, for your viewing pleasure:

[[!format haskell """
{- Decide if now is a good time to make a commit.
 - Note that the list of change times has an undefined order.
 -
 - Current strategy: If there have been 10 commits within the past second,
 - a batch activity is taking place, so wait for later.
 -}
shouldCommit :: UTCTime -> [UTCTime] -> Bool
shouldCommit now changetimes
	| len == 0 = False
	| len > 4096 = True -- avoid bloating queue too much
	| length (filter thisSecond changetimes) < 10 = True
	| otherwise = False -- batch activity
	where
		len = length changetimes
		thisSecond t = now `diffUTCTime` t <= 1
"""]]
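Since it's a pure function, it's trivial to exercise directly. A self-contained sanity check (the burst helper is mine; the real committer thread feeds it live timestamps):

```haskell
import Data.Time.Clock

shouldCommit :: UTCTime -> [UTCTime] -> Bool
shouldCommit now changetimes
  | len == 0 = False
  | len > 4096 = True -- avoid bloating queue too much
  | length (filter thisSecond changetimes) < 10 = True
  | otherwise = False -- batch activity
  where
    len = length changetimes
    thisSecond t = now `diffUTCTime` t <= 1

main :: IO ()
main = do
  now <- getCurrentTime
  -- n changes, spaced 10ms apart, all within the last second
  let burst n = map (\i -> addUTCTime (fromIntegral i * (-0.01)) now) [1 .. n :: Int]
  print (shouldCommit now [])         -- no changes: nothing to commit
  print (shouldCommit now (burst 3))  -- a few changes: commit right away
  print (shouldCommit now (burst 50)) -- batch activity: hold off
```
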

Still some polishing to do to eliminate minor inefficiencies and deal with more races, but this part of the git-annex assistant is now very usable, and will be going out to my beta testers soon!

Posted Fri Jun 15 11:16:42 2012

Only had a few hours to work today, but my current focus is speed, and I have indeed sped up parts of git annex watch.

One thing folks don't realize about git is that despite a rep for being fast, it can be rather slow in one area: Writing the index. You don't notice it until you have a lot of files, and the index gets big. So I've put a lot of effort into git-annex in the past to avoid writing the index repeatedly, and queue up big index changes that can happen all at once. The new git annex watch was not able to use that queue. Today I reworked the queue machinery to support the types of direct index writes it needs, and now repeated index writes are eliminated.

... Eliminated too far, it turns out, since it doesn't yet ever flush that queue until shutdown! So the next step here will be to have a worker thread that wakes up periodically, flushes the queue, and autocommits. (This will, in fact, be the start of the syncing phase of my roadmap!) There's lots of room here for smart behavior. Like, if a lot of changes are being made close together, wait for them to die down before committing. Or, if it's been idle and a single file appears, commit it immediately, since this is probably something the user wants synced out right away. I'll start with something stupid and then add the smarts.

(BTW, in all my years of programming, I have avoided threads like the nasty bug-prone plague they are. Here I already have three threads, and am going to add probably 4 or 5 more before I'm done with the git annex assistant. So far, it's working well -- I give credit to Haskell for making it easy to manage state in ways that make it possible to reason about how the threads will interact.)

What about the races I've been stressing over? Well, I have an ulterior motive in speeding up git annex watch, and that's to also be able to slow it down. Running in slow-mo makes it easy to try things that might cause a race and watch how it reacts. I'll be using this technique when I circle back around to dealing with the races.

Another tricky speed problem came up today that I also need to fix. On startup, git annex watch scans the whole tree to find files that have been added or moved etc while it was not running, and take care of them. Currently, this scan involves re-staging every symlink in the tree. That's slow! I need to find a way to avoid re-staging symlinks; I may use git cat-file to check if the currently staged symlink is correct, or I may come up with some better and faster solution. Sleeping on this problem.


Oh yeah, I also found one more race bug today. It only happens at startup and could only make it miss staging file deletions.

Posted Fri Jun 15 11:16:42 2012
git merge watch_

My cursor has been mentally poised here all day, but I've been reluctant to merge watch into master. It seems solid, but is it correct? I was able to think up a lot of races it'd be subject to, and deal with them, but did I find them all?

Perhaps I need to do some automated fuzz testing to reassure myself. I looked into using genbackupdata to that end. It's not quite what I need, but could be moved in that direction. Or I could write my own fuzz tester, but it seems better to use someone else's, because a) laziness and b) they're less likely to have the same blind spots I do.

My reluctance to merge isn't helped by the known bugs with files that are either already open before git annex watch starts, or are opened by two processes at once, and confuse it into annexing the still-open file when one process closes it.

I've been thinking about just running lsof on every file as it's being annexed to check for that, but in the end, lsof is too slow. Since its check involves trawling through all of /proc, it takes it a good half a second to check a file, and adding 50 seconds to the time it takes to process 100 files is just not acceptable.

But an option that could work is to run lsof after a bunch of new files have been annexed. It can check a lot of files nearly as fast as a single one. In the rare case that an annexed file is indeed still open, it could be moved back out of the annex. Then when its remaining writer finally closes it, another inotify event would re-annex it.
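The bookkeeping for that would be simple. A sketch, assuming one batched lsof run has already produced the set of files still held open (names here are illustrative, not git-annex code):

```haskell
import Data.List (partition)
import qualified Data.Set as S

-- Sketch: after annexing a batch of files, one lsof pass reports which
-- of them are still open; those get moved back out of the annex, and
-- the rest stay annexed.
checkBatch :: S.Set FilePath -> [FilePath] -> ([FilePath], [FilePath])
checkBatch open = partition (`S.member` open)

main :: IO ()
main = do
  let (unannex, keep) = checkBatch (S.fromList ["b"]) ["a", "b", "c"]
  print unannex -- still open: move back out of the annex
  print keep    -- safely annexed
```
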

Posted Fri Jun 15 11:16:42 2012

Today I worked on the race conditions, and fixed two of them. Both were fixed by avoiding using git add, which looks at the files currently on disk. Instead, git annex watch injects symlinks directly into git's index, using git update-index.
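git update-index accepts index entries on stdin via --index-info. A sketch of building one feed line for a symlink (mode 120000), assuming the blob sha has already been computed (e.g. with git hash-object); the sha below is a stand-in:

```haskell
-- Sketch: one line of `git update-index --index-info` input, in the
-- ls-tree style format, staging a symlink (mode 120000) without ever
-- looking at the file on disk.
indexLine :: String -> FilePath -> String
indexLine sha file = "120000 blob " ++ sha ++ "\t" ++ file

main :: IO ()
main = putStrLn (indexLine "0123456789abcdef0123456789abcdef01234567" "bigfile")
```
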

There is one bad race condition remaining. If multiple processes have a file open for write, one can close it, and it will be added to the annex. But then the other can still write to it.


Getting away from race conditions for a while, I made git annex watch not annex .gitignore and .gitattributes files.

And, I made it handle running out of inotify descriptors. By default, /proc/sys/fs/inotify/max_user_watches is 8192, and that's how many directories inotify can watch. Now when it needs more, it will print a nice message showing how to increase it with sysctl.

FWIW, DropBox also uses inotify and has the same limit. It seems to not tell the user how to fix it when it goes over. Here's what git annex watch will say:

Too many directories to watch! (Not watching ./dir4299)
Increase the limit by running:
  echo fs.inotify.max_user_watches=81920 | sudo tee -a /etc/sysctl.conf; sudo sysctl -p
Posted Fri Jun 15 11:16:42 2012

Kickstarter is over. Yay!

Today I worked on the bug where git annex watch turned regular files that were already checked into git into symlinks. So I made it check if a file is already in git before trying to add it to the annex.

The tricky part was doing this check quickly. Unless I want to write my own git index parser (or use one from Hackage), this check requires running git ls-files, once per file to be added. That won't fly if a huge tree of files is being moved or unpacked into the watched directory.

Instead, I made it only do the check during git annex watch's initial scan of the tree. This should be ok, because once it's running, you won't be adding new files to git anyway, since it'll automatically annex new files. This is good enough for now, but there are at least two problems with it:

  • Someone might git merge in a branch that has some regular files, and it would add the merged in files to the annex.
  • Once git annex watch is running, if you modify a file that was checked into git as a regular file, the new version will be added to the annex.

I'll probably come back to this issue, and may well find myself directly querying git's index.


I've started work to fix the memory leak I see when running git annex watch in a large repository (40 thousand files). As always with a Haskell memory leak, I crack open Real World Haskell's chapter on profiling.

Eventually this yields a nice graph of the problem:

memory profile

So, looks like a few minor memory leaks, and one huge leak. Stared at this for a while, tried a few things, and got a much better result:

memory profile

I may come back later and try to improve this further, but it's not bad memory usage. But, it's still rather slow to start up in such a large repository, and its initial scan is still doing too much work. I need to optimize more..

Posted Fri Jun 15 11:16:42 2012

Last night I got git annex watch to also handle deletion of files. This was not as tricky as feared; the key is using git rm --ignore-unmatch, which avoids most problematic situations (such as a just deleted file being added back before git is run).

Also fixed some races when git annex watch is doing its startup scan of the tree, which might be changed as it's being traversed. Now only one thread performs actions at a time, so inotify events are queued up during the scan, and dealt with once it completes. It's worth noting that inotify can only buffer so many events... which might have been a problem, except for a very nice feature of Haskell's inotify interface: it has a thread that drains the limited inotify buffer and does its own buffering.


Right now, git annex watch is not as fast as it could be when doing something like adding a lot of files, or deleting a lot of files. For each file, it currently runs a git command that updates the index. I did some work toward coalescing these into one command (which git annex already does normally). It's not quite ready to be turned on yet, because of some races involving git add that become much worse if it's delayed by event coalescing.


And races were the theme of today. Spent most of the day really getting to grips with all the fun races that can occur between modification happening to files, and git annex watch. The inotify page now has a long list of known races, some benign, and several, all involving adding files, that are quite nasty.

I fixed one of those races this evening. The rest will probably involve moving away from using git add, which necessarily examines the file on disk, to directly shoving the symlink into git's index.

BTW, it turns out that dvcs-autosync has grappled with some of these same races: http://comments.gmane.org/gmane.comp.version-control.home-dir/665 I hope that git annex watch will be in a better place to deal with them, since it's only dealing with git, and with a restricted portion of it relevant to git-annex.

It's important that git annex watch be rock solid. It's the foundation of the git annex assistant. Users should not need to worry about races when using it. Most users won't know what race conditions are. If only I could be so lucky!

Posted Fri Jun 15 11:16:42 2012