Making redo rebuild when directory contents change

I recently discovered redo, a replacement for make. It is based on a totally different and far simpler concept of rebuilding, but it has one deficiency in common with Makefiles: There's no built-in way to rebuild when a directory has been modified.

Updated 2020-09-10 with a much, much faster version.

In my situation, I have a Java project built with Maven and I want to rebuild a "fat jar" (a jar file containing compiled code along with all dependencies) every time either the dependencies or the source code change. Directories and open-ended lists like these don't really fall inside of redo's build model; there's no way to declare "this jar file depends on whatever files happen to be in this dir". Here's the naïve do-file that I used at first, the one redo runs for redo-ifchange spelunk.fat.jar:

redo-ifchange pom.xml
redo-ifchange src/ # XXX -- does nothing!
mvn package >&2
cp target/spelunk-*-jar-with-dependencies.jar "$3"

If you're not familiar with redo, I'll explain what's happening here. This is just an ordinary shell script. The first line is a call to redo-ifchange that declares a dependency on pom.xml. If that file doesn't exist or its own dependencies have changed, it is rebuilt, as long as there's a do-file for it. (There isn't one, but I could write one.) The second line is similar: redo sees that the directory exists and therefore does nothing, but marks it as a dependency. Then we get into the meat of the file. Maven builds the jar file, with its stdout redirected to stderr so that it doesn't end up in the target, and the resulting file is copied to the temp path redo supplied as argument 3. Afterwards, redo will move that file into the correct location.

As is, this script partially works. If I change the pom.xml, redo notices and an invocation of redo-ifchange spelunk.fat.jar will cause this script to be run. But if I change a file in the src directory, redo doesn't see it as a changed dependency. The directory hasn't changed, after all! Just some file down inside it.
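Here's a minimal sketch of why the directory looks unchanged. It runs outside of redo, assumes GNU stat (for -c), and uses a throwaway temp tree whose layout is made up for the demo. It compares a few of the directory's stat fields before and after editing a file nested inside it:

```shell
#!/bin/sh
set -eu

# Throwaway tree standing in for the project's src/ directory (hypothetical layout).
work=$(mktemp -d)
mkdir -p "$work/src/main"
echo 'class A {}' > "$work/src/main/A.java"

# Modification time, size, and inode of the src/ directory itself.
before=$(stat -c '%Y %s %i' "$work/src")

# Edit a file nested inside src/.
echo '// edited' >> "$work/src/main/A.java"

after=$(stat -c '%Y %s %i' "$work/src")

if [ "$before" = "$after" ]; then
    echo "src/ looks unchanged: $after"
fi

rm -rf "$work"
```

Appending to the file changes the file's own metadata, but nothing about the src/ directory entry, so a check that only looks at the directory has nothing to notice.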

My solution is to create a do-file for a new target, .src-stamp:

redo-always
find src/ -printf '%t %s %i %m %U %G %P\0' | redo-stamp

The first time I call redo-ifchange .src-stamp, redo will run this script, which collects relevant metadata from everything in the src file tree. The listing is piped to redo-stamp, which hashes its input and makes a note of it. "Stamped" targets are special, declaring that redo should skip its usual change-detection protocol and instead take the "stamp" value in lieu of timestamps, inode values, file sizes, etc. Marking the target with redo-always then ensures that the stamp is recalculated every time. (I'm actually a little hazy on the semantics of precisely when redo-always is needed with stamping. In any case, this is what you want.)
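To see what the stamp reacts to, here's a small sketch that can be run outside of redo. It assumes GNU find (for -printf) and uses sha256sum purely as a stand-in for redo-stamp's hashing; the stamp function and the temp tree are made up for the demo.

```shell
#!/bin/sh
set -eu

# Throwaway tree standing in for the project's src/ directory (hypothetical layout).
work=$(mktemp -d)
mkdir -p "$work/src/main"
echo 'class A {}' > "$work/src/main/A.java"

# Hash the same metadata listing that the do-file pipes to redo-stamp.
# sha256sum is only a stand-in, not a claim about redo-stamp's internals.
stamp() {
    find "$1" -printf '%t %s %i %m %U %G %P\0' | sha256sum | cut -d' ' -f1
}

first=$(stamp "$work/src")
second=$(stamp "$work/src")    # nothing touched in between

echo 'class B {}' > "$work/src/main/B.java"
third=$(stamp "$work/src")     # a new file has appeared

if [ "$first" = "$second" ]; then
    echo "no change: same stamp"
fi
if [ "$first" != "$third" ]; then
    echo "file added: different stamp"
fi

rm -rf "$work"
```

The listing includes each entry's relative path, so a newly added file changes the hash even if every existing file is untouched.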

The .src-stamp file is never written, because this do-file never writes to stdout or to $3. It exists purely as a sort of virtual dependency. And now it can be used in my main do-file:

redo-ifchange pom.xml .src-stamp
mvn package >&2
cp target/spelunk-*-jar-with-dependencies.jar "$3"

The call to redo-ifchange .src-stamp says "redo the jar if this .src-stamp has changed" but of course redo can't know if the target (or its theoretical dependencies) changed until it runs the script. That's the point of stamping—"leave that to me, I'll tell you". So redo will always call and recompute the stamp. (You'll note that I've also combined the two ifchange lines into one. This allows parallelization if that later becomes useful.)

One key aspect of this problem is that you can't tell in advance what files will be used in making the jar file. redo actually has a clever way of side-stepping this in some situations. For example, if your build tool has a way of logging what files it used, your do-file can throw those into redo-ifchange afterwards, and redo will make a note of that. But that approach can't notice when a new build-affecting file has been added while no other files changed, which can matter for Java compilation. Hence the hack.
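As a rough sketch of that "log the dependencies afterwards" trick: the snippet below parses a hand-written dependency log in the Makefile-style "target: dep dep ..." format (the kind that gcc -MD -MF writes) and builds the redo-ifchange call a do-file would make. The file names are hypothetical, and the script only prints the command so that it can run without redo installed.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)

# Hypothetical dependency log. A real do-file would get this from the
# build tool rather than writing it by hand.
printf '%s\n' 'hello.o: hello.c hello.h util.h' > "$work/hello.d"

# Plain read (no -r) also joins backslash-continued lines.
read deps < "$work/hello.d"

# Drop everything up to and including the first colon, leaving the prerequisites.
deps=${deps#*:}

# Intentional word splitting: one positional parameter per dependency.
set -- $deps

# Inside a do-file this would be an actual call: redo-ifchange "$@"
declared="redo-ifchange $*"
echo "$declared"

rm -rf "$work"
```

This works when the tool reports what it read, but a file that was never read (because it didn't exist yet) can't show up in such a log, which is the gap the stamp approach covers.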

But what's the performance like? The first pass of this code was only usable on relatively small file trees, but the optimized version above is really quite fast. Here's a history of what I used and how fast it could stamp a directory containing every git repo I've cloned, a 100,000-file directory tree weighing in at 2.3 GB. The timings were acquired on a 10-year-old laptop with a hard disk, so your numbers might be better.

  • My original code was find ~/repos -type f -print0 | sort -z | xargs -0 -l1 sha256sum, and it took 66 seconds to run. This checks only file contents, not file mode or other attributes, which might matter for some use-cases.
  • Next I decided to switch to stat on all files, since that seemed like it would be faster: find ~/repos -print0 | sort -z | xargs -0 -l1 stat. It actually took more than twice as long, 144 seconds, and it's not even looking at file contents. (Skipping contents is fine except in really weird situations, and stat covers everything redo itself uses.) But there were other improvements to make.
  • Invoking the stat process once for each file is horribly inefficient, and calling stat file1; stat file2 produces the exact same output as stat file1 file2. So I told xargs to pass it 1000 at a time: find ~/repos -print0 | sort -z | xargs -0 -l1000 stat. That took 6 seconds! That's very respectable.
  • There's no good reason to run sort on the file list. The output of find seems to be deterministic when files aren't changing, and a very occasional false positive due to periodic disk compactions would be acceptable. If your use-case involves a weird filesystem where this doesn't hold true, maybe don't go down this road, though. Without the sort, find ~/repos -print0 | xargs -0 -l1000 stat gives me 4.5 seconds, which isn't a very dramatic improvement. (For comparison, making the same improvements up to this point with the file contents hashing approach gives a 15 second run: find ~/repos -type f -print0 | xargs -0 -l1000 sha256sum.)
  • However, removing sort also means I can use find's printf action to obviate having to call stat. This avoids a process call and find already has a file handle open. I can also specify which pieces of information I want, although I could have done that with stat as well. I'm not actually sure which of those things makes a difference, but the final version find ~/repos -printf '%t %s %i %m %U %G %P\0' runs in just 0.6 seconds, even on this huge file tree.
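The batching step is easy to check on a small scale. This sketch assumes GNU find, xargs, and stat, and runs on a made-up temp tree; it uses xargs -n to set the batch size and a stat format without access times, so the two runs are directly comparable. It confirms that one stat process per file and one stat process for the whole batch print the same thing:

```shell
#!/bin/sh
set -eu

# Small throwaway tree (hypothetical contents).
work=$(mktemp -d)
for i in 1 2 3 4 5; do
    echo "file $i" > "$work/f$i.txt"
done

# mtime, size, inode, mode, uid, gid, name: roughly the fields the find -printf version prints.
fmt='%Y %s %i %a %u %g %n'

# One stat process per file...
one_each=$(find "$work" -print0 | xargs -0 -n 1 stat -c "$fmt")

# ...versus up to 1000 files per stat process.
batched=$(find "$work" -print0 | xargs -0 -n 1000 stat -c "$fmt")

if [ "$one_each" = "$batched" ]; then
    echo "same output either way"
fi

rm -rf "$work"
```

On a tree this small the timing difference is invisible; the saving only shows up when there are enough files for process startup to dominate.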

Less than a second for all of my git repos. Not bad at all.
