A coding story: Dead reckoning vs. diffs

Once upon a time, I wrote a photo gallery in PHP, accompanied by a set of PHP and SQL scripts to populate the site from a KimDaBa (now KPhotoAlbum) database. The code is *terrible*, and when it came time to adjust how I managed photo tags, I couldn't stomach editing the PHP. Slowly but surely, I've been rewriting my photo gallery software. The website is in Clojure, and mostly up to feature parity. The tricky part is the updater -- because of an odd quirk of KPhotoAlbum.

This is the story of trying to solve a problem, discovering a flaw, and then seeing the problem in another light. The intended audience is unclear. :-)

A quick note on workflow

The kpawebgen software currently has three phases: read mirrors the KPA index into a shadow DB with some additions; localweb creates a gallery DB and image files for a local copy of the photo gallery; and pub-s3 publishes the gallery DB and images to Amazon S3 for the production photo gallery.

Shifting sands

KPhotoAlbum (or KPA) is a photo management application that maintains an index.xml file with entries for all the images in a directory tree. Each image can be tagged; each tag lives in a category, and tags can have supertags such that they form a directed acyclic graph within the category. For example, a photo might have the tag "Boston, MA" in the "Location" category, and that tag might have a supertag of "Massachusetts". Filtering on either tag will find the photo. It's a very powerful feature, but that's not what I'm here to talk about today. The trouble is one of identity.
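
To make the supertag idea concrete, here's a toy model of tag implication in Python (my own simplified representation, not KPA's actual index.xml schema):

    # A simplified model of supertags: a DAG of tags within each category.
    # Illustrative only; KPA's real index.xml schema is different.
    SUPERTAGS = {
        ("Location", "Boston, MA"): {"Massachusetts"},
        ("Location", "Massachusetts"): {"USA"},
    }

    def implied_tags(category, tag):
        """All tags implied by `tag`, following supertag edges transitively."""
        seen, stack = set(), [tag]
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            stack.extend(SUPERTAGS.get((category, t), ()))
        return seen

    # A photo tagged "Boston, MA" also matches filters on its supertags:
    assert implied_tags("Location", "Boston, MA") == {
        "Boston, MA", "Massachusetts", "USA"}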

KPA stores two (relevant) things about each photo: The path on disk to the image, and the MD5 hash of its contents. (Notably, KPA does not modify images to store metadata—the hash should stay constant unless explicit edits are made.) If the image moves between two KPA sessions, KPA recognizes that the "new" image and the "missing" image have the same hash, and updates the path field in the XML element. If the image is edited and the user clicks "recalculate checksums", KPA sees that the checksum has changed and updates that field. And here lies the problem: These two fields constitute a composite ID for the image, and the two parts can change. There is no permanent, unique ID, and that's going to make it tough to have stable URLs for a photo gallery!

The shadow database

This is where the Python portion of kpawebgen comes in. The read action maintains a shadow database that represents the KPA index with a little extra metadata: Unique, persistent IDs for the images, a note on when each image first appeared, and a changelog. When read is run, kpawebgen compares index.xml to shadow.db3 and computes differences. For all mismatches, if a hash is present in both but has a differing path, it is recorded in the changelog as a move. Conversely, the same path but different hash is recorded as an edit. Remaining mismatches are recorded as create and delete. (For completeness, images that have been marked by KPA as rotated also have rotation changes recorded in the changelog.) Over time, the changelog should consist mainly of creates, with a handful of deletes, edits, and moves.
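
A sketch of that classification logic, treating each side as a simple {path: md5} mapping (the names here are mine, not kpawebgen's actual code):

    def classify_changes(kpa, shadow):
        """Diff two {path: md5} snapshots into changelog entries.

        Same hash, new path -> move; same path, new hash -> edit;
        anything left unmatched -> create or delete. Sketch only.
        (Rotations come from a separate KPA field and are omitted here.)
        """
        kpa_by_hash = {h: p for p, h in kpa.items()}
        changes, matched_old, matched_new = [], set(), set()

        for old_path, old_hash in shadow.items():
            new_path = kpa_by_hash.get(old_hash)
            if new_path is not None and new_path != old_path:
                changes.append(("move", old_path, new_path))
                matched_old.add(old_path)
                matched_new.add(new_path)
            elif old_path in kpa:
                if kpa[old_path] != old_hash:
                    changes.append(("edit", old_path, kpa[old_path]))
                matched_old.add(old_path)
                matched_new.add(old_path)

        changes += [("create", p) for p in kpa if p not in matched_new]
        changes += [("delete", p) for p in shadow if p not in matched_old]
        return changes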

Given a shadow database, it's not too hard to produce a gallery: Make a modified copy of the shadow database (only include photos tagged public, and not private; do some joins on tags and categories to make them more queryable; filter out private tags from some categories) and generate sized photos (thumbnail, medium, original). Shove the images out to a CDN and ship the gallery database out where the gallery code can see it. Done. Except... do you really want to generate 12000 images each time you publish the gallery? And pay the time and money costs to upload them to a CDN? This would be very wasteful. How to avoid unnecessary work?
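
The filtering step might look something like this (the table names, columns, and the 'public'/'private' tag names are placeholders, not the real shadow DB schema):

    import sqlite3

    def build_gallery_db(shadow_path, gallery_path):
        """Copy public, non-private photos from the shadow DB into a
        fresh gallery DB. Hypothetical schema, for illustration only."""
        src = sqlite3.connect(shadow_path)
        dst = sqlite3.connect(gallery_path)
        dst.execute(
            "CREATE TABLE image (id INTEGER PRIMARY KEY, path TEXT, hash TEXT)")
        rows = src.execute("""
            SELECT i.id, i.path, i.hash FROM image i
            WHERE EXISTS (SELECT 1 FROM image_tag t
                          WHERE t.image_id = i.id AND t.tag = 'public')
              AND NOT EXISTS (SELECT 1 FROM image_tag t
                              WHERE t.image_id = i.id AND t.tag = 'private')
        """)
        dst.executemany("INSERT INTO image VALUES (?, ?, ?)", rows)
        dst.commit()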

Dead reckoning

My current approach is to use dead reckoning. This is what the localweb task currently does. I store the shadow DB's last changelog ID in a metadata table of the local copy of the gallery DB, so localweb can see what needs to change. The task updates the regular image metadata, then walks the changelog: Deletes cause sized images to be deleted; creates, edits, and rotates cause sized images to be created or overwritten. Similarly, the pub-s3 task stores a last changelog ID and replays the changes by uploading images to S3 or deleting them from the bucket. This is all very well and good, except... I looked at my gallery, and some images are missing!
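
In rough Python, the replay looks like this (the entry format and helper functions are invented for the sketch):

    def delete_sized(image_id):
        """Remove all sized variants of an image (stub for illustration)."""

    def render_sized(image_id):
        """(Re)generate thumbnail/medium/original for an image (stub)."""

    def replay(changelog, last_seen_id):
        """Apply changelog entries newer than last_seen_id, in order."""
        for entry_id, kind, image_id in changelog:
            if entry_id <= last_seen_id:
                continue
            if kind == "delete":
                delete_sized(image_id)
            else:  # create, edit, rotate
                render_sized(image_id)
            last_seen_id = entry_id
        return last_seen_id  # persisted in the gallery DB's metadata table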

This doesn't mean there's a bug in the code. It means there is or was a bug, and now the shadow DB and the sized images are out of sync. Even if the code is perfect now, the dead reckoning approach means that once I'm "off course" (out of sync), I'll stay out of sync.

This is the current state of things. Beyond this is speculation and planning.

Diffs

My first inclination was to add a cleanup or verification step for debugging, but I realized that if I wrote that code, it would actually be sufficient to replace the dead reckoning code. However, I need to make a change in how sized images are stored on disk and in the DB. Currently the filenames only have the image ID and the type (thumbnail, medium, or original), which isn't enough to tell me if a change needs to happen. I'm planning on changing them to a format that allows me to detect what changes have occurred simply by inspecting the DB and the directory listing, namely id_[image ID]-src_[original image hash]-rot_[rotation]-cnf_[config hash]-dim_[size variant].jpg (example: id_123-src_45abc67-rot_90-cnf_89def01-dim_med.jpg). If an image is edited, the image hash changes; if the rotation metadata in KPA's index changes, so does the corresponding filename; each image is still marked with the max size for that variant; and the configuration in effect for thumbnailing and watermarking is hashed and branded on the filename as well.
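
A sketch of building and parsing that format (the helper functions are hypothetical; the filename layout is the one described above):

    import re

    def sized_filename(image_id, src_hash, rotation, config_hash, variant):
        """Build the planned self-describing filename."""
        return (f"id_{image_id}-src_{src_hash}-rot_{rotation}"
                f"-cnf_{config_hash}-dim_{variant}.jpg")

    FILENAME_RE = re.compile(
        r"id_(?P<id>\d+)-src_(?P<src>[0-9a-f]+)-rot_(?P<rot>\d+)"
        r"-cnf_(?P<cnf>[0-9a-f]+)-dim_(?P<dim>\w+)\.jpg")

    def parse_sized_filename(name):
        """Inverse of sized_filename; None for files that don't match."""
        m = FILENAME_RE.fullmatch(name)
        return m.groupdict() if m else None

    assert sized_filename(123, "45abc67", 90, "89def01", "med") == \
        "id_123-src_45abc67-rot_90-cnf_89def01-dim_med.jpg"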

The algorithm will go as follows (sketched in code after the list):

  1. Hash the configuration for thumbnailing and watermarking in a consistent way, as the config hash. (Might have one hash per size variant.)
  2. For all images to be displayed, compute the target filename and store in DB.
  3. Find all files in the target directory matching the pattern.
  4. Where there are mismatches, write and delete files.
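
Here's the whole loop as a sketch (helper names and the choice of SHA-1 for the config hash are my assumptions):

    import hashlib
    import json
    from pathlib import Path

    def config_hash(config):
        """Step 1: hash the thumbnail/watermark config consistently.
        JSON with sorted keys gives a stable serialization."""
        blob = json.dumps(config, sort_keys=True).encode()
        return hashlib.sha1(blob).hexdigest()[:7]

    def sync_sized_images(wanted, directory):
        """Steps 2-4: diff target filenames against the directory.

        `wanted` maps target filename -> a render callable; any other
        file matching the pattern is stale and gets deleted. Sketch only.
        """
        directory = Path(directory)
        existing = {p.name for p in
                    directory.glob("id_*-src_*-rot_*-cnf_*-dim_*.jpg")}
        for name in existing - set(wanted):
            (directory / name).unlink()      # stale: delete
        for name in set(wanted) - existing:
            wanted[name](directory / name)   # missing: render
        # Files that are both wanted and present are untouched, which is
        # what makes a re-run after a crash cheap.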

Effects on other parts of the system:

  • The pub-s3 task can now simply compute a directory diff and upload and delete as necessary (see the sketch after this list).
  • Both localweb and pub-s3 are now resumable—if I kill the program and restart, the gallery DB can be recreated entirely and any remaining image diff can be resolved.
  • The gallery website will need to retrieve filenames along with other image metadata when displaying images.
  • The images themselves will not have stable URLs, but the ID can be found in the URL as a last resort. Redirect URLs could be provided to counteract this decay somewhat.
  • If I change my thumbnail sizing, watermarks (once that's implemented), or other image production settings, kpawebgen will automatically recreate images.
  • Bugs in sync code will be recoverable: re-running the diff repairs any drift.
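
For pub-s3, the same diff idea with boto3 might look like this (the bucket layout and prefix are made up; error handling omitted):

    import boto3
    from pathlib import Path

    def sync_to_s3(local_dir, bucket, prefix="gallery/"):
        """Mirror local sized images into S3 by diffing key listings."""
        s3 = boto3.client("s3")
        local = {p.name for p in Path(local_dir).glob("*.jpg")}
        remote = set()
        for page in s3.get_paginator("list_objects_v2").paginate(
                Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                remote.add(obj["Key"][len(prefix):])
        for name in local - remote:   # new locally: upload
            s3.upload_file(str(Path(local_dir) / name), bucket, prefix + name)
        for name in remote - local:   # gone locally: delete from bucket
            s3.delete_object(Bucket=bucket, Key=prefix + name)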

That's the plan! We'll see how it plays out.

Conclusion

It's a little early to say, but:

  • Just because you have some data (such as a changelog) doesn't mean you should use it.
  • Small problems (lack of image recreation on watermark changes) can point to larger ones (overall sync issues).
  • Sometimes you have to build the wrong thing before you can build the right thing. (Just hope that's before you release it!)

