Two problems become none

No truism is always true, not even this one. I recently clashed with two common conceptions in software engineering:

  • “All problems in computer science can be solved by another level of indirection.” – David Wheeler
  • “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” – Jamie Zawinski

The problem, in this case, is the heart of my little content management system, Utterson. As I discussed previously, I want all content to live in a Git repository, which is read and interpreted (later, also written) by the CMS. For example, I would create a blog using magic file extensions like this:

/development.blog/2010/08/28/some-interesting-post.markdown
    (Markdown-formatted post, first heading becoming the title)
/development.blog/2010/08/28/picture.jpg
    (picture included using Markdown image syntax)

Or a set of photo albums:

/photos.album/vacation-2010.markdown
    (album title and description)
/photos.album/vacation-2010/img5743.jpg
/photos.album/vacation-2010/img5744.jpg
    (photos, including a title and caption in their metadata)

The extensions on the files and directories tell Utterson how their content should be handled: a .blog extension turns its subdirectories into “blog-year” directories, a directory inside a “blog-year” directory becomes a “blog-month” directory, and so on; directories inside a .album directory automatically become photo (sub)albums themselves. The extension of a file indicates the type of the file’s contents (Markdown-formatted text, HTML, JPG, …).

So we’d identify an item by its “basename path”, being the path with all file extensions stripped. Each file and directory would have a set of “tags” associated with it. Tags can come either from the extension of the file/directory, or be computed from the parent’s tags. Furthermore, as the photo album example shows, sometimes we can have a file and a directory with the same name (apart from extensions). This should be treated as a unit: a directory containing content, or a file containing files.
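To make the “basename path” idea concrete, here is a minimal sketch of how a path could be split into its basename path and its per-component extension tags. The function names are made up for illustration, and the sketch assumes that everything after the first dot in a component is an extension and that no component starts with a dot; tags inherited from the parent are not modelled.

```python
def split_component(name):
    """Split one path component into (basename, [extensions])."""
    base, *extensions = name.split(".")
    return base, extensions


def basename_path(path):
    """Return the path with all extensions stripped from every component."""
    components = [c for c in path.split("/") if c]
    return "/" + "/".join(split_component(c)[0] for c in components)


def extension_tags(path):
    """Return the extension-derived tags, one list per path component."""
    components = [c for c in path.split("/") if c]
    return [split_component(c)[1] for c in components]


# A file and a directory that differ only in extensions share a basename path.
file_item = basename_path("/photos.album/vacation-2010.markdown")
dir_item = basename_path("/photos.album/vacation-2010")
```

Here both `file_item` and `dir_item` come out as `/photos/vacation-2010`, which is exactly the file/directory pair that should be treated as a unit.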

This seems like a neat abstraction. The distinction between files and directories vanishes, extensions vanish, and we’re left with a tree structure where each node has a basename, tags, maybe some data, and maybe child nodes. The frontend of Utterson is then free to interpret the contents of the tree without worrying about filesystem details: for example, parsing a file tagged markdown and rendering it as HTML, showing a JPG file inside an album using a nice browsing interface, but serving a JPG file inside a blog directly for inclusion in an HTML page. It is also very modular, allowing new handlers for tags like forum or wiki to be added later without affecting the base system.
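A minimal sketch of such a tree, under some loud assumptions: the `Node` class and `add_file` helper are hypothetical, tags from a file and a same-named directory are simply merged, and tags computed from the parent are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Node:
    basename: str
    tags: Set[str] = field(default_factory=set)
    data: Optional[bytes] = None
    children: Dict[str, "Node"] = field(default_factory=dict)

    def child(self, basename):
        """Return the child with this basename, creating it if needed."""
        return self.children.setdefault(basename, Node(basename))


def add_file(root, path, data):
    """Insert one file into the tree, turning extensions into tags."""
    node = root
    for component in path.strip("/").split("/"):
        base, *extensions = component.split(".")
        node = node.child(base)
        node.tags.update(extensions)  # assumption: merge, never replace
    node.data = data
    return node


root = Node("")
add_file(root, "/photos.album/vacation-2010.markdown", b"# Vacation 2010")
add_file(root, "/photos.album/vacation-2010/img5743.jpg", b"(jpeg bytes)")

# The album description and the photo directory end up in the same node.
album = root.children["photos"].children["vacation-2010"]
```

Building the tree from scratch like this is the easy part; the sketch deliberately says nothing about updating it incrementally.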

But when I tried to implement this abstraction, I ran into many annoying implementation details. Recall that we’re not translating a filesystem into the abstract tree; we’re processing Git commits here, so we must translate modifications to the filesystem into modifications of the tree. What happens if both the file and the directory have extensions – do we merge them, or pick one of them? What happens if the directory gets deleted (Git does not track directories directly)? Do we detect file renames, or view them as a deletion and an addition? What if one half of a file/directory pair gets deleted – do we update the tags, thereby forcing a recursive tag update onto all child items? How do we handle empty directories? What to do with symlinks? All in all, this quickly became a nightmare, and I wrote and rewrote code that never quite worked in all cases.

Yes, it would have been a nice abstraction, one that would have allowed me to split Utterson into three largely independent parts: tree representation, tree update, and tree interpretation. But it was just too much work, and at the end of the day, I need a CMS and I need it quickly. So this additional level of indirection was perhaps not the way to go.

Enter the almighty regular expressions. Django, the framework that I’m using as a basis, handles URLs using a mapping like this:

(r'^(?P<blog_name>blog)/'
 r'(?P<year>\d+)/(?P<month>\d+)/(?P<day>\d+)/'
 r'(?P<slug>[^/]*)',
    'blog.views.post')

It may look a bit scary, but this simply tells Django that a URL path of the form blog/yyyy/mm/dd/slug should be handled by the view function blog.views.post (which shows that particular post). The (?P<...>...) constructions assign names to certain parts of the URL, which automatically get passed as arguments to the view function. All the view needs to do is look up the post in the database by its year/month/day/slug and stuff it into a template.
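The named-group mechanism can be tried out with nothing but Python’s re module. In this sketch the `post` function is only a stand-in for a view: a real Django view also receives the request object as its first argument and would query the database and render a template.

```python
import re

URL_PATTERN = re.compile(
    r'^(?P<blog_name>blog)/'
    r'(?P<year>\d+)/(?P<month>\d+)/(?P<day>\d+)/'
    r'(?P<slug>[^/]*)'
)


def post(blog_name, year, month, day, slug):
    """Stand-in for the view function; just echoes its arguments."""
    return f"{blog_name}: {year}-{month}-{day} {slug}"


match = URL_PATTERN.match("blog/2010/08/28/some-interesting-post")
# groupdict() maps each group name to the matched text, so it can be
# unpacked straight into keyword arguments.
result = post(**match.groupdict())
```

Note that all captured values arrive as strings; converting the year, month, and day to integers is left to the view.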

Couldn’t I do something similar with regexes? Simply write a bunch of regexes that tell Utterson how to handle a particular file, based on its path? After all, if an established and widely used framework like Django does it this way, how bad can it be? So this is exactly what I did:

(r'(?P<blog_name>blog)/'
 r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})-'
 r'(?P<slug>.*)\.(md|markdown)$',
    'blog', 'Post')

This tells Utterson that a file named blog/yyyy-mm-dd-slug.markdown (or .md) should be treated as a Post object from the module blog. (As you can see, I ended up not using subdirectories for year/month/day.) The named parts of the regex get passed to the Post’s constructor, and Django takes care of the rest. The Post model itself is a pretty standard Django Model object, stored in an SQL database, except that it inherits from a model RepoFile. The RepoFile stores Git-related properties such as the filename, the SHA-1 of the file’s content blob, the last-modified date, and the author.
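How such a mapping might be consulted can be sketched in a few lines. The `PATHS` list and the `resolve` function are hypothetical names, not Utterson’s actual code, and the sketch stops at returning the module name, class name, and constructor arguments rather than importing and instantiating a Django model.

```python
import re

PATHS = [
    (r'(?P<blog_name>blog)/'
     r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})-'
     r'(?P<slug>.*)\.(md|markdown)$',
        'blog', 'Post'),
]


def resolve(path):
    """Return (module, class, kwargs) for the first matching pattern."""
    for pattern, module_name, class_name in PATHS:
        match = re.match(pattern, path)
        if match:
            # Only named groups are kept; the extension group is unnamed.
            return module_name, class_name, match.groupdict()
    return None  # no handler: the file is not content


handler = resolve("blog/2010-08-28-some-interesting-post.markdown")
```

Because the first match wins, the order of the entries matters once patterns start to overlap, just as it does in a Django URL configuration.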

To make the repository completely self-describing, I put this mapping into a Python module named _paths.py inside the repository itself. (The underscore is a convention that makes a file invisible to Utterson. This can be used to store draft posts, for example.)
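The underscore convention is easy to express as a filter. This is only a sketch: `is_visible` is a made-up name, and treating an underscore on any path component (not just the filename) as hiding everything below it is my assumption here.

```python
def is_visible(path):
    """Return False if any path component starts with an underscore."""
    return not any(
        part.startswith("_") for part in path.split("/") if part
    )


paths = [
    "_paths.py",
    "blog/2010-08-28-some-interesting-post.markdown",
    "blog/_2010-09-01-draft.markdown",
]
visible = [p for p in paths if is_visible(p)]
```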

This approach allowed me to implement the static pages and the blog in just one day, mostly using standard Django machinery. There is an appealing symmetry now: regexes map files to model objects, and regexes map URLs to model objects. Do I now have two problems? I don’t think so.