beets: our solution for the hell that is filename encoding, such as it is

By far, the worst part of working on beets is dealing with filenames. And since our job is to keep track of your files in the database, we have to deal with them all the time. This post describes the filename problems we discovered in the project’s early days, how we address them now, and some alternatives for the future.

What is a Path?

What would you say a path is? In Python terms, what should the type of the argument to open or os.listdir be?

Let’s say you think it should be text. The OS should tell us what encoding it’s using, and we get to treat its paths as human-readable strings. So the correct type is unicode on Python 2 or str on Python 3.

Here’s the thing, though: on Unixes, paths are fundamentally bytes. The arguments and return types of the standard POSIX OS interfaces open(2) and opendir(3) use C char* strings (because we still live in 1969).

This means that your operating system can, and does, lie about its filesystem encoding. As we discovered in the early days of beets, Linux systems often say their filesystem encoding is one thing and then give you bytes in a completely different encoding. The OS makes no attempt to avoid handing arbitrary bytes to applications. If you just call fn.decode(sys.getfilesystemencoding()) in an attempt to turn your paths into Unicode text, Python will sometimes raise a UnicodeDecodeError.
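To make the failure concrete, here is a minimal sketch. It assumes a filename that was written in Latin-1 on a system that claims UTF-8; the filename, the hard-coded encoding, and the try_decode helper are illustrative and not taken from beets.

```python
# A filename as the kernel might hand it to us: "café.mp3" encoded as Latin-1.
raw_name = b"caf\xe9.mp3"

# Assumption: the system claims UTF-8, which is what
# sys.getfilesystemencoding() typically reports on a modern Linux box.
claimed_encoding = "utf-8"


def try_decode(name, encoding):
    """Return the decoded name, or None if the bytes don't fit the encoding."""
    try:
        return name.decode(encoding)
    except UnicodeDecodeError:
        return None


# The lone 0xE9 byte is not valid UTF-8, so a plain decode fails here.
decoded = try_decode(raw_name, claimed_encoding)
```

Nothing stops a file with that name from existing on disk, so any code that decodes unconditionally has to be prepared for this case.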

So, we must conclude that paths are bytes. But here’s the other thing: on Windows, paths are fundamentally text. The equivalent interfaces on Windows accept and return wide character strings—and on Python, that means unicode objects. So our grand plan to use bytes as the one true path representation is foiled.

It gets worse: to use full-length paths on Windows, you need to prefix them with the four magical characters \\?\. Every time. I know.

This contradiction is the root of all evil. It’s the cause of a huge amount of fiddly boilerplate code in beets, a good number of bug reports, and a lot of sadness during our current port to Python 3.

Our Conventions

In beets, we adhere to a set of coding conventions that, when ruthlessly enforced, avoid potential problems.

First, we need one consistent type to store in our database. We picked bytes. This way, we can consistently represent Unix filesystems, and it requires only a bit of hacking to support Windows too. In particular, beets encodes all filenames using a fixed encoding (UTF-8) on Windows, just so that the path type is always bytes on all platforms.

To make this all work, we use three pervasive little utility functions:

We use bytestring_path to force all paths to our consistent representation. If you don’t know where a path came from, you can just pass it through bytestring_path to rectify it before proceeding.
The opposite function, displayable_path, must be used to format error messages and log output. It does its best to decode the path to human-readable Unicode text, and it’s not allowed to fail—but it’s lossy. The result is only good for human consumption, not for returning back to the OS. Hence the name, which is intentionally not unicode_path.
Every argument to an OS function like open or listdir must pass through the third utility: syspath. Think of this as converting from beets’s internal representation to the OS’s own representation. On Unix, this is a no-op: the representations are the same. On Windows, this converts a bytestring path back to Unicode and then adds the ridiculous \\?\ prefix, which avoids problems with long names.
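The three utilities can be sketched roughly as follows. This is a simplified illustration of the conventions described above, written for Python 3, and not the actual beets implementation: the function names come from this post, but the bodies, the _fs_encoding helper, and the WINDOWS_MAGIC_PREFIX constant are assumptions.

```python
import sys

WINDOWS_MAGIC_PREFIX = "\\\\?\\"  # the four characters \\?\


def _fs_encoding():
    """Hypothetical helper: fixed UTF-8 on Windows, OS-reported elsewhere."""
    if sys.platform == "win32":
        return "utf-8"
    return sys.getfilesystemencoding() or "utf-8"


def bytestring_path(path):
    """Coerce a path of unknown origin to the internal bytes form."""
    if isinstance(path, bytes):
        return path
    # Drop the Windows prefix so it never ends up in the database.
    if sys.platform == "win32" and path.startswith(WINDOWS_MAGIC_PREFIX):
        path = path[len(WINDOWS_MAGIC_PREFIX):]
    return path.encode(_fs_encoding(), "surrogateescape")


def displayable_path(path):
    """Decode a path for humans. Never raises, but may lose information."""
    if isinstance(path, str):
        return path
    return path.decode(_fs_encoding(), "replace")


def syspath(path):
    """Convert the internal form to what OS functions expect."""
    if sys.platform != "win32":
        return path  # no-op on Unix: the OS takes bytes directly
    if isinstance(path, bytes):
        path = path.decode("utf-8", "replace")
    # The prefix is only meaningful for absolute paths; a real
    # implementation would also need to handle UNC paths.
    if not path.startswith(WINDOWS_MAGIC_PREFIX):
        path = WINDOWS_MAGIC_PREFIX + path
    return path


internal = bytestring_path("music/song.mp3")  # bytes on every platform
label = displayable_path(internal)            # str, for logs and errors only
os_arg = syspath(internal)                    # what open() or os.listdir() would receive
```

The important property is the direction of each conversion: anything coming in goes through bytestring_path, anything shown to a person goes through displayable_path, and anything handed to the OS goes through syspath.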

It’s not fun to force everybody to use these utilities everywhere, but it does work. Since we instated this policy, Unicode-related bugs still happen, but they’re not nearly as pervasive as they were in the project’s early days.

Must It Be This Way?

Although our solution works, I won’t pretend to love it. Here are a few alternatives we might consider for the future.

Python 3’s Surrogate Escapes

Python 3 chose the opposite answer to the root-of-all-evil contradiction: paths are always Unicode strings, not bytes. It invented surrogate escapes to represent bytes that didn’t fit the platform’s purported filesystem encoding. This way, Python 3’s Unicode str can represent arbitrary bytes in filenames. (The first commit to beets happened a bit before Python 3.0 was released, so perhaps the project can be forgiven for not adopting this approach in the first place.)
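As a small illustration of the mechanism, the sketch below uses the surrogateescape error handler explicitly with UTF-8. The filename is made up. On Unix-like systems, os.fsdecode and os.fsencode apply the same handler using the filesystem encoding.

```python
# Latin-1 bytes for "café.mp3"; the 0xE9 byte is invalid as UTF-8.
raw_name = b"caf\xe9.mp3"

# Undecodable bytes are smuggled into the str as lone surrogates:
# byte 0xE9 becomes the code point U+DCE9.
as_text = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

The round trip is lossless, which is what lets a str stand in for an arbitrary byte path.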

A few lingering details still worry me about surrogate escapes:

Migrating people’s databases containing old bytes paths to surrogate-escaped strings won’t exactly be fun.
Might surrogate escapes tie us too much to the Python ecosystem? What happens when you try to send one of these paths to another non-Python tool that interacts with the same filesystem?
People in the Python community have misgivings about the current implementation of surrogate escapes. Nick Coghlan summarizes. We’ll need to investigate the nuances ourselves.
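The interoperability worry can be seen in a few lines. In this sketch (the filename and the to_strict_utf8 helper are hypothetical), a surrogate-escaped path cannot be written out as ordinary UTF-8, which is what most non-Python tools would expect to receive.

```python
# A path containing one byte (0xE9) that is not valid UTF-8.
escaped = b"caf\xe9.mp3".decode("utf-8", "surrogateescape")


def to_strict_utf8(text):
    """Encode as plain UTF-8, or return None if the text holds lone surrogates."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


# Fails: the escaped byte is a lone surrogate, which strict UTF-8 rejects.
strict_result = to_strict_utf8(escaped)

# Works only when both sides agree on the surrogateescape convention.
lenient_result = escaped.encode("utf-8", "surrogateescape")
```

Any boundary where such a string leaves Python (a text file, a JSON document, a subprocess argument built by hand) would need an explicit decision about how to handle the escaped bytes.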

Require UTF-8 Everywhere

One day, I believe we will live in a world where everything is UTF-8 all the time. We could hasten that glorious day by requiring that all paths be UTF-8 and either rejecting or fixing any other filenames as they come into beets. For now, though, this seems just a tad user-hostile for a program that works so closely with your files.
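For a sense of what rejecting or fixing could look like, here is a hedged sketch. Both function names are hypothetical, and the Latin-1 fallback is only a guess at the original encoding: when the guess is wrong, the result is a valid UTF-8 name that reads as mojibake.

```python
def is_valid_utf8(name):
    """Return True if the byte filename decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def fix_to_utf8(name, fallback_encoding="latin-1"):
    """Re-encode a non-UTF-8 filename, guessing its original encoding."""
    if is_valid_utf8(name):
        return name
    # Latin-1 maps every byte to a character, so this decode cannot fail,
    # but it may not reflect what the user actually meant.
    return name.decode(fallback_encoding).encode("utf-8")


fixed = fix_to_utf8(b"caf\xe9.mp3")
```

Applying the fix would also mean renaming the file on disk, which is where the user-hostility comes in.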

Pathlib

We could switch to Python 3’s pathlib module. We’d still need to choose a uniform representation for putting these paths into our database, though, and it’s not clear how well the Python 2 backport works. But we do have a ticket for it.
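One possible shape for this, sketched under the assumption that the database keeps storing bytes: pathlib objects are used in the code, and os.fsencode and os.fsdecode convert at the storage boundary. This requires Python 3.6 or later, where os.fsencode accepts path-like objects. The variable names and the example path are illustrative.

```python
import os
from pathlib import PurePosixPath

# Build the path with pathlib's operators instead of string concatenation.
path = PurePosixPath("music") / "artist" / "song.mp3"

# Convert to bytes for storage. On Unix-like systems this uses the
# filesystem encoding with the surrogateescape handler.
stored = os.fsencode(path)

# Convert back when loading from the database.
restored = PurePosixPath(os.fsdecode(stored))
```

This keeps the uniform bytes representation in the database while leaving open the question of what to do on Windows, where the concrete path class and the encoding rules differ.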

This post was published on June 4, 2016. Have comments or questions? Post on the discussion board, toot at @beets@fosstodon.org, or discuss on Hacker News.
