By far, the worst part of working on beets is dealing with filenames. And since our job is to keep track of your files in the database, we have to deal with them all the time. This post describes the filename problems we discovered in the project’s early days, how we address them now, and some alternatives for the future.
Let’s say you think it should be text. The OS should tell us what encoding it’s using, and we get to treat its paths as human-readable strings. So the correct type is
unicode on Python 2 or
str on Python 3.
Here’s the thing, though: on Unixes, paths are fundamentally bytes. The arguments and return types of the standard Posix OS interfaces
opendir(2) use C
char* strings (because we still live in 1969).
This means that your operating system can, and does, lie about its filesystem encoding. As we discovered in the early days of beets, Linuxes everywhere often say their filesystem encoding is one thing and then give you bytes in a completely different encoding. The OS makes no attempt to avoid handing arbitrary bytes to applications. If you just call
fn.decode(sys.getfilesystemencoding()) in attempt to make turn your paths into Unicode text, Python will crash sometimes.
So, we must conclude that paths are bytes. But here’s the other thing: on Windows, paths are fundamentally text. The equivalent interfaces on Windows accept and return wide character strings—and on Python, that means
unicode objects. So our grand plan to use bytes as the one true path representation is foiled.
It gets worse: to use full-length paths on Windows, you need to prefix them with the four magical characters
\\?\. Every time. I know.
This contradiction is the root of all evil. It’s the cause of a huge amount of fiddly boilerplate code in beets, a good number of bug reports, and a lot of sadness during our current port to Python 3.
In beets, we adhere to a set of coding conventions that, when ruthlessly enforced, avoid potential problems.
First, we need one consistent type to store in our database. We picked bytes. This way, we can consistently represent Unix filesystems, and it requires only a bit of hacking to support Windows too. In particular, beets encodes all filenames using a fixed encoding (UTF-8) on Windows, just so that the path type is always bytes on all platforms.
To make this all work, we use three pervasive little utility functions:
bytestring_pathto force all paths to our consistent representation. If you don’t know where a path came from, you can just pass it through
bytestring_pathto rectify it before proceeding.
displayable_path, must be used to format error messages and log output. It does its best to decode the path to human-readable Unicode text, and it’s not allowed to fail—but it’s lossy. The result is only good for human consumption, not for returning back to the OS. Hence the name, which is intentionally not
listdirmust pass through the third utility:
syspath. Think of this as converting from beets’s internal representation to the OS’s own representation. On Unix, this is a no-op: the representations are the same. On Windows, this converts a bytestring path back to Unicode and then adds the ridiculous
\\?\prefix, which avoids problems with long names.
It’s not fun to force everybody to use these utilities everywhere, but it does work. Since we instated this policy, Unicode-related bugs still happen, but they’re not nearly as pervasive as they were in the project’s early days.
Although our solution works, I won’t pretend to love it. Here are a few alternatives we might consider for the future.
Python 3 chose the opposite answer to the root-of-all-evil contradiction: paths are always Unicode strings, not bytes. It invented surrogate escapes to represent bytes that didn’t fit the platform’s purported filesystem encoding. This way, Python 3’s Unicode
str can represent arbitrary bytes in filenames. (The first commit to beets happened a bit before Python 3.0 was released, so perhaps the project can be forgiven for not adopting this approach in the first place.)
A few lingering details still worry me about surrogate escapes:
One day, I believe we will live in a world where everything is UTF-8 all the time. We could hasten that glorious day by requiring that all paths be UTF-8 and either rejecting or fixing any other filenames as they come into beets. For now, though, this seems a just tad user-hostile for a program that works so closely with your files.
We could switch to Python 3’s pathlib module. We’d still need to choose a uniform representation for putting these paths into our database, though, and it’s not clear how well the Python 2 backport works. But we do have a ticket for it.