beets: flexible attributes, in which schemaless and schemaful coexist peacefully

Since the beginning, beets has been conceptually built around an object–field data model. Your library is made up of item objects, which represent audio files, and each item has a bunch of fields corresponding to tags on those files: “artist,” “title,” and “year,” for example.

Over the years, beets has grown to support a long list of fields: by my count, we currently support 59. We’ve used an informal policy of adding almost every reasonable field that a user requests. But even this liberal approach can be too constricting: adding a new field at least requires getting the attention of a developer and then waiting for another beets release to roll around.

But what if users could add any field they could think of, without needing to rely on changes to beets itself? Decoupling the list of fields from the beets schema would enable awesome new use cases, great new plugins that aren’t possible today, and even simplify the development of some oft-requested features.

Flexible Attributes

The next version of beets, 1.3, overhauls the internals to support flexible attributes (affectionately known as flexattrs). You’re no longer constrained to the fields that beets ships with: you can now store information in any field name at all. For example, maybe you’d like to start classifying your music based on its mood. Even though beets itself doesn’t use a mood field, you can set it on your music using the modify command:

$ beet modify mood=sexy artist:miguel

Then, later, if you want to see all your music whose mood is “sexy,” just use the list command:

$ beet ls mood:sexy

This is exciting enough on its own (you can classify music based on any criteria you can think of), but it also means many more possibilities for plugins in the future. (Here are a few ideas.)

How it Works

I procrastinated for a long time in implementing flexible attributes because a clean, maintainable implementation was surprisingly tricky to get right. (This post is going to get a little nerdy from here on out; if you’re interested in the technical details, read on.)

As background, beets libraries are stored in SQLite databases. The schema is simple: there’s an items table and an albums table, and each field is a column on these tables. To support flexattrs, I initially considered switching to a schema-free database: abusing SQLite as a key–value store, serializing the whole database to JSON, LevelDB, or the like. With the help of other developers, however, I eventually realized that we could tack on flexible attributes alongside the existing fixed attributes and minimize the disruption.

Here’s what it looks like from a developer’s perspective. If you write:

item.artist = 'Rhye'

beets updates the fixed artist attribute, which is backed by an SQLite column. If you make up your own field name:

item.mood = 'ethereal'

then beets detects that you’re defining a flexattr and updates a separate key–value table instead. The idea here is that code that accesses fixed or flexible attributes should look identical. That way, we can prototype code using flexible attributes and eventually promote the attribute to a built-in field with zero changes to the client code. Three cheers for sound abstractions!
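To make the idea concrete, here is a minimal sketch of how one attribute interface can sit on top of both kinds of storage. It is not the actual beets code: the trimmed-down schema, the `item_attributes` table layout, the `FIXED_FIELDS` tuple, and the toy `Item` class with its `store` method are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical, trimmed-down schema; a real library has many more columns.
SCHEMA = """
CREATE TABLE items (id INTEGER PRIMARY KEY, artist TEXT, title TEXT);
CREATE TABLE item_attributes (
    entity_id INTEGER,
    key TEXT,
    value TEXT,
    UNIQUE (entity_id, key) ON CONFLICT REPLACE
);
"""

# Assumed list of fixed fields, each backed by a column on the items table.
FIXED_FIELDS = ("artist", "title")


class Item:
    """Toy model: fixed fields and flexattrs share one attribute interface."""

    def __init__(self, **values):
        # Bypass our own __setattr__ while creating the internal storage.
        object.__setattr__(self, "id", None)
        object.__setattr__(self, "_fixed", dict.fromkeys(FIXED_FIELDS))
        object.__setattr__(self, "_flex", {})
        for key, value in values.items():
            setattr(self, key, value)

    def __setattr__(self, key, value):
        # Known names go to the column-backed dict; anything else is a flexattr.
        target = self._fixed if key in FIXED_FIELDS else self._flex
        target[key] = value

    def __getattr__(self, key):
        # Only called when normal lookup fails, which is the case for field names.
        if key.startswith("_"):
            raise AttributeError(key)
        for source in (self._fixed, self._flex):
            if key in source:
                return source[key]
        raise AttributeError(key)

    def store(self, conn):
        """Write fixed fields to their columns and flexattrs to the key-value table."""
        cur = conn.execute(
            "INSERT INTO items (artist, title) VALUES (?, ?)",
            (self._fixed["artist"], self._fixed["title"]),
        )
        object.__setattr__(self, "id", cur.lastrowid)
        conn.executemany(
            "INSERT INTO item_attributes (entity_id, key, value) VALUES (?, ?, ?)",
            [(self.id, key, value) for key, value in self._flex.items()],
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

item = Item(artist="Rhye")
item.mood = "ethereal"  # not a fixed field, so it lands in the key-value table
item.store(conn)
```

The calling code never says which kind of attribute it is touching, so moving `mood` into `FIXED_FIELDS` later (and adding the column) would not change the two assignment lines at the bottom.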

Queries, Fast and Slow

Things get slightly more complicated when querying. For example, to query the database for songs by Miguel, beets constructs a SQL query like:

SELECT * FROM items WHERE artist = 'Miguel';

This query can be fast because it’s performed entirely by the native SQLite library; no Python is involved. But to match on a flexible attribute, we’d need to generate a complex query with a JOIN on the flexattr table. This gets especially hairy when wildcard-matching on any field or using a special query type like regex queries or fuzzy queries. To avoid this complexity, beets instead evaluates queries involving flexattrs in Python. It fetches every item in the database and “manually” checks the query against each row. This is probably slower than the SQLite query used for purely fixed-attribute queries, but it’s probably not much slower, if at all, than the complex JOIN on the flexattr table that would otherwise be required.
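For comparison, here is a rough sketch of the two options for a single exact-match flexattr query. The table layout, the sample rows, and the `load_items` helper are assumptions made up for the example, and the JOIN shown is the simplest possible case; wildcard, regex, and fuzzy matching would make it considerably messier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Hypothetical schema and sample rows, for illustration only.
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, artist TEXT, title TEXT);
CREATE TABLE item_attributes (entity_id INTEGER, key TEXT, value TEXT);
INSERT INTO items VALUES (1, 'Miguel', 'Adorn');
INSERT INTO items VALUES (2, 'Rhye', 'Open');
INSERT INTO item_attributes VALUES (1, 'mood', 'sexy');
INSERT INTO item_attributes VALUES (2, 'mood', 'ethereal');
""")

# Option 1: push the flexattr match into SQL with a JOIN.
join_rows = conn.execute(
    """
    SELECT items.* FROM items
    JOIN item_attributes AS attr ON attr.entity_id = items.id
    WHERE attr.key = ? AND attr.value = ?
    """,
    ("mood", "sexy"),
).fetchall()


# Option 2: fetch every item, attach its flexattrs, and check in Python.
def load_items(conn):
    """Return all items as dicts with fixed and flexible attributes merged."""
    items = {row["id"]: dict(row) for row in conn.execute("SELECT * FROM items")}
    for row in conn.execute("SELECT entity_id, key, value FROM item_attributes"):
        items[row["entity_id"]][row["key"]] = row["value"]
    return list(items.values())


python_rows = [item for item in load_items(conn) if item.get("mood") == "sexy"]
```

Both paths select the same item here. The Python path does more work per row, but the matching logic is ordinary Python and does not change shape as the query types get more exotic.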

To support flexattr queries, beets now classifies all queries into fast and slow queries. As always, Query objects have a clause() method that returns the SQLite WHERE expression that evaluates them and a match(item) method that returns a boolean indicating whether a particular item passes the query. With the flexattr changes, the clause() method can now return None, indicating that the query is slow and cannot be evaluated in SQLite. The library then falls back to one-at-a-time predication.
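The fast/slow protocol can be sketched roughly as follows. This is an illustration under assumptions, not the beets implementation: the `FieldQuery` class, the `fetch` and `load_items` helpers, the `FIXED_FIELDS` set, and the choice to have `clause()` return an expression paired with its parameters are all hypothetical.

```python
import sqlite3

# Assumed set of fixed fields; everything else is treated as a flexattr.
FIXED_FIELDS = {"artist", "title"}


class FieldQuery:
    """Exact match on one field, fixed or flexible."""

    def __init__(self, field, value):
        self.field = field
        self.value = value

    def clause(self):
        """Return (WHERE expression, parameters), or None if the query is slow."""
        if self.field in FIXED_FIELDS:
            # Safe to interpolate: the name was checked against FIXED_FIELDS.
            return f"{self.field} = ?", (self.value,)
        return None

    def match(self, item):
        """Return True if the item (a dict here) passes the query."""
        return item.get(self.field) == self.value


def load_items(conn):
    """Return all items as dicts with fixed and flexible attributes merged."""
    items = {row["id"]: dict(row) for row in conn.execute("SELECT * FROM items")}
    for row in conn.execute("SELECT entity_id, key, value FROM item_attributes"):
        items[row["entity_id"]][row["key"]] = row["value"]
    return list(items.values())


def fetch(conn, query):
    """Use SQLite when the query is fast; otherwise test items one at a time."""
    clause = query.clause()
    if clause is not None:
        where, params = clause
        rows = conn.execute(f"SELECT * FROM items WHERE {where}", params)
        return [dict(row) for row in rows]
    return [item for item in load_items(conn) if query.match(item)]


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Hypothetical schema and sample rows, for illustration only.
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, artist TEXT, title TEXT);
CREATE TABLE item_attributes (entity_id INTEGER, key TEXT, value TEXT);
INSERT INTO items VALUES (1, 'Miguel', 'Adorn');
INSERT INTO items VALUES (2, 'Rhye', 'Open');
INSERT INTO item_attributes VALUES (1, 'mood', 'sexy');
INSERT INTO item_attributes VALUES (2, 'mood', 'ethereal');
""")

fast_result = fetch(conn, FieldQuery("artist", "Miguel"))   # evaluated by SQLite
slow_result = fetch(conn, FieldQuery("mood", "ethereal"))   # evaluated in Python
```

The caller of `fetch` does not need to know which path was taken; the `None` return from `clause()` is the only signal that separates the two.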

This fast/slow distinction also had the knock-on effect of simplifying the way beets handles “special” queries like regexes and fuzzy matching. These query types can’t easily be written in SQLite, so they’re implemented as Python predicates. Previously, to shoehorn everything into a SQLite query, we had to register a Python function as a callback and then name the function in the clause() result. Crossing the Python/C barrier is not great for performance, so sticking to SQLite queries here was likely not buying us much. Now, any query involving a complex predicate implemented in Python is just evaluated using the slow path—eliminating the need for callback registration.
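To illustrate the difference, the sketch below runs the same regex match both ways using Python's standard `sqlite3` module. The table, the sample artists, and the `regexp` helper are assumptions for the example; the callback half shows the general technique of registering a Python function with SQLite, not the exact code beets used.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and sample rows, for illustration only.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, artist TEXT)")
conn.executemany(
    "INSERT INTO items (artist) VALUES (?)",
    [("Miguel",), ("Rhye",), ("Migos",)],
)


def regexp(pattern, value):
    """Return True if the pattern matches anywhere in the value."""
    return value is not None and re.search(pattern, value) is not None


# Callback style: SQLite evaluates "X REGEXP Y" by calling a user-defined
# function named regexp with the arguments (Y, X), so every row crosses the
# C/Python boundary.
conn.create_function("REGEXP", 2, regexp)
via_callback = [
    row[0]
    for row in conn.execute(
        "SELECT artist FROM items WHERE artist REGEXP ? ORDER BY id", ("^Mig",)
    )
]

# Slow-path style: fetch the rows and apply the same predicate in Python,
# with no function registration at all.
via_python = [
    row[0]
    for row in conn.execute("SELECT artist FROM items ORDER BY id")
    if regexp("^Mig", row[0])
]
```

Both versions end up calling the same Python function once per row, which is why dropping the callback registration costs little.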

Next Steps

I’m happy with the overall design for flexible attributes, but there’s a laundry list of things we can improve over time:

Ideally, it should be easy to “promote” a flexible attribute to a built-in fixed attribute. This almost works already, but we’ll need a way to transparently move the data from the flexattr tables to the main item and album tables. We can cross this bridge when some field eventually needs migration.
A query that matches on both a fixed attribute and a flexible attribute, like artist:miguel mood:sexy, is currently classified as “slow”—any slow component makes the whole query slow. This could be optimized by matching the fast component in SQLite and then filtering on the slow component in Python. In general, we need to measure the performance of this feature to see whether optimizations like this are necessary.
I’m not ruling out an eventual move to a completely schema-free design, perhaps using a simple key–value store like LevelDB or a putative SQLite4. Eliminating the distinction between the two types of attributes will pay off in simplicity.
Beets needs a system for specifying value types. Currently, we have an ad-hoc data type classification that, for example, knows to print bitrates in kbps and track numbers with zero padding. We need a more principled way to specify conversions to and from strings. At the same time, we should let plugins specify this type information for attributes they define.
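The mixed-query optimization from the list above (match the fast component in SQLite, then filter on the slow component in Python) could look something like this. It is only a sketch under assumptions: the `hybrid_fetch` and `split_terms` helpers, the schema, and the sample rows are hypothetical, and it uses exact matching rather than the looser matching a real query like artist:miguel would perform.

```python
import sqlite3

# Assumed set of fixed fields; everything else is treated as a flexattr.
FIXED_FIELDS = {"artist", "title"}


def split_terms(terms):
    """Split a {field: value} query into its fast and slow components."""
    fast = {f: v for f, v in terms.items() if f in FIXED_FIELDS}
    slow = {f: v for f, v in terms.items() if f not in FIXED_FIELDS}
    return fast, slow


def hybrid_fetch(conn, terms):
    """Narrow the candidates in SQLite, then check flexattrs in Python."""
    fast, slow = split_terms(terms)
    # Field names are safe to interpolate: they were checked against FIXED_FIELDS.
    where = " AND ".join(f"{field} = ?" for field in fast) or "1"
    candidates = conn.execute(
        f"SELECT * FROM items WHERE {where} ORDER BY id", tuple(fast.values())
    ).fetchall()

    results = []
    for row in candidates:
        item = dict(row)
        attrs = conn.execute(
            "SELECT key, value FROM item_attributes WHERE entity_id = ?",
            (item["id"],),
        )
        item.update({attr["key"]: attr["value"] for attr in attrs})
        if all(item.get(field) == value for field, value in slow.items()):
            results.append(item)
    return results


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Hypothetical schema and sample rows, for illustration only.
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, artist TEXT, title TEXT);
CREATE TABLE item_attributes (entity_id INTEGER, key TEXT, value TEXT);
INSERT INTO items VALUES (1, 'Miguel', 'Adorn');
INSERT INTO items VALUES (2, 'Miguel', 'Use Me');
INSERT INTO items VALUES (3, 'Rhye', 'Open');
INSERT INTO item_attributes VALUES (1, 'mood', 'sexy');
INSERT INTO item_attributes VALUES (2, 'mood', 'moody');
INSERT INTO item_attributes VALUES (3, 'mood', 'sexy');
""")

matches = hybrid_fetch(conn, {"artist": "Miguel", "mood": "sexy"})
```

Only the two Miguel rows reach the Python filter in this example; whether that saving matters in practice is exactly the kind of thing that would need measuring.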

This post was published on September 2, 2013.
