DM-29196 raises the prospect of our first post-schema-stability migration, and we have a few other tickets/projects waiting in the wings that will require migrations as well. At the same time, we've only just finished setting up our first long-lived shared data repositories for the rest of the DM team, and we need to be sure that our migration plan does not disrupt the active work using those repositories or abandon the datasets in them. This ticket attempts to gather use cases and other design constraints for putting together that migration system.
I have no idea how much of this Alembic might solve.
Changing dataset IDs from autoincrement integers to UUIDs is almost a maximal schema change. It changes the type of a major primary key / foreign key column, and hence requires new values to be created and used consistently while converting the tables.
It's also accompanied by a change in the meaning of some of those values (i.e. the dataset IDs for at least some raw files, which we now want to be deterministic), with that meaning defined downstream of daf_butler, in obs_base; ideally that would mean that at least some of the migration script content would be defined outside daf_butler, too. This is probably best as a Python script, since that's where we ultimately plan to generate UUIDs.
Because this change is controlled by a new Registry Manager that is set in configuration, backwards compatibility (new code reading old repos) is not a problem.
Forward compatibility (old code reading new repos) is a problem, but one that we are not required to address. Forward compatibility is one reason - of many, I think - to add this functionality well before we make it the default for new repos; hopefully by the time we create stable repos with this new Manager, software versions that don't have it will be old enough that we don't care much about it.
Fully addressing this ticket will require adding some new indexes and probably some foreign keys to existing tables (in this case, "dynamic" tables created when dataset types are registered).
This should be fully forwards- and backwards-compatible; new code should continue to be able to use old repos without the indexes/FKs (without benefiting from them, of course), and old code will be able to use new repos that contain them. It should be recorded as a patch version bump in the dataset Manager class.
However, we will still need a migration script - old repos definitely want these new indexes. Because these are dynamic tables, we'll probably want this to be a Python script so we can centralize the logic for how to find these tables and reason about their columns, even though a pure-SQL script would be viable for any one table.
This migration is also complicated by the fact that it's a version bump to the Manager class that
DM-29196 will ultimately replace, and it's just as applicable to the new manager introduced there. So we'll probably want to bump the versions of both manager classes (depending on the order in which these tickets are done), and think a bit about how we might order the migrations, which will not be orthogonal (the script to add the indexes will depend on the Manager class, albeit trivially).
It's worth adding that this is a very high-priority change - it's a big performance bottleneck in processing against the stable repos right now, and our first guess at another big slowdown in general queries. And it's unclear how much a Python-only fix that doesn't add any indexes can do (though I'll certainly start with that).
This is a straightforward dimension change, in which a new dimension ("amplifier") is added and no existing dimensions are modified. It's completely covered by a configuration change, and hence we get complete forwards and backwards compatibility (at least back to the version at which we declared the repository format stable).
We will still want a migration script - old repos will want the new dimension. That migration script will need to create the new tables, populate them for all instruments already registered (that's more migration script content that belongs downstream of daf_butler), and update the dimension configuration that's saved in the database accordingly.
This adds a new column to a single static table. It's only used for behavior we could consider optional, so this would ideally be something we could represent via a patch version bump to the two collection Manager classes, declaring it forwards and backwards compatible.
But whether that's actually possible depends on what happens when our current code:
- connects to a database that has an extra column in a table;
- tries to insert into the database while omitting that column.
I think we could define a default value for the column to make the latter work, but I just don't know about the former.
The migration script situation here is complicated slightly by the fact that the change applies to two Manager classes (though the same script might work for both).
This is a low-priority ticket, but possibly a good "otherwise-simple" ticket that could be used as a pathfinder for migration tooling.
I think we need to have a clear place to put any script that migrates between two different repository formats version, where the format version incorporates a number of other versions (e.g. via a hash digest):
- overall daf_butler schema version
- manager names and versions
- dimension configuration
I worry a bit that this doesn't adequately capture migrations that are partially or only about content rather than schema - e.g. the algorithm for generating UUIDs for raws - but maybe it's best to assume those are always accompanied by an actual schema version change for now, or just not use the same organizational structure for those that aren't.
We absolutely would not try to create a script for any pair of format versions, and at this point I don't think we make any promises about which pairs we would write a script for, except that we will write the scripts necessary to migrate the stable NCSA and IDF (so far) repositories. Maybe we could also promise to provide scripts that upgrade along the (linear) path of the daf_butler configuration defaults.
I think there are a few options for how to have some migration content in obs_base (or somewhere else downstream of daf_butler; I'll assume obs_base for simplicity now), but none of them seem super great. We could have a parallel tree of migration scripts there that could extend (via something like inheritance) the daf_butler ones. We could have configuration in the daf_butler ones that need to be populated with strings that resolve to Python classes or configuration resources in obs_base. Or we could have daf_butler just provide library code for organizing migrations and writing migration scripts, with the only real tree of concrete migration scripts existing in obs_base.
I think fundamentally we want this to be a Python callable or class that's given only the URI to the data repository and goes from there.
Once we have a Python API, we can wrap it up in whatever command-line tooling we think is useful.
Migration scripts probably need to be configurable sometimes, but we don't want to demand that users provide too much extra information to perform a migration - this is more about being able to demand information from a downstream package like obs_base.
I think we want all migrations to work on a single data repository in-place both because we probably do want to perform small migrations in place, and because out-of-place migrations are probably best expressed by some efficient, backend-specific copy (e.g PostgreSQL database dump and GPFS hard-links) followed by the in-place migration.
Migration scripts should be able to assume that they are the only thing operating on that data repository, and they don't need to work in a way that's guaranteed to be atomic; if necessary, we'll schedule downtime for in order to run them. But being able to work atomically on a live repository without disrupting others is a nice feature to have, and we should have a way to at least document the migrations for which that is the case.