# Design migration system for data repositories


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Team: Data Access and Database
• Urgent?: No

#### Description

DM-29196 raises the prospect of our first post-schema-stability migration, and we have a few other tickets/projects waiting in the wings that will require migrations as well. At the same time, we've only just finished setting up our first long-lived shared data repositories for the rest of the DM team, and we need to be sure that our migration plan does not disrupt the active work using those repositories or abandon the datasets in them. This ticket attempts to gather use cases and other design constraints for putting together that migration system.

I have no idea how much of this Alembic might solve.

## Use cases

### DM-29196

Changing dataset IDs from autoincrement integers to UUIDs is almost a maximal schema change. It changes the type of a major primary key / foreign key column, and hence requires new values to be created and used consistently while converting the tables.

It's also accompanied by a change in the meaning of some of those values (i.e. the dataset IDs for at least some raw files, which we now want to be deterministic), with that meaning defined downstream of daf_butler, in obs_base; ideally that would mean that at least some of the migration script content would be defined outside daf_butler, too. This is probably best as a Python script, since that's where we ultimately plan to generate UUIDs.
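To make the "deterministic" part concrete, here is a minimal sketch of how reproducible dataset IDs could be derived with `uuid.uuid5`; the namespace UUID, the key layout, and the function name are all illustrative assumptions, not the actual daf_butler/obs_base scheme:

```python
import uuid

# Hypothetical namespace; the real one would be defined in daf_butler/obs_base.
RAW_NAMESPACE = uuid.UUID("840b31d9-05cd-5161-b2c8-00d32b280d0f")


def deterministic_dataset_id(dataset_type: str, data_id: dict) -> uuid.UUID:
    """Derive a reproducible UUID from a dataset type name and a data ID.

    uuid5 hashes its inputs, so the same raw always maps to the same ID no
    matter which repository (or migration run) generates it.
    """
    key = dataset_type + ";" + ",".join(f"{k}={data_id[k]}" for k in sorted(data_id))
    return uuid.uuid5(RAW_NAMESPACE, key)


# The same inputs always yield the same UUID:
a = deterministic_dataset_id("raw", {"instrument": "LSSTCam", "exposure": 42, "detector": 9})
b = deterministic_dataset_id("raw", {"instrument": "LSSTCam", "exposure": 42, "detector": 9})
assert a == b
```

This is exactly the property a migration script needs: it can regenerate the same IDs that future ingests would produce.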

Because this change is controlled by a new Registry Manager that is set in configuration, backwards compatibility (new code reading old repos) is not a problem.
Forward compatibility (old code reading new repos) is a problem, but one that we are not required to address. Forward compatibility is one reason - of many, I think - to add this functionality well before we make it the default for new repos; hopefully by the time we create stable repos with this new Manager, software versions that don't have it will be old enough that we don't care much about it.

### DM-29562

Fully addressing this ticket will require adding some new indexes and probably some foreign keys to existing tables (in this case, "dynamic" tables created when dataset types are registered).

This should be fully forwards- and backwards-compatible; new code should continue to be able to use old repos without the indexes/FKs (without benefiting from them, of course), and old code will be able to use new repos that contain them. It should be recorded as a patch version bump in the dataset Manager class.

However, we will still need a migration script - old repos definitely want these new indexes. Because these are dynamic tables, we'll probably want this to be a Python script so we can centralize the logic for how to find these tables and reason about their columns, even though a pure-SQL script would be viable for any one table.
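The "find the dynamic tables, then add indexes" logic might look like the following sketch. It is written against SQLite's catalog purely so it is self-contained; a real script would use SQLAlchemy reflection or, better, daf_butler's own manager code to enumerate these tables, and the indexed column is an illustrative guess:

```python
import re
import sqlite3


def add_indexes_to_dynamic_tables(db: sqlite3.Connection) -> None:
    """Find the dynamic dataset_tags_* / dataset_calibs_* tables and add an
    index on dataset_id to each one (column choice is illustrative)."""
    pattern = re.compile(r"dataset_(tags|calibs)_\d{8}")
    tables = [row[0] for row in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for name in tables:
        if pattern.fullmatch(name):
            # Idempotent: safe to re-run on a partially migrated repo.
            db.execute(f'CREATE INDEX IF NOT EXISTS "{name}_dataset_id_idx" '
                       f'ON "{name}" (dataset_id)')
```

Centralizing the table-name pattern in one Python function is the main point: a pure-SQL script would have to hard-code every dynamic table name.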

This migration is also complicated by the fact that it's a version bump to the Manager class that DM-29196 will ultimately replace, and it's just as applicable to the new manager introduced there. So we'll probably want to bump the versions of both manager classes (depending on the order in which these tickets are done), and think a bit about how we might order the migrations, which will not be orthogonal (the script to add the indexes will depend on the Manager class, albeit trivially).

It's worth adding that this is a very high-priority change - it's a big performance bottleneck in processing against the stable repos right now, and our first guess at another big slowdown in general queries. And it's unclear how much a Python-only fix that doesn't add any indexes can do (though I'll certainly start with that).

### DM-29584

This is a straightforward dimension change, in which a new dimension ("amplifier") is added and no existing dimensions are modified. It's completely covered by a configuration change, and hence we get complete forwards and backwards compatibility (at least back to the version at which we declared the repository format stable).

We will still want a migration script - old repos will want the new dimension. That migration script will need to create the new tables, populate them for all instruments already registered (that's more migration script content that belongs downstream of daf_butler), and update the dimension configuration that's saved in the database accordingly.

### DM-29585

This adds a new column to a single static table. It's only used for behavior we could consider optional, so this would ideally be something we could represent via a patch version bump to the two collection Manager classes, declaring it forwards and backwards compatible.

But whether that's actually possible depends on what happens when our current code:

• connects to a database that has an extra column in a table;
• tries to insert into the database while omitting that column.

I think we could define a default value for the column to make the latter work, but I just don't know about the former.
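At the plain-SQL level, at least, both halves of the question have a known answer, sketched below (the `extra_doc` column name is hypothetical). Whether daf_butler's SQLAlchemy-level table definitions tolerate the extra column on the reading side is the real open question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "New-schema" table: the hypothetical extra column carries a default, so
# inserts that omit it still succeed.
conn.execute(
    "CREATE TABLE collection (id INTEGER PRIMARY KEY, name TEXT, "
    "extra_doc TEXT DEFAULT '')"
)
# Old code that only knows about (id, name) can still insert:
conn.execute("INSERT INTO collection (id, name) VALUES (1, 'runs')")
# Old code that selects columns by name is unaffected by the extra column:
row = conn.execute("SELECT id, name FROM collection").fetchone()
assert row == (1, "runs")
# The omitted column was filled in by the server-side default:
assert conn.execute("SELECT extra_doc FROM collection").fetchone() == ("",)
```

So the database itself is not the obstacle; any breakage would come from code that reflects the schema and checks it strictly.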

The migration script situation here is complicated slightly by the fact that the change applies to two Manager classes (though the same script might work for both).

This is a low-priority ticket, but possibly a good "otherwise-simple" ticket that could be used as a pathfinder for migration tooling.

## Organizing Migration Scripts

I think we need to have a clear place to put any script that migrates between two different repository format versions, where the format version incorporates a number of other versions (e.g. via a hash digest):

• overall daf_butler schema version
• manager names and versions
• dimension configuration

I worry a bit that this doesn't adequately capture migrations that are partially or only about content rather than schema - e.g. the algorithm for generating UUIDs for raws - but maybe it's best to assume those are always accompanied by an actual schema version change for now, or just not use the same organizational structure for those that aren't.

We absolutely would not try to create a script for any pair of format versions, and at this point I don't think we make any promises about which pairs we would write a script for, except that we will write the scripts necessary to migrate the stable NCSA and IDF (so far) repositories. Maybe we could also promise to provide scripts that upgrade along the (linear) path of the daf_butler configuration defaults.
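The "hash digest over component versions" idea might look like this sketch; the payload keys and digest length are assumptions, and the point is only that any change to a manager name/version or to the dimension configuration produces a new format version:

```python
import hashlib
import json


def repository_format_version(manager_versions: dict, dimension_config: dict) -> str:
    """Combine manager names/versions and the dimension configuration into a
    single stable digest (illustrative layout)."""
    payload = json.dumps(
        {"managers": manager_versions, "dimensions": dimension_config},
        sort_keys=True,  # stable serialization -> stable digest
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


v1 = repository_format_version(
    {"datasets": "ByDimensionsDatasetRecordStorageManager 1.0.0"}, {"version": 1}
)
v2 = repository_format_version(
    {"datasets": "ByDimensionsDatasetRecordStorageManagerUUID 1.0.0"}, {"version": 1}
)
assert v1 != v2  # any manager change yields a new format version
```

A migration script would then be addressed by an ordered pair of such digests.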

I think there are a few options for how to have some migration content in obs_base (or somewhere else downstream of daf_butler; I'll assume obs_base for simplicity now), but none of them seem super great. We could have a parallel tree of migration scripts there that could extend (via something like inheritance) the daf_butler ones. We could have configuration in the daf_butler ones that needs to be populated with strings that resolve to Python classes or configuration resources in obs_base. Or we could have daf_butler just provide library code for organizing migrations and writing migration scripts, with the only real tree of concrete migration scripts existing in obs_base.

## What a Migration Script Is

I think fundamentally we want this to be a Python callable or class that's given only the URI to the data repository and goes from there.

Once we have a Python API, we can wrap it up in whatever command-line tooling we think is useful.

Migration scripts probably need to be configurable sometimes, but we don't want to demand that users provide too much extra information to perform a migration - this is more about being able to demand information from a downstream package like obs_base.

I think we want all migrations to work on a single data repository in-place, both because we probably do want to perform small migrations in place, and because out-of-place migrations are probably best expressed by some efficient, backend-specific copy (e.g. a PostgreSQL database dump and GPFS hard links) followed by the in-place migration.

Migration scripts should be able to assume that they are the only thing operating on that data repository, and they don't need to work in a way that's guaranteed to be atomic; if necessary, we'll schedule downtime in order to run them. But being able to work atomically on a live repository without disrupting others is a nice feature to have, and we should have a way to at least document the migrations for which that is the case.
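Putting the pieces of this section together, the Python API might look something like the sketch below. Every name here is illustrative, not a proposed daf_butler interface; it just shows "callable given only the repository URI" plus the metadata discussed above (format versions, atomicity):

```python
from abc import ABC, abstractmethod


class MigrationScript(ABC):
    """Sketch of the interface described above: a migration is a Python
    object that is given only the data repository URI and goes from there."""

    from_version: str      # repository format version this script starts from
    to_version: str        # repository format version it produces
    atomic: bool = False   # safe to run on a live repository?

    @abstractmethod
    def run(self, repo_uri: str) -> None:
        """Perform the migration in place on the repository at repo_uri."""


class AddAmplifierDimension(MigrationScript):
    """Hypothetical concrete script for the DM-29584 use case."""

    from_version = "digest-aaaa0000"
    to_version = "digest-bbbb1111"

    def run(self, repo_uri: str) -> None:
        # A real implementation would connect to the registry database behind
        # repo_uri, create the new dimension tables, populate them for all
        # registered instruments, and update the stored dimension config.
        raise NotImplementedError("sketch only")
```

Command-line tooling would then just discover subclasses, match `from_version` against the repository's current format version, and call `run`.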

#### Activity

Jim Bosch created issue -
Link This issue blocks DM-29562 [ DM-29562 ]
 Link This issue blocks DM-29584 [ DM-29584 ]
 Link This issue blocks DM-29585 [ DM-29585 ]
 Link This issue is triggered by DM-29196 [ DM-29196 ]
Tim Jenness added a comment -

It looks like the UUID code is assuming that raw ingest will not be calculating UUIDs itself but will be asking registry to calculate them from the dataId and possibly run name.

Tim Jenness added a comment -

From the sounds of it, UUID is needed for lightweight registry work, which means that our test repos at IDF need to be converted to UUID before we can do the panda large-scale tests.

Tim Jenness added a comment -

Are we abandoning alembic?

Andy Salnikov added a comment -

Tim is right, DM-29196 does not allow externally-generated UUIDs; all knowledge about how to make a UUID out of the necessary dimensions is in the manager classes. I thought it would be useful to have this hidden from the rest of the world, to have better control over what can or cannot be used as an ID.

I think alembic could still be useful as a tool for organizing and running migration scripts; implementing that part of the functionality ourselves may be possible, but we'll probably end up with something similar to alembic. We also have a migration tool in QServ, but it is rather trivial and very limited in what it can do. One thing that I would like to see in alembic is support for "external" versions: right now it keeps track of versions in its own separate table, but we also have our own butler_attributes table with essentially the same information.

Jim Bosch added a comment - edited

> It looks like the UUID code is assuming that raw ingest will not be calculating UUIDs itself but will be asking registry to calculate them from the dataId and possibly run name.

> Tim is right, DM-29196 does not allow externally-generated UUIDs, all knowledge about how to make UUID out of necessary dimensions is in manager classes. I thought it would be useful to have this hidden from the rest of the world to have better control of what can or cannot be used as ID.

That's all fine with me, but I think the knowledge about which datasets get the deterministic-UUID treatment is still a downstream-of-butler question.  It sounds like it may make the answer to that question strictly configuration data (e.g. some DatasetType+instrument tuples) rather than logic, though, which may indeed help with how to organize migrations (at least in this case).

> From the sounds of it, UUID is needed for lightweight registry work, which means that our test repos at IDF need to be converted to UUID before we can do the panda large-scale tests.

There's no doubt it will make it easier.  But if we need to, we could do this without UUIDs.  The missing pieces are:

• The Butler.import_ logic would have to export its internal mapping of old->new dataset_id values (this is easy) so Nate Lust's new code could use it (I can't imagine this is that hard). And while it might not be functionality we'd want to keep forever, I think it's good functionality to have until we have fully deprecated and removed support for all existing integer-dataset_id repos and export files. That doesn't seem like it's right around the corner.
• We need to write the code that imports the processing-output datasets back into the shared data repository database in a way that supports dataset_id-rewriting.  We don't have code that does this even without dataset_id-rewriting yet, and if we wrote the first version using YAML import/export (something that seems reasonable now, but maybe not a good idea in the long term), we'd get that for free.

In fact, the more I think about it, the more I think we should not block lightweight registry batch work on UUIDs - we don't have to, and I don't want to have to rush the migration work.

> I think alembic could still be useful as a tool for organizing and running migration scripts, implementing that part of functionality ourselves may be possible but we'll probably end up with something similar to alembic.

To be clear, I really did mean that I didn't know whether Alembic would be useful, not that I was doubtful it would be.  I'm more than happy to leave this question to Andy; I think it really boils down to how much we have to fight Alembic in order to adapt it to our case; there's no doubt if we went our own way we'd end up reinventing a lot of the same stuff.

Tim Jenness added a comment -

The reason we said that UUID migration blocks lightweight registry is because that's what Nate Lust said to us last week. At the very least he was saying that UUIDs significantly simplify the code he is working on.

Tim Jenness added a comment - edited

In standup today we agreed that on DM-29196 ticket Andy Salnikov would add the ability for integer dataset ids to be preserved on import for empty registries.

Andy Salnikov added a comment -

Tim Jenness, I am working on that on DM-29196, if that is what you mean by "this ticket".

 Link This issue relates to DM-29765 [ DM-29765 ]
Andy Salnikov added a comment -

Jim Bosch, is your concern about DM-29562 being high priority still valid? I think you managed to resolve that issue without any schema change, so we can probably ignore that use case completely for now?

Status: To Do → In Progress
Jim Bosch added a comment -

DM-29562 is indeed a lower priority. New indexes are still a good idea but don't seem to be critical, and I'll make a new ticket for that.

Andy Salnikov added a comment - edited

Brief summary of our current schema and who controls its definition.

I believe that all schema knowledge is encapsulated in a small bunch of manager classes which are reasonably independent of each other and replaceable. Obviously some dependency between managers is necessary to make references between parts of the schema. In the current implementation these dependencies are reflected as foreign keys in the tables pointing to keys in other tables. This of course means that a change in the keys used by one manager needs to be propagated to all dependent managers. Managers are replaceable at the level of the Python API, but that does not mean they have to be compatible at the level of the database schema, so replacing a manager in general needs a schema migration. Even if two implementations are 100% compatible w.r.t. database schema, the configuration (butler_attributes table) needs to be updated when switching between manager implementations. I think this implies that butler_attributes contents should be part of any schema migration; the minimum possible migration would consist of just changing the manager name and/or its version number in the butler_attributes table, e.g.:

    --- before
    -[ RECORD 1 ]---------------------------------------------------------------
    name  | config:registry.managers.attributes
    value | lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager
    -[ RECORD 2 ]---------------------------------------------------------------
    name  | version:lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager
    value | 1.0.0
    -[ RECORD 3 ]---------------------------------------------------------------
    name  | schema_digest:lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager
    value | 664d6a56d87b5ac890308a91a06cd145

    --- after
    -[ RECORD 1 ]---------------------------------------------------------------
    name  | config:registry.managers.attributes
    value | lsst.daf.butler.registry.attributes.BetterButlerAttributeManager
    -[ RECORD 2 ]---------------------------------------------------------------
    name  | version:lsst.daf.butler.registry.attributes.BetterButlerAttributeManager
    value | 0.0.1
    -[ RECORD 3 ]---------------------------------------------------------------
    name  | schema_digest:lsst.daf.butler.registry.attributes.BetterButlerAttributeManager
    value | 79a657af5cf15550e6d1f455ad4dd8c2
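That minimal migration can be written down directly; here is a sketch against a toy butler_attributes table (BetterButlerAttributeManager is the illustrative placeholder from the records above, and sqlite3 stands in for the real database backend):

```python
import sqlite3

OLD = "lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager"
NEW = "lsst.daf.butler.registry.attributes.BetterButlerAttributeManager"


def switch_attributes_manager(db: sqlite3.Connection) -> None:
    """Minimal migration: only the manager name, version, and digest rows
    in butler_attributes change; no other schema is touched."""
    db.execute(
        "UPDATE butler_attributes SET value = ? "
        "WHERE name = 'config:registry.managers.attributes'",
        (NEW,),
    )
    # The version and digest rows are keyed by class name, so replace them.
    for prefix, value in (("version:", "0.0.1"),
                          ("schema_digest:", "79a657af5cf15550e6d1f455ad4dd8c2")):
        db.execute("DELETE FROM butler_attributes WHERE name = ?", (prefix + OLD,))
        db.execute("INSERT INTO butler_attributes (name, value) VALUES (?, ?)",
                   (prefix + NEW, value))


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE butler_attributes (name TEXT PRIMARY KEY, value TEXT)")
db.executemany("INSERT INTO butler_attributes VALUES (?, ?)", [
    ("config:registry.managers.attributes", OLD),
    ("version:" + OLD, "1.0.0"),
    ("schema_digest:" + OLD, "664d6a56d87b5ac890308a91a06cd145"),
])
switch_attributes_manager(db)
rows = dict(db.execute("SELECT name, value FROM butler_attributes"))
assert rows["config:registry.managers.attributes"] == NEW
```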

Our current list of managers:

• attributes
  • supposed to have a single super-stable implementation, because it is used to store the configured list of managers and their versions
  • any potential schema change would probably be handled by a mechanism entirely different from the schema migration used by other managers
  • current schema consists of a single simple table, butler_attributes
• collections
  • manages collection names and collection relations
  • there are two implementations (NameKeyCollectionManager and SynthIntKeyCollectionManager)
  • three static tables (same for each implementation): collection, run, and collection_chain
  • the PK type in the collection table differs between the two implementations
• dimensions
  • manages all tables with dimension records and their relations, overlaps, etc.
  • there is just one implementation of this manager
  • all of this is based on the dimension configuration
  • the dimension configuration is also saved in the butler_attributes table under the name "config:dimensions.json"
• datasets
  • manages the tables storing dataset information
  • two implementations, which differ in the type of the PK/dataset_id column
  • there is a small number of static tables: dataset_type and dataset; the dataset table has FKs to the dataset_type and run tables
  • there are also dynamic tables created for each unique combination of dimensions, with names like dataset_tags_00000000 and dataset_calibs_00000001; these have FKs to tables belonging to the dimensions manager
  • there are also a few dynamic collection_summary_* tables (I think they are defined for governor dimensions) and a collection_summary_dataset_type table
• datastores
  • has "ephemeral" and "monolithic" implementations
  • ephemeral keeps things in memory and does not define database tables
  • the monolithic implementation has two tables, dataset_location and dataset_location_trash, with the same schema, but the former has an FK to the dataset table and the latter has no FK
  • these two tables map dataset_id to a datastore name (something like FileDatastore@<butlerRoot>)
• opaque
  • single implementation, manages arbitrary tables
  • one static table, opaque_meta, which just lists the names of the dynamic tables
  • FileDatastore uses this to create table(s) to keep dataset records; currently there is one table, file_datastore_records
  • there is no way to associate the opaque manager version/name with the schema of the dynamic tables
  • the best we can do with this manager is to say that its version (and class name) only control the static table schema
  • if we need to support schema migration of dynamic tables using the same approach, we need an additional mechanism to identify the version(s) of those dynamic tables
Andy Salnikov added a comment -

I think we can reduce the management complexity of migrations if we limit the scope of a single migration to a single manager. If we allow more or less independent evolution of managers, then we are going to face a combinatorial problem of many possible combinations of manager versions that could exist. Keeping migrations independent should work better just because our managers are also independent of each other. Identifying migration scripts/steps would also be easier in this case; the identification basically becomes the name of the manager and its versions. So for example we could define a migration step which performs migration from "version" datasets-ByDimensionsDatasetRecordStorageManager-1.0.0 to datasets-ByDimensionsDatasetRecordStorageManager-1.1.0, or another step which migrates the same "version" to datasets-ByDimensionsDatasetRecordStorageManagerUUID-1.0.0 (from non-UUID to UUID).
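As a sketch of what per-manager step identification buys, a registry of migration steps can then be keyed by nothing more than a pair of those version strings (the step bodies here are placeholders):

```python
# Per-manager migration steps, keyed by (from-version, to-version) within one
# manager "tree", using the naming convention from the comment above.
MIGRATION_STEPS = {
    ("datasets-ByDimensionsDatasetRecordStorageManager-1.0.0",
     "datasets-ByDimensionsDatasetRecordStorageManager-1.1.0"):
        lambda: "minor schema tweak",
    ("datasets-ByDimensionsDatasetRecordStorageManager-1.0.0",
     "datasets-ByDimensionsDatasetRecordStorageManagerUUID-1.0.0"):
        lambda: "convert integer dataset_id to UUID",
}


def find_step(current: str, target: str):
    """Look up the single-manager step between two versions.  Because each
    step touches exactly one manager, there is no combinatorial explosion
    across managers: each manager's version history is searched on its own."""
    if (current, target) not in MIGRATION_STEPS:
        raise LookupError(f"no migration from {current!r} to {target!r}")
    return MIGRATION_STEPS[(current, target)]
```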

With this approach the schema will have multiple versions associated with it, one version per manager, as reflected already in the butler_attributes table. I'm not sure that we could actually use those long strings above as versions; that would depend on the implementation (alembic or something else), but it may be beneficial to have a human-readable format (even if it's not strictly ordered) instead of some hash string.

The schema already has dependencies between managers (and their managed schemas), and these dependencies exist at the level of foreign keys in the schema. Certainly, when an update to one manager's schema changes the type of a key or the key values, it has to also update the FKs that reference it. This sort of implies that more than one manager is involved in the migration, but in reality it does not have to be so. If we limit one-step schema changes to changes in one manager only and keep the other managers intact (excluding FKs, of course), then we can write a migration script that only needs to change a single manager version. I can imagine that it is possible to update code in a way that requires schema version changes for more than one manager, but we could forbid that sort of change by policy decision (even though it may complicate development in rare cases).

Andy Salnikov added a comment -

The dimensions manager gives me some trouble as I think about it more. Much of its schema and contents is determined by configuration, which is more dynamic than I want it to be. We could potentially encode configuration changes into a version (patch number), but that is only meaningful if we also manage those numbers centrally, i.e. we require all those version numbers to exist in a git repo. That, I guess, is too complicated for simple use cases where people just want an extra dimension in their standalone repository (but also want to be able to merge other schema changes into that repository). The schema currently has a couple of static tables and many more dynamic tables. The schema of the static tables is controlled by the manager (StaticDimensionRecordStorageManager), but the schema of the dynamic tables and the contents of the static tables are determined by the configuration (and the manager version, of course).

One possible way out of this is to declare the dynamic part of the dimensions schema to be a separately-versioned schema, with the schema "number" derived from the dimension configuration; it could be some sort of stable hash. The complication here is that the dynamic schema is also determined by the manager version. If we include the manager version in the dynamic version number, then a migration script could only be defined for one single manager version. If we don't include the manager version in the dynamic version, then a migration script can be used with multiple manager versions, but it then has to be smart and know which actual manager versions it can work with. It might work to some extent if we only include the major version number of the manager in the dynamic version number; then we only need to care about minor-version compatibility in the migration scripts.
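The "stable hash" idea could be as simple as hashing a canonical serialization of the dimension configuration. A minimal sketch, assuming the configuration is available as a plain dict (the function name is hypothetical; the real config is YAML-backed):

```python
import hashlib
import json

def dimensions_config_digest(config: dict) -> str:
    """Stable digest usable as a 'dynamic schema version' (sketch)."""
    # Canonical JSON (sorted keys, fixed separators) makes the digest
    # independent of key order and formatting in the source config.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Two logically identical configs produce the same version string.
a = {"version": 1, "elements": ["instrument", "detector"]}
b = {"elements": ["instrument", "detector"], "version": 1}
assert dimensions_config_digest(a) == dimensions_config_digest(b)
```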

Andy Salnikov added a comment -

Alternatively, for the dynamic dimensions schema we could say that schema migration does not support it at all. Instead we could provide a separate application which allows one to add new dimensions (I imagine we don't want to remove dimensions) to the configuration, for the current version of the manager only. I think I like this approach better, but I don't know if it adequately covers all use cases, e.g. how it will work with export/import.

Andy Salnikov added a comment -

Thinking about it a bit more and reading the Alembic docs, I think I have a more or less complete model that could work. I try to think in general terms, but Alembic is also what I have in mind, because implementation would be simpler if we can map our model to what Alembic can do.

Here are some highlights:

• registry managers will be mostly independent from each other; they will be the basic units of version management in this system (sort of a separate repository in git-speak); there is no need for a "global" schema version.
• the database will keep the database-current implementation class name and its version; this is what we already have in the butler_attributes table; we may need to reflect those versions into Alembic versions if Alembic is the tool that we want to use
• different implementations of the same manager will behave more or less like (git) branches; there is a well-defined version order within a branch, but branches are independent (and never merged)
• we can provide scripts that switch a database from one branch to another for a small subset of versions (this is not the same as a git branch merge, more like git checkout, except that you can only do it for a particular set of versions on the two branches)
• for the same branch it may make sense to provide migration scripts for each new version on that branch, migrating it from the previous version on the same branch
• possible migration paths will be determined by the configuration and/or migration scripts, more or less like in Alembic (or using the Alembic mechanism); they are not stored in the database
• management of the dimensions schema is limited to what is relevant to the manager version; the actual dimension configuration is controlled separately, outside of schema management (e.g. by an application that can add new dimensions)

Some examples of what this could mean:

• the datasets manager will map into a datasets "unit"; that unit has two branches: datasets-ByDimensionsDatasetRecordStorageManager and datasets-ByDimensionsDatasetRecordStorageManagerUUID
• version names on those two branches will look like datasets-ByDimensionsDatasetRecordStorageManagerUUID-1.0.0
• database will only know about one current branch and version
• migration scripts will know about all supported versions and how to migrate/switch between versions and/or branches

I want to avoid a direct dependency on the alembic command but still use it internally for managing and applying migrations. I'm not sure which package is the best location for the scripts and command; I'd propose a separate package in case it needs extra dependencies compared to daf_butler.

I want to start by mapping those concepts to Alembic to see what is needed to implement this using Alembic tools. I want to build a separate CLI for that; extending butler is probably not the right thing to do for such a rarely-used tool (but I can be convinced). I looked superficially at the Alembic code, and it looks like we can't get rid of the Alembic version table entirely (to avoid duplication with our butler_attributes table) unless we spend some time redesigning things in Alembic and convincing the Alembic maintainers to merge that. Still, it may be possible to use the same long names for Alembic versions instead of cryptic hashes.

Jim Bosch added a comment -

On opaque table managers:

if we need to support schema migration of dynamic tables using the same approach we need an additional mechanism to identify version(s) of those dynamic tables

At present the only relevant dynamic schema definitions are determined entirely by the Datastore classes used in the repository, so it might make sense to consider these part of a Datastore migration system instead of a Registry one, if we want to keep those separate for migration purposes. I don't think we have any changes to these tables planned for the near future, except for the fact that they do have dataset ID columns that need to be changed to a UUID type for the DM-29196 migration.

I think we can reduce management complexity of migrations if we limit the scope of a single migration to a single manager. If we allow more or less independent evolution of managers then we are going to face a combinatorial problem of many possible combinations of manager versions that could exist.

I agree with this, but I think we should also say that a migration script migrates a particular manager from one version (or class) to another, while holding the others constant at a particular set of classes/versions. In other words, we should not expect a migration script to work on arbitrary (but still held-constant) versions of other managers. And we can copy scripts or call library code as necessary to share code within different migration scripts that only differ in the classes/versions of what is held constant, because they will usually be very similar if not the same.

One possible way out of this is to declare dynamic part of the dimensions schema to be a separately-versioned schema with the schema "number" derived from dimension configuration, could be some sort of stable hash. Complication here is that dynamic schema is also determined by manager version. If we include manager version into dynamic version number then migration script could only be defined for one single manager version. If we don't include manager version into dynamic version then migration script can be used with multiple manager versions but it then has be smart and know what actual manager version it can work with. It might work to some extent if we only include major version number of manager into dynamic version number, then we only need to care about minor version compatibility in the migration scripts.

:+1: to making the dynamic part of the dimensions schema (based on the stable hash idea) a separately-versioned schema that's a sibling to the managers. I think we should not include the manager version in the dimensions version number; as in my last comment, I think all migration scripts need to hold everything else constant at particular versions, so a migration that holds the dimension manager constant while evolving only dimension config (or vice versa) is just one case of that.

I suspect most nontrivial migrations of the dimensions manager will be version changes to support new features we would then use in config, so we would migrate the manager first and then migrate the dimensions config in a separate step.

An alternative to using a stable hash might be to have a name in the dimensions config as well as a version number and use those together. Those would behave sort of like manager names and versions in your git/branch analogy.

Alternatively for the dynamic dimensions schema we could say that schema migration does not support it at all. Instead we could provide a separate application which allows one to add new dimensions (I imagine we don't want to remove dimensions) to the configuration for the current version of the manager only. I think I like this approach better, but I don't know if this covers adequately all use cases, e.g. how will it work with export/import.

I like some of this idea as a usual-simple case, because I think most dimension changes will be pure additions, and those will be easier to deal with. But I worry about relying on it exclusively, as that could leave us in bad shape if we ever need to do more than that. Adding the dimension config version as a new independent contributor to the overall schema version (if we have one), but not (yet) trying to design a migration system that includes it, might be a good balance.

Andy Salnikov added a comment -

Jim, thanks for the comments! One aspect of the whole thing is how we identify migrations. In Alembic, and I think in general, each migration script/step is identified by its initial and final revisions, so by definition there can be only one (or zero) script for each pair of initial/final revisions (of a particular manager). Potentially there could be a few multi-step migrations between a pair of revisions; I do not know yet how Alembic handles that. With that in mind, here are a few more comments.

we should also say that a migration script migrates a particular manager from one version (or class) to another, while holding the others constant at a particular set of classes/versions. In other words, we should not expect a migration script to work on arbitrary (but still held-constant) versions of other managers.

I do agree that any migration script that updates one manager has to keep the versions of other managers intact. For me it also means that a particular migration script which depends on the schema of other managers needs to support all reasonable versions of those managers, just because there cannot be two migration scripts for a pair of revisions of one manager. I think it could work if we allow updates of the scripts as we update our code. As an example, suppose that we created a migration script for the "datasets" manager from version 1.0.0 (I don't care about the manager class here) to 2.0.0, and at that time the "collections" manager was at version 2.0.0. The "collections" manager then went through a few minor revisions that did not require a change in migration logic, but at version 3.0.0 the migration script (for the same datasets versions 1.0.0 to 2.0.0) needed to do something extra. In that case we should extend the existing script and add a case that also handles "collections" version 3.0.0 in addition to 2.0.0. I think it may be safer to add some sort of protection to the initial copy of a migration script, to abort if the versions of other managers differ in major version number from the versions at which the script was initially written. And I hope that we won't have many such cases.
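The "abort if other managers moved" protection could be a small guard at the top of each migration script. A sketch (the manager names and versions here are hypothetical; a real script would read the current versions from the butler_attributes table):

```python
# Versions of other managers this migration script was written against
# (hypothetical values, for illustration only).
WRITTEN_AGAINST = {"collections": "2.0.0", "dimensions": "6.0.0"}

def check_held_constant(current_versions: dict) -> None:
    """Abort if any other manager differs in major version number from
    the versions this script was verified with."""
    for manager, expected in WRITTEN_AGAINST.items():
        have = current_versions.get(manager)
        if have is None or have.split(".")[0] != expected.split(".")[0]:
            raise RuntimeError(
                f"Migration not verified for {manager}={have}; "
                f"written against {expected}"
            )

# Minor-version differences are tolerated; major differences abort.
check_held_constant({"collections": "2.3.1", "dimensions": "6.0.2"})
```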

As for dimensions, I think I have managed to convince myself that doing it with the same mechanism as schema migration is not going to work well. For a migration one needs both initial and final revisions (version numbers or hashes) plus a migration script. If you as a user just want to add an extra dimension, with this approach I guess the only option is to contact the developers, provide your current dimensions config and new dimensions config, and then wait for the developers to provide you with a migration script. That may also mean that this new migration cannot be part of the centrally-managed migrations, because it could be specific to just one repository. I think it would be more user-friendly if we allowed easy addition of dimensions without extra steps. There are of course complications with that, for example what should be done if a user makes a mistake and wants to revert the changes, so it may not be so easy after all. Still, I think it would be better not to require central coordination (via migration scripts) for that sort of update.

Jim Bosch added a comment -

I'm okay with having each script being responsible for "all reasonable versions" of other managers, as long as we can find some way to track which versions of other managers it is expected to or known to work on.  Because these scripts will be rarely and carefully run, just having a convention for humans to record that information in documentation or comments is okay, but it might be worth thinking of a way to make the information a bit more structured and rigorous.

And I'm okay with proceeding with your plan to keep dimensions separate.  If it doesn't work out and we have to include dimensions more centrally later, I don't think it will be any harder to do then, and maybe we will never have to.

Andy Salnikov added a comment - - edited

I was messing with Alembic for the last few days; here is a brief summary of what I learned and some ideas for how to organize things using Alembic.

### General comments on Alembic

• Alembic has relatively advanced revision management, sort of similar to git, with some support for branching and merging.
• The "commit" in Alembic is a migration script that modifies the database schema/contents in arbitrary user-defined ways.
• Simple migrations can be auto-generated from SQLAlchemy metadata and the current database schema; for more advanced cases (probably 100% of our cases) migrations have to be coded manually.
• Names of revisions in Alembic can be reasonably arbitrary strings; if one does not specify a new revision name, Alembic generates a 12-character hex string (the last 12 characters of the hex representation of a UUID4). User-provided revision names cannot contain -, +, or @.
• Revisions can be associated with "branch labels", which propagate to each revision on a branch, from the branch point to the "head" of the branch. Branch labels can contain characters forbidden in revision names.
• Alembic has some configurable freedom in the locations where migration scripts are found; this can be useful for managing unrelated subsets of the schema.
• Ordinarily Alembic is configured via an INI file, but it can also be configured programmatically; there is a reasonable API that allows hiding all operations and configuration so that the setup can be tailored to our specific needs.
• Alembic needs to keep the current schema revisions in an alembic_version table in the database. It is not possible to avoid this feature even if we keep our current versions in another location and can derive Alembic revisions from that. (Potentially we could try to convince the Alembic developers that this could be generally useful and make a PR for it.)
• Alembic only keeps the database-current revision name in the database itself; all history is derived from the relations between migration scripts, based on the metadata inside the scripts.
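For reference, the INI-based configuration mentioned above is quite small. A minimal alembic.ini fragment might look like this (the path and URL are placeholders; in our case they would come from butler configuration instead):

```ini
[alembic]
# Directory containing env.py and the versions/ tree of migration scripts.
script_location = migrations

# Database to migrate (placeholder URL).
sqlalchemy.url = postgresql://user:pass@host/dbname
```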

With disjoint independent schemas, as we have in our case with the different managers, there are two options for the Alembic setup:

• Completely independent Alembic "repositories" with a single revision tree in each (potentially with branches).
• One repository with multiple unrelated trees in it.

Logically both options look identical, but the latter also allows introducing dependencies between revisions, which may be of potential use to us.

### Organizing migrations

Based on above here are some ideas for organizing our migrations with Alembic.

• I tried to see if using human-readable revision identifiers would help us, but the result does not look good; revision names like "datasets-ByDimensionsDatasetRecordStorageManagerUUID-1.0.0" are just too long to be readable. I prefer to use a shorter hash of the above names, which makes the revision names reproducible (based on UUID5 instead of the UUID4 used by Alembic).
• I want to have several independent revision trees, one tree per manager, with the tree name corresponding to the manager name ("datasets", "collections", etc.). Each tree's "root" revision will represent an empty schema; its revision name will be derived from the tree/manager name, and it will have a branch label associated with it which is the tree name.
• Starting from the root of a tree there will be branches (one or more) corresponding to implementation classes of the manager; the class name without the module name will be used to label the branch (prefixed by the manager name to keep it unique). Revision names will be derived from the manager name, class name, and version string. For example, a revision identifier will be generated from a string like "datasets-ByDimensionsDatasetRecordStorageManager-1.0.0", and the corresponding label for the whole branch will be "datasets-ByDimensionsDatasetRecordStorageManager".
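The UUID5-based naming above can be sketched in a few lines (the namespace constant and function name are hypothetical; a real implementation would pin its own):

```python
import uuid

# Hypothetical namespace; a real tool would fix its own constant so that
# revision ids stay stable across releases.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "daf_butler_smig")

def revision_id(name: str) -> str:
    """Reproducible 12-character revision id for a version string like
    'datasets-ByDimensionsDatasetRecordStorageManager-1.0.0'."""
    # UUID5 is deterministic for a given name, unlike the random
    # UUID4-derived names Alembic generates by default.
    return uuid.uuid5(NAMESPACE, name).hex[-12:]

rev = revision_id("datasets-ByDimensionsDatasetRecordStorageManager-1.0.0")
assert rev == revision_id("datasets-ByDimensionsDatasetRecordStorageManager-1.0.0")
```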

An example of a tree with just one branch for dimensions manager:

```
(branch=dimensions)
059cc7b7ef13 --- c2671338a42f --- 4f4f243fea82 --- 2dea9d903dde --- ...
                 (branch=dimensions-StaticDimensionRecordStorageManager)
```

The first node above is the root; its name is derived from the string "dimensions". The following revisions correspond to versions 0.1.0, 0.2.0, 1.2.0, and so on, for StaticDimensionRecordStorageManager.

An example of a tree with two branches for the datasets manager:

```
               (branch=datasets-ByDimensionsDatasetRecordStorageManager)
                +------- 97325e2242c6 --- d6c57b37f896 --- 08ffc6ecf21f --- 635083325f20
               /
  059cc7b7ef13   (branch=datasets)
               \
                +------- 2101fbf51ad3   (branch=datasets-ByDimensionsDatasetRecordStorageManagerUUID)
```

The top branch corresponds to versions 0.0.1, 0.0.2, 0.0.3, and 1.0.0 of ByDimensionsDatasetRecordStorageManager, and the bottom branch has one node for ByDimensionsDatasetRecordStorageManagerUUID version 1.0.0.

I believe this can adequately capture the linear history of versions for our manager classes. Merging of branches is possible in Alembic, but I do not think we need that functionality for our use case. It is also possible to specify a logical dependency between revisions from different trees, which may become useful for complex migrations.
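For context, an Alembic migration script carries its position in the revision tree as module-level attributes, and cross-tree dependencies use the depends_on attribute. A non-runnable template sketch (the revision hashes and the table/column in upgrade() are made up for illustration):

```python
"""Skeleton of an Alembic migration script (illustrative values only)."""
import sqlalchemy as sa
from alembic import op

# Position of this revision in the tree (hypothetical hashes).
revision = "635083325f20"
down_revision = "08ffc6ecf21f"
branch_labels = None
# Optional logical dependency on a revision in another tree.
depends_on = None

def upgrade():
    # Hand-written schema change; auto-generation won't cover our cases.
    op.add_column("dataset", sa.Column("ingest_date", sa.TIMESTAMP()))

def downgrade():
    op.drop_column("dataset", "ingest_date")
```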

### One-shot migrations

What is not possible with the above approach is the use case that we need to implement first: migration between branches of the same tree. In our case we want to switch from ByDimensionsDatasetRecordStorageManager-1.0.0 to ByDimensionsDatasetRecordStorageManagerUUID-1.0.0. In git language this would actually be a simple checkout of an existing revision, and it is really simple to implement by destroying the schema and re-creating it from scratch, but unfortunately we also have data that need to be preserved.

So our starting point is (replacing hashes with meaningful names):

```
           +------> int-0.0.1 --> int-0.0.2 --> int-0.0.3 --> int-1.0.0
          /
  datasets
          \
           +------> uuid-1.0.0
```

It is not possible to give the same revision two different "parents". One option to solve this would be an artificial "merge" which creates a new revision uuid-1.0.1 identical to uuid-1.0.0:

```
           +------> int-0.0.1 --> int-0.0.2 --> int-0.0.3 --> int-1.0.0
          /                                                      |
  datasets                        +------------------------------+
          \                       v
           +------> uuid-1.0.0 --> uuid-1.0.1
```

but this causes more complications of various kinds. It would be much better to avoid creating artificial revisions.

One solution that I could think of is to create special "one-shot" migration trees that are normally invisible (so they don't confuse Alembic) but can be activated with a special option.

An example of such one-shot migration tree could be (actual case will use hashes instead of names of course):

 one-shot-int-1.0.0-to-uuid-1.0.0 --> int-1.0.0 --> uuid-1.0.0 

The root of the tree is an "empty" migration; it is needed to identify the tree. The "migration" from the root to int-1.0.0 is also an empty one, but the last step involves an actual migration from an existing revision on one branch to an existing revision on another branch, and that does not change any "official" history.

### Implementation

I played with implementing the above ideas by creating a separate CLI (as butler sub-commands in a separate package). It seems to work reasonably at the Alembic level, though I have not produced any actual schema migration yet. Here are some examples of commands that I already have.

Create new revision tree (and root node of that tree):

 butler smig-add-tree [--one-shot] tree-name 

Show existing trees:

 butler smig-trees [-v] 

Add a new revision:

 butler smig-revision [--one-shot] tree-name ManagerClassName X.Y.Z 

Show revision history:

 butler smig-history [--one-shot] [tree-name] [-v] 

(The --one-shot option enables the creation and use of special one-shot migration trees; tree-name is the manager name.)

All the above commands work with the history located in migration scripts, which are by default found in $DAF_BUTLER_SMIG_DIR/migrations. The database part of the migration process needs to be implemented and tested, of course; I think I have a reasonable idea of how to do that. After that I'm going to try to write an actual migration script for the UUID migration. I think it makes sense to keep all of this as a separate package; there is a complicated issue with dependencies on other packages and even on daf_butler (which we can discuss later). For now I have it in a separate package daf_butler_smig, but we probably need a better name for it.
Ordinary Alembic is configured via INI file, but it can also be configured programmatically, there is a reasonable API that allows to hide all operations and configuration so that setup can be tailored to our specific needs. Alembic needs to keep current schema revisions in alembic_version table in the database. It is not possible to avoid this feature even if we keep our current versions in other location and can derive alembic revisions from that. (Potentially we could try to convince Alembic developer that it could be generally useful and make PR for that). Alembic only keep database-current revision name in the database itself, all history is derived from the relations between migration scripts based on the metadata inside the scripts. With disjoint independent schemas like we have in our case with different managers there are two options for alembic setup: Completely independent Alembic "repositories" with a single revision tree in each (potentially with branches). One repository with multiple unrelated trees in it. Logically both options look identical, but latter option also allows introducing dependencies between revision which may be of potential use for us. Organizing migrations Based on above here are some ideas for organizing our migrations with Alembic. I tried to see if using human-readable revision identifiers would help us but the result does not look good, revision names like "datasets-ByDimensionsDatasetRecordStorageManagerUUID-1.0.0" are just too long to be readable. I prefer to use shorter hash of the above names, which makes reproducible revision names (based on UUID5 instead of UUID4 used by Alembic). I want to have several independent revision trees, one tree for each manager, with a tree name corresponding to a manager name ("datasets", "collections", etc.) 
Each tree "root" revision will be a revision representing an empty schema, its revision name will be derived from from a tree/manager name and it will have a branch label associated with it which is a tree name. Starting from the root of a tree there will be branches (one or more) corresponding to implementation classes of the manager, the name of the class without module name will be used to label the branch (needs to be prefixed by manager name to keep it unique). Revision names for each revision will be derived from manager name, class name, and version string. For example revision identifier will be generated from a string "datasets-ByDimensionsDatasetRecordStorageManager-1.0.0", and corresponding label for the whole branch will be "datasets-ByDimensionsDatasetRecordStorageManager". An example of a tree with just one branch for dimensions manager: (branch=dimensions) 059cc7b7ef13 --- c2671338a42f --- 4f4f243fea82 --- 2dea9d903dde --- ... (branch=dimensions-StaticDimensionRecordStorageManager) First node above if the root, its name is derived from a string "dimensions". Following revisions correspond to versions 0.1.0, 0.2.0, 1.2.0 and so on for StaticDimensionRecordStorageManager. An examples of a tree with two branches for datasets manager: (branch=datasets-ByDimensionsDatasetRecordStorageManager) +------- 97325e2242c6 --- d6c57b37f896 --- 08ffc6ecf21f --- 635083325f20 / 059cc7b7ef13 (branch=datasets) \ +------- 2101fbf51ad3 (branch=datasets-ByDimensionsDatasetRecordStorageManagerUUID) Top branch corresponds to versions 0.0.1, 0.0.2, 0.0.3, 1.0.0 of ByDimensionsDatasetRecordStorageManager and bottom branch has one node for ByDimensionsDatasetRecordStorageManagerUUID version 1.0.0. I believe this can adequately capture linear history of versions for our manager classes. Merging of branches is possible in alembic but I do not think we need that functionality for our use case. 
It is also possible to specify a logical dependency between revisions from different trees; this may become useful for complex migrations.

One-shot migrations

What is not possible with the above approach is the use case that we need to implement first: migration between branches of the same tree. In our case we want to switch from ByDimensionsDatasetRecordStorageManager-1.0.0 to ByDimensionsDatasetRecordStorageManagerUUID-1.0.0. In git language this would actually be a simple checkout of an existing revision, and it is really simple to implement by destroying the schema and re-creating it from scratch, but unfortunately we also have data that need to be preserved. So our starting point is (replacing hashes with meaningful names):

       +------> int-0.0.1 --> int-0.0.2 --> int-0.0.3 --> int-1.0.0
      /
    datasets
      \
       +------> uuid-1.0.0

It is not possible to make two different "parents" for the same revision. One option to solve this would be an artificial "merge" which creates a new revision uuid-1.0.1 identical to uuid-1.0.0:

       +------> int-0.0.1 --> int-0.0.2 --> int-0.0.3 --> int-1.0.0
      /                                                       /
    datasets                   +------------------------------+
      \                        v
       +------> uuid-1.0.0 --> uuid-1.0.1

but this causes more complications of various kinds. It would be much better to avoid creating artificial revisions. One solution that I could think of is to create special "one-shot" migration trees that are normally invisible (so they don't confuse Alembic) but can be activated with a special option. An example of such a one-shot migration tree could be (the actual case will use hashes instead of names, of course):

    one-shot-int-1.0.0-to-uuid-1.0.0 --> int-1.0.0 --> uuid-1.0.0

The root of the tree is an "empty" migration; it is needed to identify the tree. The "migration" from the root to int-1.0.0 is also an empty one, but the last step involves an actual migration from an existing revision on one branch to an existing revision on another branch, and that does not change any "official" history.
Implementation

I played with the implementation of the above ideas by creating a separate CLI (as butler sub-commands in a separate package). It seems to work reasonably at the Alembic level, though I have not produced any actual schema migration yet. Here are some examples of commands that I already have.

Create a new revision tree (and the root node of that tree):

    butler smig-add-tree [--one-shot] tree-name

Show existing trees:

    butler smig-trees [-v]

Add a new revision:

    butler smig-revision [--one-shot] tree-name ManagerClassName X.Y.Z

Show revision history:

    butler smig-history [--one-shot] [tree-name] [-v]

(The --one-shot option enables the creation and use of the special one-shot migration trees; tree-name is the manager name.) All the above commands work with the history located in migration scripts, which are by default found in $DAF_BUTLER_SMIG_DIR/migrations. The database part of the migration process needs to be implemented and tested, of course; I think I have a reasonable idea for how to do that. After that I'm going to try to write an actual migration script for the UUID migration. I think it makes sense to keep all of this as a separate package; there is a complicated issue with dependencies on other packages and even on daf_butler (which we can discuss later). For now I have it in a separate package daf_butler_smig, but we probably need a better name for it.
Andy Salnikov added a comment -

I was thinking about whether we need to make migration scripts aware of the particular backend type (sqlite/postgres). In general our schema as expressed in SQLAlchemy should be more or less compatible. But there is at least one thing that is exposed to the actual data types used by the backend: the schema digest that we store in the butler_attributes table. This is what is stored in that table for the current schema in SQLite:

    version:lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager | 1.0.0
    schema_digest:lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager | 664d6a56d87b5ac890308a91a06cd145
    version:lsst.daf.butler.registry.dimensions.static.StaticDimensionRecordStorageManager | 6.0.0
    schema_digest:lsst.daf.butler.registry.dimensions.static.StaticDimensionRecordStorageManager | 83022175a1fbb71edd4f5243a1775146
    version:lsst.daf.butler.registry.collections.synthIntKey.SynthIntKeyCollectionManager | 2.0.0
    schema_digest:lsst.daf.butler.registry.collections.synthIntKey.SynthIntKeyCollectionManager | 1d45208fb4ad1b51bed29321deb787de
    version:lsst.daf.butler.registry.datasets.byDimensions._manager.ByDimensionsDatasetRecordStorageManagerUUID | 1.0.0
    schema_digest:lsst.daf.butler.registry.datasets.byDimensions._manager.ByDimensionsDatasetRecordStorageManagerUUID | 338aa9bda15c2dc82ad04ac55e1b56bc
    version:lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager | 0.2.0
    schema_digest:lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager | 79a657af5cf15550e6d1f455ad4dd8c2
    version:lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager | 0.2.0
    schema_digest:lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager | 3558b84d12fa04082ffd6935e0488922

and Postgres:

    version:lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager | 1.0.0
    schema_digest:lsst.daf.butler.registry.attributes.DefaultButlerAttributeManager | 664d6a56d87b5ac890308a91a06cd145
    version:lsst.daf.butler.registry.dimensions.static.StaticDimensionRecordStorageManager | 6.0.0
    schema_digest:lsst.daf.butler.registry.dimensions.static.StaticDimensionRecordStorageManager | f98e0c6a8c008e052b78b97c1e546100
    version:lsst.daf.butler.registry.collections.synthIntKey.SynthIntKeyCollectionManager | 2.0.0
    schema_digest:lsst.daf.butler.registry.collections.synthIntKey.SynthIntKeyCollectionManager | 4e4bcb15f68e9e7a10cb4b01a0a47ec0
    version:lsst.daf.butler.registry.datasets.byDimensions._manager.ByDimensionsDatasetRecordStorageManagerUUID | 1.0.0
    schema_digest:lsst.daf.butler.registry.datasets.byDimensions._manager.ByDimensionsDatasetRecordStorageManagerUUID | 0389bea276b430b9da7330c231a39af7
    version:lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager | 0.2.0
    schema_digest:lsst.daf.butler.registry.opaque.ByNameOpaqueTableStorageManager | 79a657af5cf15550e6d1f455ad4dd8c2
    version:lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager | 0.2.0
    schema_digest:lsst.daf.butler.registry.bridge.monolithic.MonolithicDatastoreRegistryBridgeManager | 71d2aa0e9873e78d51808aa5e09c5bea

Digests match for only two managers out of six, so that difference needs to be taken into account when we update the butler_attributes contents during the upgrade. There might be something else that I am missing now, of course; we'll discover it as we start writing migrations.

Jim Bosch added a comment -

Sorry, haven't caught up completely on Alembic plans, but will try to today. But since I saw this flash by: the biggest difference I am aware of between PostgreSQL and SQLite is the timespan representation.

Jim Bosch added a comment -

Finally caught up.  The only thing I might disagree with here is whether we should reconsider making the connection between branches of the same tree a merge.  Because it's our main mechanism for preserving backwards compatibility, I think that may be our most common kind of migration, and it seems a shame to make it such a special case.  I also wonder a bit whether that extra "merge commit" might at least sometimes play a useful role, e.g. keeping around the original versions of some information in a different table if it isn't possible to reconstruct it from the new version of the information.  And if we hash merge commits differently from the non-merge commits, perhaps our own version system would still be able to treat "merge" revisions as identical to their closest non-merge parent?

 Link This issue is triggered by RFC-779 [ RFC-779 ]
Andy Salnikov added a comment -

I was experimenting/thinking more about the branches and how Alembic manages them. I believe that we cannot do merges to switch between branches. The issue here is that our branches are exclusive: at a particular moment the database schema can only be on one branch or the other. A merge, on the other hand, requires that both parent revisions exist in the database. When I drew the graph above I forgot about the distinction between Alembic history and database state; while the history can have arbitrary transitions, the revisions that may exist in the database can limit that freedom.

I keep thinking about other possibilities, but I believe that switching between branches is very special and does not really map onto branching/merging (again, it is more similar to "git checkout" than "git merge"). Having multiple Alembic revisions for one registry schema version is problematic and complicates management, because those multiple revisions cannot be merged, so we would need to track all of them separately (Alembic does not have migrations that can apply to more than one revision).

I also posted a question to alembic forum: https://groups.google.com/g/sqlalchemy-alembic/c/W-gJGuKyHZY, maybe there will be some suggestions there.

 Link This issue is triggered by RFC-777 [ RFC-777 ]
Andy Salnikov added a comment - - edited

I think I have an initial implementation of my ideas for abusing Alembic features to do what we need. Clearly Alembic does not support multi-path migrations between existing revisions, and any workarounds involving branching/merging do not look manageable. The solution that I use now, hiding parts of the trees, feels a bit hackish, but it is cleaner than the other options I could think of. I can probably do a small presentation, if people are interested, about how it is done.

Anyway, right now I'm thinking about our immediate problem, migrating datasets from integer IDs to UUIDs, and how to express that in terms of alembic/sqlalchemy. The simple strategy that I wanted to explore:

• Add a new column to the dataset table with a temporary name (e.g. uuid_id), with a type matching the UUID type, non-indexed, with default NULL.
• Fill the values in that column based on all other info; this will involve reading a bunch of other tables.
• In all dependent tables create a new column and fill it with values matching the content of the dataset table (the dependent tables are dataset_calibs_*, dataset_tags_*, dataset_location, dataset_location_trash, file_datastore_records).
• Remove the FK constraints from all dependent tables, remembering the names of the constraints, and drop the dataset_id columns.
• Remove the PK from the dataset table, drop the id column, rename the uuid_id column to id, and add a PK constraint with the same name as before.
• For each dependent table rename the new column to dataset_id and add the FK constraint.
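A hedged sketch of the per-table statement sequence for this in-place strategy; the constraint name and the `id_map(old_id, uuid)` mapping table are made-up illustrations, and real DDL would differ per backend (and would normally go through Alembic `op` calls):

```python
def uuid_swap_ddl(table: str, fk_name: str) -> list:
    """Ordered, illustrative SQL for swapping an integer dataset_id
    column for a pre-filled UUID one in a dependent table."""
    return [
        # 1. new column, nullable, unindexed
        f"ALTER TABLE {table} ADD COLUMN uuid_id UUID",
        # 2. fill from a hypothetical id_map(old_id, uuid) table
        f"UPDATE {table} SET uuid_id = "
        f"(SELECT uuid FROM id_map WHERE id_map.old_id = {table}.dataset_id)",
        # 3. drop the old FK and column
        f"ALTER TABLE {table} DROP CONSTRAINT {fk_name}",
        f"ALTER TABLE {table} DROP COLUMN dataset_id",
        # 4. rename and re-establish the FK under its old name
        f"ALTER TABLE {table} RENAME COLUMN uuid_id TO dataset_id",
        f"ALTER TABLE {table} ADD CONSTRAINT {fk_name} "
        f"FOREIGN KEY (dataset_id) REFERENCES dataset (id)",
    ]
```

The ordering matters: the UUID column must be fully populated before the old constraints and columns can be dropped.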

This might work, except that it will likely take an enormous amount of time. The number of records in all tables is probably at the scale of 100 million (in the current DC2 database). Using a simple UPDATE on each record separately to fill its UUID value could probably take several days. I believe we have to do it differently, via bulk insert (by bulk I mean from an external CSV file). Here is probably a better strategy:

• dump all tables as CSV files to disk
• generate a dataset_id -> UUID mapping
• transform all CSV files, replacing the integer id with the UUID
• rename the existing tables
• create new empty tables (possibly without indices initially)
• import all CSV data into the new tables
• add indices
• do some basic checks
• drop the old tables.
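The CSV-transform step could look roughly like this; the function name, the column name default, and the in-memory mapping are assumptions for illustration (the real mapping is generated from the deterministic UUID logic in obs_base/daf_butler):

```python
import csv


def remap_csv_ids(src, dst, mapping, id_col="dataset_id"):
    """Copy a CSV stream from src to dst, replacing the integer ids in
    id_col with the UUIDs from mapping (a dict of int -> uuid.UUID
    built beforehand from the dataset table)."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[id_col] = str(mapping[int(row[id_col])])
        writer.writerow(row)
```

Streaming row by row keeps memory flat regardless of table size; only the id mapping itself has to fit in memory (or be kept in its own lookup table).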

Bulk dump/insert is of course a backend-specific operation, so there will be code that needs to know which backend is in use. Testing all of that is not easy either.

Jim Bosch added a comment - - edited

What about:

• Create a throwaway table with just the integer dataset_id and UUID columns, populated by INSERTs emitted by Python code
• Create a new schema with the UUID configuration
• Do a lot of INSERT INTO <new_schema>.<table> (...) SELECT ... FROM <old_schema>.<table> JOIN id_mapping_table

?

It might still take a day or two (guessing from how long it took to do the original INSERTs), but if it's a separate schema we don't need downtime, and we can just swap to the new one when it's all done.
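A hedged sketch of the statement builder for that approach; the schema names, the `id_map` mapping-table name, and its `old_id`/`uuid` columns are assumptions standing in for whatever the real migration would use:

```python
def copy_table_sql(table, columns, id_col="dataset_id",
                   old_schema="old", new_schema="new", map_table="id_map"):
    """Build one INSERT ... SELECT that copies a table between schemas,
    rewriting the integer id column to a UUID via a mapping table."""
    select_cols = ", ".join(
        f"{map_table}.uuid" if c == id_col else f"t.{c}" for c in columns
    )
    return (
        f"INSERT INTO {new_schema}.{table} ({', '.join(columns)}) "
        f"SELECT {select_cols} "
        f"FROM {old_schema}.{table} t "
        f"JOIN {map_table} ON {map_table}.old_id = t.{id_col}"
    )
```

One such statement per table keeps the whole copy server-side, avoiding the round trips of per-row UPDATEs; the open question above is how well the JOIN scales at tens of millions of rows.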

Andy Salnikov added a comment -

I am worried about JOIN performance on 30M-record tables. I guess some reasonable scale tests are needed to understand what we can expect from that. I think downtime is needed in any case; otherwise we may miss new records that are inserted while we are doing the migration.

Jim Bosch added a comment - - edited

We might be able to reduce downtime by doing one migration up-front while the repository is in use, and then patching up during downtime by filtering on the ingest timestamp column relative to the last dataset we handled in the first pass (but that would complicate the INSERT queries for the second pass).

Andy Salnikov added a comment -

But we also need to track datasets that were removed, don't we? That, I think, again implies a sort of full scan and comparison to the other table.

Andy Salnikov added a comment -

As we agreed, the result of this ticket will be a technote with a proposed solution, reviewed by Kian-Tat Lim. The PR is here: https://github.com/lsst-dm/dmtn-191/pull/1 and the formatted document is at https://dmtn-191.lsst.io/v/DM-29593/index.html. Everyone is welcome to read and comment, of course. We may need a better name for the package (if we decide that this can work in general).

 Reviewers Kian-Tat Lim [ ktl ] Status In Progress [ 3 ] In Review [ 10004 ]
 Link This issue is triggering DM-30186 [ DM-30186 ]
 Link This issue blocks DM-30186 [ DM-30186 ]
Andy Salnikov added a comment -

Kian-Tat Lim, pinging you about whether we can merge this ticket for now (it is blocking the DM-30186 merge) and make a new ticket if we decide later that we should do it differently?

Kian-Tat Lim added a comment -

Sorry, I didn't record that we agreed at the meeting that this was OK to move forward as is, but I will attempt to diagram an alternate solution that might be more "alembic-aligned".

 Status In Review [ 10004 ] Reviewed [ 10101 ]
Andy Salnikov added a comment -

Thanks for the review. Merged the technote (and daf_butler_migrate, but most of the work on it is done on DM-30186, so I did not ask for review here).

 Resolution Done [ 10000 ] Status Reviewed [ 10101 ] Done [ 10002 ]

#### People

Assignee:
Andy Salnikov
Reporter:
Jim Bosch
Reviewers:
Kian-Tat Lim
Watchers:
Andy Salnikov, Jim Bosch, Kian-Tat Lim, Tim Jenness