Data Management / DM-24803

Plan for Gen3 schema migrations using Alembic


    Details

    • Story Points:
      10
    • Sprint:
      DB_F20_06
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      As part of the plan for supporting schema migrations we would like to investigate the possibility of using Alembic. Christopher Stephens [X] has done a preliminary investigation and reports a positive experience. Now we need to see what is involved in modifying daf_butler to properly support this.

        Attachments

          Issue Links

            Activity

            Andy Salnikov added a comment -

            I have done some reading on Alembic; here is a quick summary of the points I thought important (the docs are here: https://alembic.sqlalchemy.org/en/latest/index.html):

            • It is quite advanced, and I think it should support all reasonable changes to the schema that regular projects need
            • The schema migration process is a sequence of upgrade (or downgrade) steps, each implemented as a separate Python method
            • Those methods are frequently written by hand using a mix of alembic and sqlalchemy calls
            • There is also auto-generation of migration scripts based on the diff between the existing database schema and the MetaData schema in memory; auto-generation works for many simple cases, but some changes cannot be detected or need some help (https://alembic.sqlalchemy.org/en/latest/autogenerate.html#what-does-autogenerate-detect-and-what-does-it-not-detect)
            • Data migration is also supported (inserting/updating table contents) but obviously needs to be added to the upgrade script by hand
            • Alembic does not use sequential version numbers; instead it uses random hex strings as revision identifiers, and the ordering of migrations is specified explicitly in the migration scripts
            • In the simplest cases the migrations are applied linearly, one after another
            • Alembic also supports more complex scenarios:
              • branching (somewhat similar to git branches), where you can apply two different migrations to the same revision and possibly merge them later (I think this is still considered a beta feature)
              • multiple bases (a.k.a. a forest), where more or less independent parts of the database can be migrated independently of each other (though I think you can also specify dependencies between revisions in that case)
            • Alembic supports online and offline upgrades: in online mode it runs against a live database; in offline mode it generates an SQL script that can be applied to the database later
            • From what I understood, complex SQLite migrations may need the special "batch" mode due to SQLite's poor support for the ALTER command.

            In general I think this is a reasonably powerful tool that we could use in our environment if we need this sort of tool. It is closely related to SQLAlchemy, and its documentation is in about the same state of mess as SQLAlchemy's.
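
            To illustrate the SQLite "batch" mode mentioned above, here is a minimal hand-written migration sketch; the table and column names are made up for illustration and are not part of the Registry schema. Batch mode recreates the table and copies its rows behind the scenes, which works around SQLite's limited ALTER TABLE support:

            from alembic import op
            import sqlalchemy as sa

            def upgrade():
                # Batch mode: alembic creates a new "my_table" with the modified
                # definition, copies the data over, and drops the old table.
                with op.batch_alter_table("my_table") as batch_op:
                    batch_op.add_column(sa.Column("extra_info", sa.String(64), nullable=True))
                    batch_op.drop_column("obsolete_flag")

            def downgrade():
                with op.batch_alter_table("my_table") as batch_op:
                    batch_op.drop_column("extra_info")
                    batch_op.add_column(sa.Column("obsolete_flag", sa.Boolean(), nullable=True))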

            Andy Salnikov added a comment -

            Some obvious things to say about (schema) versioning in general:

            • I think it is obvious that the Registry schema will go through more than one version over its lifetime. It would help to design the initial version to be as flexible as possible, but at some point it will have to change.
            • The schema version is of course tied to the client-side software version, and version compatibility is probably the biggest issue in the whole picture.
            • It is not possible to always keep things fully backward- and forward-compatible; in many cases people sacrifice forward compatibility for the sake of new features or fixes (which basically means that after we upgrade the schema we all have to switch to the new software version, or alternatively we cannot perform the schema upgrade while some older SW version is still in use).
            • With a project as complex as LSST I can imagine that we will always have several releases in use by different groups/sub-projects, so the issue of compatibility is even more important. One possible way to remedy this is to keep several versions of the database, one for each corresponding SW release, and plan their management and upgrade path accordingly.
            • With some (probably significant) effort it may be possible for some SW version to support several recent schema versions, if this is desired.
            • Related to that is the idea that the SW needs to know the version of the schema in the database, either to report that the database version is not compatible or to select a version of client code that can work with that schema, so clients should be able to identify the schema revision in the database.
            • Alembic revision "numbers" are not super-useful for identification: they are random strings, and potentially there can be several of them (in the case of branches or multiple bases). It may be beneficial for Registry to keep a more user-friendly version (number) as metadata that can be queried (or several versions for different identifiable parts of the Registry schema); see the sketch below.
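
            As a purely hypothetical sketch of the last point, a queryable version record could be as simple as a small table defined alongside the rest of the Registry schema (table and column names here are illustrative, not a design):

            import sqlalchemy as sa

            metadata = sa.MetaData()

            # One row per identifiable part of the Registry schema, e.g. ("core", "1.2.3")
            # and ("dimensions", "3.2.1"); clients query this to check compatibility.
            registry_version = sa.Table(
                "registry_version",
                metadata,
                sa.Column("component", sa.String(64), primary_key=True),
                sa.Column("version", sa.String(32), nullable=False),
            )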

            I think it would also be helpful to try to identify one or more use cases for possible future schema migration to understand what options we should or should not support.

            Andy Salnikov added a comment -

            A few observations from my simple tests with alembic.

            Building current schema

            Alembic supports different strategies for building the latest version of the schema:

            • start with an empty schema and apply all migration steps in sequence; this is the conventional approach for projects that use alembic as their main schema-definition tool
            • use a different tool to generate the current/latest schema and tell alembic the revision of that schema

            As we build our schema in daf_butler, the latter method is what we need if we adopt alembic. In that case the schema revision still has to be defined in alembic (as a regular migration from the preceding revision). When a completely new database is created with butler tools its revision is not known to alembic; one needs to run the alembic stamp <revision> command-line tool to store the revision number in the database, or alternatively use the alembic Python API to do the same (a sketch follows below).
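
            For reference, a minimal sketch of doing the stamping through the Python API instead of the command line; the config-file path, database URL and revision name below are made up:

            from alembic import command
            from alembic.config import Config

            # Equivalent to running `alembic stamp w.2020.22`: it only records the
            # revision in the alembic_version table, no migration is executed.
            alembic_cfg = Config("alembic.ini")  # hypothetical path to the alembic config
            alembic_cfg.set_main_option("sqlalchemy.url", "sqlite:///DATA/gen3.sqlite3")
            command.stamp(alembic_cfg, "w.2020.22")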

            Revision "numbers"

            By default alembic generates a random string for the revision when it creates a new migration, but it is also possible to specify a user-defined revision name instead. Potentially we could use the daf_butler schema version number (whatever it is) as the alembic revision if that would simplify things. The database's current alembic revision(s) is/are stored in the alembic_version table in the database (it can contain more than one record in the case of branches/forest):

            CREATE TABLE alembic_version (
                    version_num VARCHAR(32) NOT NULL,
                    CONSTRAINT alembic_version_pkc PRIMARY KEY (version_num)
            );
            

            Revisions appear in the migration scripts; each script contains its revision name and its parent revision name(s). Alembic needs to look at those revisions to build the full migration path.

            Auto-generated migrations

            For simple cases alembic can generate a migration script based on the new schema (a sqlalchemy MetaData instance) and the current schema in the database file. As a trivial exercise I built the ci_hsc_gen3 database from w.2020.19 and ran the auto-generation tool on that. To get the current (w.2020.22) schema I had to add a few lines of code to the alembic/env.py script:

            import os

            from lsst.daf.butler import Butler

            def _make_metadata():
                """Build sqlalchemy metadata for the current registry schema."""
                # Load the butler configuration and let daf_butler construct its schema.
                config = os.path.expandvars("$CI_HSC_GEN3_DIR/DATA/butler.yaml")
                butler = Butler(config)
                # Use the registry's private MetaData as the autogenerate target.
                return butler.registry._db._metadata

            target_metadata = _make_metadata()
            

            And then executed:

            $ alembic revision --rev-id w.2020.22 -m 'migration from w.2020.19 to w.2020.22' --autogenerate
            INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
            INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
            INFO  [alembic.autogenerate.compare] Detected removed index 'dataset_composition_fkidx_component_dataset_id' on 'dataset_composition'
            INFO  [alembic.autogenerate.compare] Detected removed index 'dataset_composition_fkidx_parent_dataset_id' on 'dataset_composition'
            INFO  [alembic.autogenerate.compare] Detected removed table 'dataset_composition'
              Generating /project/salnikov/gen3-middleware/daf_butler/alembic/versions/w.2020.22_migration_from_w_2020_19_to_w_2020_22.py ...  done
            

            The generated migration script alembic/versions/w.2020.22_migration_from_w_2020_19_to_w_2020_22.py looks like this:

            """migration from w.2020.19 to w.2020.22
             
            Revision ID: w.2020.22
            Revises: 
            Create Date: 2020-06-02 17:05:13.335899
             
            """
            from alembic import op
            import sqlalchemy as sa
             
             
            # revision identifiers, used by Alembic.
            revision = 'w.2020.22'
            down_revision = None
            branch_labels = None
            depends_on = None
             
             
            def upgrade():
                # ### commands auto generated by Alembic - please adjust! ###
                op.drop_index('dataset_composition_fkidx_component_dataset_id', table_name='dataset_composition')
                op.drop_index('dataset_composition_fkidx_parent_dataset_id', table_name='dataset_composition')
                op.drop_table('dataset_composition')
                # ### end Alembic commands ###
             
             
            def downgrade():
                # ### commands auto generated by Alembic - please adjust! ###
                op.create_table('dataset_composition',
                sa.Column('component_name', sa.VARCHAR(length=32), nullable=False),
                sa.Column('simple', sa.BOOLEAN(), nullable=True),
                sa.Column('parent_dataset_id', sa.BIGINT(), nullable=False),
                sa.Column('component_dataset_id', sa.BIGINT(), nullable=True),
                sa.CheckConstraint('length(component_name)<=32', name='dataset_composition_len_component_name'),
                sa.CheckConstraint('simple IN (0, 1)'),
                sa.ForeignKeyConstraint(['component_dataset_id'], ['dataset.id'], name='fkey_dataset_composition_dataset_id_component_dataset_id', ondelete='CASCADE'),
                sa.ForeignKeyConstraint(['parent_dataset_id'], ['dataset.id'], name='fkey_dataset_composition_dataset_id_parent_dataset_id', ondelete='CASCADE'),
                sa.PrimaryKeyConstraint('component_name', 'parent_dataset_id')
                )
                op.create_index('dataset_composition_fkidx_parent_dataset_id', 'dataset_composition', ['parent_dataset_id'], unique=False)
                op.create_index('dataset_composition_fkidx_component_dataset_id', 'dataset_composition', ['component_dataset_id'], unique=False)
                # ### end Alembic commands ###
            

            This looks simple, but for more complicated cases the generated script certainly needs review and modification (auto-generation cannot handle all possible changes). Data migration also has to be written by hand, of course.

            A general comment: alembic is a migration tool, not a schema-definition tool. It does not know what the complete schema definition is for any revision, and it cannot verify that the schema in the database corresponds to a particular revision; it blindly relies on the consistency of the revision stored in the alembic_version table with the revisions in the migration scripts. Obviously managing those needs a lot of attention, e.g. alembic stamp should be run only after checking it three times and then once more.

            Andy Salnikov added a comment -

            Now for the "plan" part of this ticket. If we want to use Alembic as the tool to manage the schema/data-migration part of the whole migration process, here is what I think the process could look like:

            • When there is a new version of the schema in daf_butler with a new version number/revision (potentially the schema can have multiple versions, e.g. a core version + a dimensions version; this is not critical), we need to prepare an alembic migration for it before this new daf_butler is deployed
            • For each new daf_butler schema version there should be a corresponding alembic revision; the name of that revision can either be random or we can decide to encode the daf_butler schema version into the alembic revision. The latter can work if daf_butler versioning is relatively simple; a couple of versions could probably be encoded in a revision such as "core-1.2.3+dim-3.2.1"
            • Potentially we could separate the management of different parts of the schema (like core and dimensions) into separate alembic branches/forest. Branching is still considered beta in alembic, so it may need some experimenting to decide whether it can work.
            • A new revision in alembic can be created either manually or auto-generated from the previous version of the schema in a database and the new MetaData instance generated by the code. Auto-generation does not always work; I expect that in many cases it will need some adjustment or at least a review. If data migration is needed it has to be added to the alembic revision manually.
            • daf_butler is going to keep its version(s) in a special metadata table, and alembic has its own alembic_version table. I think there is a good reason to keep these separate (i.e. don't make daf_butler depend on alembic). Theoretically it may also be possible to teach alembic to construct its revision number from the daf_butler metadata table, though looking at the docs I cannot say whether that can be done now.
            • If the daf_butler metadata and alembic version tables are separate, then an additional step is needed when creating a completely new database with daf_butler tools: it is better to add the alembic version table at the same time, with the matching alembic revision number (e.g. using the alembic stamp command).
            • Also, if the daf_butler metadata table is separate from the alembic version table, then the alembic migration script will need to update the metadata versions for daf_butler (i.e. do a data migration on the metadata table).
            • Migration of existing databases needs to be synchronized with the deployment of new daf_butler releases. Migration can be run in either online or offline mode (a sketch follows this list):
              • in online mode alembic runs against the live database; the alembic configuration has to specify the sqlalchemy connection string for the database
              • in offline mode alembic generates SQL that can be executed against the database later, but I think it still needs a connection to the database to extract its current revision (it may be useful to understand whether it can run completely offline by specifying both the current revision and the sqlalchemy dialect; the command-line tool does not allow that, but the API could support it)
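
            A hedged sketch of what the two modes could look like through the Python API (the config path, connection string and revision names are illustrative):

            from alembic import command
            from alembic.config import Config

            alembic_cfg = Config("alembic.ini")  # hypothetical alembic configuration
            alembic_cfg.set_main_option("sqlalchemy.url", "postgresql://host/registry")

            # Online mode: apply migrations directly to the live database.
            command.upgrade(alembic_cfg, "w.2020.22")

            # Offline mode: emit the SQL for a revision range instead of executing it;
            # the output can be reviewed and applied to the database later.
            command.upgrade(alembic_cfg, "w.2020.19:w.2020.22", sql=True)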

            In general I think the above "plan" is more or less abstract; something very similar would have to be done for any other tool we may decide to use or build for schema management.

            Andy Salnikov added a comment -

            A few points on the implementation-plan side of things:

            • I think we should start by implementing the infrastructure needed to support migration in general:
              • support for metadata in general, and for schema versions stored in the database in a daf_butler-recognized format; I think this is critical for schema stability and needs to be done first
              • ways to specify the current schema versions in software; it would also be nice to detect schema changes purely on the software side (maybe by making a sort of digest of the MetaData and comparing it to a hardcoded version/digest; see the sketch after this list)
              • maybe define an alembic-neutral CLI plugin which allows managing/querying versions in both software and database
            • Once this is done we can implement actual schema updates using alembic
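
            As a hedged illustration of the "digest for Metadata" idea in the list above, something along these lines could detect schema changes purely on the software side (the exact set of attributes worth hashing is a detail to be decided):

            import hashlib

            import sqlalchemy as sa

            def schema_digest(metadata: sa.MetaData) -> str:
                """Compute a stable digest of a sqlalchemy MetaData instance."""
                parts = []
                for table in sorted(metadata.tables.values(), key=lambda t: t.name):
                    for column in table.columns:
                        parts.append(f"{table.name}.{column.name}:{column.type}")
                return hashlib.md5(";".join(parts).encode()).hexdigest()

            # The software would compare schema_digest(current_metadata) against a
            # hardcoded version/digest pair to detect that the schema has changed.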

            Tim Jenness added a comment -

            This approach is fine with me. Please ticket and start.

            Jim Bosch added a comment -

            The only thing that I think is missing here is how to handle changes in configuration that change the schema (both in the dimensions configuration and in registry managers or datastore subclasses). I think those are a potentially very powerful tool for allowing the same daf_butler software version to be compatible with many different in-use schemas (by allowing repos to have different configurations, and hence different schemas).

            Maybe you already had the dimensions part in mind, but I think the schema version we use in Alembic needs to encode those class names or some proxy for them as well (as per discussion on the confluence page) - unless you're thinking we should use the Alembic "forest" functionality for those instead.

            I think we also need to modify this statement slightly:

            When there is a new version of the schema in daf_butler with a new version number/revision (potentially the schema can have multiple versions, e.g. a core version + a dimensions version; this is not critical), we need to prepare an alembic migration for it before this new daf_butler is deployed

            ...because it's really about "deploying" a configuration, not a daf_butler software version, and of course "deploying" a configuration just means creating a repo with a configuration that no one else has used before. Often (e.g. when we modify the daf_butler default configuration) we can anticipate a configuration change when we make a software change, and hence have a migration ready to go, but we should not assume that this is always the case; sometimes a migration between two never-default configurations may be needed, and it will have to be created on demand (I don't think creating migrations that connect all possible configurations is feasible).

            Tim Jenness added a comment -

            I guess we also have the "Tim changed where the formatter lives and that broke everyone" or "the instrument class changed" cases, which are easy to fix (update all the values in the relevant table). These probably don't need versioning at all, since updating values from A to B is a no-op for people who have already done it. I'm hoping we won't have these cases often, but we might (it's the risk of burning in information at write time).
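
            A hedged sketch of what such a data-only fix could look like inside an alembic migration; the table, column and class paths below are invented for illustration and are not the actual daf_butler schema:

            from alembic import op
            import sqlalchemy as sa

            def upgrade():
                # Rewrite the stored location of a formatter class; re-running this is
                # a no-op for repositories that were already updated.
                op.execute(
                    sa.text("UPDATE file_datastore_records SET formatter = :new WHERE formatter = :old")
                    .bindparams(
                        old="lsst.some.old.module.SomeFormatter",   # hypothetical old path
                        new="lsst.some.new.module.SomeFormatter",   # hypothetical new path
                    )
                )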

            Andy Salnikov added a comment -

            I'm trying to imagine all the kinds of use cases that could potentially exist w.r.t. combinations of schema versions and configurations, and it looks sort of explosive. Maybe the best way forward is to produce a set of specific examples that we need to support and work from that baseline. I'll try to put together something more specific later on this ticket that we could discuss in detail.

            Regarding Tim's use case: I'm not sure how this is going to work if you also want some sort of compatibility with a previous release. If you change existing data stored in the database this could break all existing software versions (except the newest one). We should at least tell those poor folks they need to upgrade, which means that the version has to be updated.

            Jim Bosch added a comment -

            I think adding backwards compatibility and a deprecation period to moving-code-around changes should generally be pretty easy on the code side; it'll just mean we'll have to accept moving more slowly on finalizing those sorts of changes than we are right now. I don't think we can safely capture those in any kind of daf_butler-based version, because the whole point is to not define them there, but I think they could be covered by tying migrations to configuration (while making that even more combinatorially explosive).

            Andy Salnikov added a comment -

            I believe the moving/renaming issue is easy to avoid altogether if we don't store fully-qualified class names in the configuration and instead use some smart factory method to hide the actual location of the class. I don't think we need any complicated dynamic plugin-like mechanism; a simple static mapping should be sufficient (a sketch follows below). I guess all extension types will live either in daf_butler or in other packages that we control, so managing that static map should not be an issue.
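
            A minimal sketch of the kind of static mapping and factory meant here; the short name and class path are hypothetical examples:

            from importlib import import_module

            # Configuration would store only the short name; the fully-qualified paths
            # live in this one mapping and can change without touching existing repos.
            _FORMATTER_CLASSES = {
                "FitsExposure": "lsst.obs.base.formatters.fitsExposure.FitsExposureFormatter",
            }

            def formatter_class(short_name: str):
                """Resolve a short formatter name to its class, hiding the actual location."""
                module_name, _, class_name = _FORMATTER_CLASSES[short_name].rpartition(".")
                return getattr(import_module(module_name), class_name)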

            More general comments on other complicated issues:

            • It is true that configuration (or some part of it) defines the schema, and we are going to have several components that may evolve more or less independently of each other
            • When I think about the schema version I really mean the database schema definition and not metadata/configuration; metadata is more dynamic in nature, and a different configuration does not necessarily mean a different schema. Imagine that we have two different implementations of some component that can work on exactly the same schema: the configuration will be different, but we do not need to do any migrations (especially if we want to switch back and forth between the two implementations, or if configuration can be overridden at run time without updating metadata).
            • To me this means that using the combined component configuration as a version is not super-useful, and it is also hard to tell which configurations are compatible and to what extent. Having one version for the schema would be much easier to use (e.g. to detect compatibility) and to manage. The configuration would have to be consistent with that global version number; it may not be easy to verify compatibility at run time, and an incorrect configuration may not be noticed until something blows up (but in general we should expect that migrations keep things consistent).
            • We still have somewhat independent components like datastore which could evolve separately. I think some sort of versioning would be needed for that too, so for consistency we should probably include it in a global version as well, or use some sort of branching/forest (as a general concept, not necessarily its Alembic implementation). Also, the fact that it is very separate today does not guarantee that it will always be so; we could decide later that some dependency is needed.
            • Dimensions give me all sorts of headaches when I try to think about how they could be migrated. The simplest example I could think of is adding a new dimension to the system, and even such a trivial change could result in an avalanche of schema changes for our current implementation. I don't think Alembic auto-generation could handle something like that, and writing the migration by hand would be a nightmare. OTOH I do not know whether we will ever need to extend dimensions; maybe all modifications will be limited to adding dimension metadata? I was also thinking that the dimension names, or a subset of the dimensions configuration, could be added to the metadata to help with validation (unless we are already doing something like that in the current system).
            • I'm not sure yet whether I want to explore "forest" versioning, where separate components (e.g. core+datastore+dimensions) have their own versions and evolve independently (with occasional dependencies between trees). If we do not envision frequent schema updates then it may be overkill; managing a single version for everything is certainly easier when updates are rare.

             

            Tim Jenness added a comment -

            I believe the moving/renaming issue is easy to avoid altogether if we don't store fully-qualified class names in the configuration and instead use some smart factory method to hide the actual location of the class.

            Yes, we very deliberately did not try to store a master list of every possible short name for instrument and formatter classes in the daf_butler configuration. This was partly because we were trying to make daf_butler usable, at least in theory, by people outside of Rubin, and also because we wanted people to be allowed to experiment without having to make a PR on daf_butler. I suppose a butler subcommand approach could work, where each package that defines a formatter also defines a yaml file, readable by butler, that specifies all the relevant mappings.

            Andy Salnikov added a comment -

            I agree with Tim's last comment: instrument names are definitely outside of our control and we need a more dynamic way to specify those.

            I have created a couple of tickets for myself to start working on version support and metadata in general (linked to this ticket). I'm going to close this ticket if there are no objections.


              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              tjenness Tim Jenness
              Watchers:
              Andy Salnikov, Christopher Stephens [X] (Inactive), Jim Bosch, Michelle Gower, Tim Jenness
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.