Data Management / DM-25355

Add support for defining schema version defined by daf_butler


    Details

    • Story Points:
      6
    • Sprint:
      DB_F20_06
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      As discussed in DM-24803, we need a way to manage schema versions defined in daf_butler. This needs some additional decisions, though:

      • do we want one global schema version or a version per component?
      • datastore seems to be a special case
      • dimensions are also a very special case
      • how do we detect a schema change?
      • etc.

        Attachments

          Issue Links

            Activity

            Jim Bosch added a comment -

            I'm happy to close this ticket now (I'll go make some comments on GitHub after this, so there may be some minor things worth waiting for there). I'll continue the big-picture conversation here in the meantime.

            I think we might mean different things sometimes when we talk about schema versions (I may not even be consistent myself all the time). My mental model is that there are many human-meaningful identifiers (class names, version tuples, etc.) that together fully determine the static schema. There is no human-meaningful identifier for the full (static) schema - all we can ever have is a hash or other digest of all of the human-meaningful identifiers. That hash doesn't imply any kind of version ordering, and I figured that would be okay because Alembic also uses hashes to identify distinct schemas.
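
            To make that concrete, here is a minimal sketch (illustration only, not daf_butler's actual API; `schema_digest` and its input shape are hypothetical) of deriving a full-schema digest from the human-meaningful identifiers, where each manager slot contributes a class name and a version tuple:

```python
import hashlib


def schema_digest(managers):
    """Hash a mapping of slot -> (class name, version tuple) into one digest.

    The result identifies the full static schema but implies no ordering,
    exactly as described above.
    """
    h = hashlib.sha1()
    for slot in sorted(managers):  # sort slots so the digest is stable
        cls_name, version = managers[slot]
        h.update(f"{slot}={cls_name}-{'.'.join(map(str, version))};".encode())
    return h.hexdigest()


digest = schema_digest({
    "datasets": ("FasterDatasetManager", (72, 3, 1)),
    "collections": ("DefaultCollectionManager", (1, 0, 0)),
})
```

            Any change to a class name or version tuple produces a different digest, but nothing about the two digests says which schema is "newer".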

            I was thinking about case (A), but I think it complicates things a lot. Suppose we have two implementations of the same extension (e.g. FasterDatasetManager and NicerDatasetManager) with different schemas, and we want to be able to switch between them. These are the questions that I do not know how to answer:

            - how do we allocate version numbers for those incompatible things that exist at the same time, when each of them evolves with potential schema updates in the future?

            I was thinking that under (A), each class gets its own totally independent version number, so you could be switching from FasterDatasetManager 72.3.1 to NicerDatasetManager 1.5.3. There wouldn't be any expectation that those numbers would be comparable, or that they would totally control any part of the schema. All we require is that we can hash them with the class names to get a hash for the full schema.

            - how do people who want to change their config from one extension to another know what the schema version for that other extension is? I think it means that the schema version/digest should be defined somewhere in the code instead of in a config file.

            Yes, under this approach each class's version would be fixed in the code that defines that class, and only the class name would appear in the config.

            - how do we describe that mess to a schema migration tool like Alembic? Do we want to support migration from any combination of versions to any other combination? Probably not; that will be too hard to manage.

            My mental model here (which I hope is compatible with Alembic) is that migrations are the elements of a huge square matrix where the rows and columns represent full-schema hashes (a combinatorially large set). I am envisioning a sparse matrix: we could create (and label, and store) a migration for any combination (manually if necessary), but would in practice only create the much, much smaller fraction of migrations that people actually need.
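
            A minimal sketch of that sparse matrix, with hypothetical names and made-up digest values (`register_migration` and `find_migration` are illustrations, not an Alembic or daf_butler API):

```python
# Sparse "migration matrix": rows and columns are full-schema digests, and
# only the cells someone actually wrote a migration for are present.
migrations = {}  # (from_digest, to_digest) -> migration script label


def register_migration(from_digest, to_digest, label):
    """Record that a migration script exists between two schema digests."""
    migrations[(from_digest, to_digest)] = label


def find_migration(from_digest, to_digest):
    """Return the migration label for this pair, or None if absent.

    Most cells of the combinatorially large matrix are simply missing.
    """
    return migrations.get((from_digest, to_digest))


register_migration("a3f2", "b7c1", "add_ingest_date_column")
```

            Looking up an unregistered pair (including the reverse direction) just returns None, which mirrors "we only create the migrations people actually need".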

            My thinking now is that any non-linear migration path for a schema is a significant burden to manage; adding multiple dimensions (not Dimensions dimensions, but extension versions as dimensions) to the picture is probably going to make it incomprehensible.

            I am somewhat willing to punt on this question right now, because I think it does depend more on operational workflows and e.g. how many different manager subclasses we actually ever have active for a particular slot. In other words I'm not so much concerned with whether we guarantee a migration path between any possible pair of schemas, but only with whether we can rigorously label and save such a migration if someone made one, and have some hope that we would be able to use migrations that exist to help create larger migration paths.

            Andy Salnikov added a comment -

            A few thoughts of my own.

            Having a linear history is a huge bonus. Look at git, for example: even though it has a lot of nice tools to manage branching and merging, we still prefer a linear history in our own development, where all changes merge to master and branches are destroyed after the merge. Tools for managing schema history are, of course, not as advanced as git; it will inevitably lead to a lot of complications if we need branches and merging between branches (and potential dependencies between branches in a forest). I am not saying it's impossible to do, but I think it would require an expert to do anything in that complicated setup.

            If we do (A) with a separate "version" for each manager (a version will likely include the manager class/implementation name and a version tuple):

            • I think we'll have to provide a migration script between consecutive versions of the same manager class, e.g.: FasterDatasetManager-1.2.0 to FasterDatasetManager-1.2.1; FasterDatasetManager-1.2.1 to FasterDatasetManager-1.2.2; FasterDatasetManager-1.2.2 to FasterDatasetManager-1.3.0; etc. Basically, there is no reason to define a new version of a manager class without also providing a migration from the previous version of the same class. So for the same manager class we are going to have a linear history.
            • If a manager has more than one implementation, then for each "official" release that we give to the world we probably want to support switching between the two (or any two) implementations in the same release. This means that migration should be supported in both directions. I'm not sure I can think about this clearly right now, but it may mean that any such switch could multiply the number of branches in the history tree. Maybe I need to draw some pictures. Or maybe it just does not make sense to talk about versions here (in the alembic sense).
            • Even if we don't have overall version numbers but only hashes in alembic, there is still an implicit ordering of hashes, exactly like in git: each hash has its parent hash, or sometimes more than one parent hash. You cannot migrate between arbitrary hashes unless there is a defined and ordered path between those hashes.
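
            The parent-hash point in the last bullet can be sketched as a small graph walk (the names, digests, and functions here are illustrative, not an actual alembic or daf_butler structure): a migration between two digests is possible only when registered single-step migrations form an ordered path between them.

```python
from collections import deque

# Single-step migrations that actually exist: child digest -> parent digest(s).
# "v3-nicer" shows how switching manager implementations branches the history.
parents = {
    "v2": ["v1"],
    "v3": ["v2"],
    "v3-nicer": ["v3"],
}


def migration_path(src, dst):
    """Walk parent links back from dst; return the src -> dst path, or None."""
    queue, seen = deque([[dst]]), {dst}
    while queue:
        path = queue.popleft()
        if path[-1] == src:
            return path[::-1]  # reverse so the path runs src -> dst
        for parent in parents.get(path[-1], []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None
```

            With only "upgrade" edges registered, a path exists from v1 to v3 but not from v3 back to v1, which is exactly the "defined and ordered path" constraint.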
            Andy Salnikov added a comment -

            Before I merge my changes, I decided to disable saving of versions for the moment: we still do not have a clear model for what versions are going to be, so having them in the database now is not useful. I started a new Jenkins build and will merge when it finishes.

            Jim Bosch added a comment -

            +1 to all of those ideas about trying to keep history as linear as possible. I do think the git analogy is a good one from both angles: it is very important to have a tool that can in principle handle arbitrary branches and merges, but for the human(s) using them (and responsible for resolving merge conflicts or writing nontrivial migration scripts) it is very important to have conventions that minimize that branching and merging.

            Andy Salnikov added a comment -

            Jenkins passed and I have merged the PR.

            Thanks for the useful discussion and lots of hints! Closing this ticket; much more work is left for other tickets.


              People

              Assignee:
              Andy Salnikov
              Reporter:
              Andy Salnikov
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch, Tim Jenness
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.