Fix Version/s: None
Team:Data Access and Database
As discussed in DM-24803, we need a way to manage schema versions defined in daf_butler. This needs some additional decisions though:
- do we want one global schema version or per-component versions?
- datastore seems to be a special case
- dimensions is also very special
- how do we detect a schema change?
A few thoughts of my own.
Having a linear history is a huge bonus. Look at git, for example: it has a lot of nice tools to manage branching and merging, yet we still prefer linear history in our own development, where all changes merge to master and branches are destroyed after the merge. Tools for managing schema history are of course not as advanced as git; it will inevitably lead to a lot of complications if we need branches and merging between branches (and potential dependencies between branches in a forest). I am not saying it's impossible, but I think it would take an expert to do anything in such a complicated setup.
If we do (A) with a separate "version" for each manager (version will likely include manager class/implementation name and version tuple):
- I think we'll have to provide a migration script between consecutive versions of the same manager class, e.g.: FasterDatasetManager-1.2.0 to FasterDatasetManager-1.2.1; FasterDatasetManager-1.2.1 to FasterDatasetManager-1.2.2; FasterDatasetManager-1.2.2 to FasterDatasetManager-1.3.0; etc. Basically there is no reason to define a new version of a manager class without also providing a migration from the previous version of the same class. So for the same manager class we are going to have a linear history.
- If a manager has more than one implementation, then for each "official" release that we give to the world we probably want to support switching between any two implementations in the same release. This means that migration should be supported in both directions. I'm not sure I can think about this clearly right now, but it may mean that any such switch could multiply the number of branches in the history tree. Maybe I need to draw some pictures. Or maybe it just does not make sense to talk about versions here (in the Alembic sense).
- Even if we don't have overall version numbers but only hashes in Alembic, there is still an implicit ordering of hashes, exactly like in git: each hash has its parent hash, or sometimes more than one parent hash. You cannot migrate between arbitrary hashes unless there is a defined and ordered path between those hashes.
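The linear per-class history described above can be sketched in a few lines. This is an illustrative assumption, not daf_butler code: the `MIGRATIONS` table and `migration_path` helper are hypothetical, and only the `FasterDatasetManager` version numbers come from the discussion.

```python
# Hypothetical sketch: each manager class keeps a linear migration chain,
# i.e. every new version ships with a migration from the immediately
# preceding version of the same class.
MIGRATIONS = {
    # (manager class, from-version) -> to-version
    ("FasterDatasetManager", (1, 2, 0)): (1, 2, 1),
    ("FasterDatasetManager", (1, 2, 1)): (1, 2, 2),
    ("FasterDatasetManager", (1, 2, 2)): (1, 3, 0),
}

def migration_path(manager, current, target):
    """Follow the linear chain from ``current`` to ``target``."""
    path = [current]
    while path[-1] != target:
        step = MIGRATIONS.get((manager, path[-1]))
        if step is None:
            raise LookupError(f"no migration from {manager}-{path[-1]}")
        path.append(step)
    return path

print(migration_path("FasterDatasetManager", (1, 2, 0), (1, 3, 0)))
# [(1, 2, 0), (1, 2, 1), (1, 2, 2), (1, 3, 0)]
```

Because the chain is linear, finding the upgrade path is just walking forward; no graph search or merge resolution is needed, which is exactly the simplification being argued for.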
Before I merge my changes I decided to disable saving of versions for the moment, as we still do not have a clear model for what versions are going to be, so having them in the database now is not useful. I started a new Jenkins build and will merge when it finishes.
+1 to all of those ideas about trying to keep the history as linear as possible. I do think the git analogy is a good one from both angles: it is very important to have a tool that can in principle handle arbitrary branches and merges, but for the humans using it (and responsible for resolving merge conflicts or writing nontrivial migration scripts) it is very important to have conventions that minimize that branching and merging.
Jenkins passed and I have merged the PR.
Thanks for the useful discussion and lots of hints! Closing this ticket; much more work is left for other tickets.
I'm happy to close this ticket now (I'll go make some comments on GitHub after this, so there may be some minor things worth waiting for there). I'll continue the big-picture conversation here in the meantime.
I think we might mean different things sometimes when we talk about schema versions (I may not even be consistent myself all the time). My mental model is that there are many human-meaningful identifiers (class names, version tuples, etc.) that together fully determine the static schema. There is no human-meaningful identifier for the full (static) schema - all we can ever have is a hash or other digest of all of the human-meaningful identifiers. That hash doesn't imply any kind of version ordering, and I figured that would be okay because Alembic also uses hashes to identify distinct schemas.
I was thinking that under (A), each class gets its own totally independent version number, so you could be switching from FasterDatasetManager 72.3.1 to NicerDatasetManager 1.5.3. There wouldn't be any expectation that those numbers would be comparable, or that they would totally control any part of the schema. All we require is that we can hash them with the class names to get a hash for the full schema.
Yes, under this approach each class's version would be fixed in the code that defines that class, and only the class name would appear in the config.
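The idea that the human-meaningful identifiers together determine the schema, while the full-schema identifier is only a digest, could look something like the following. This is a minimal sketch under stated assumptions: `schema_digest`, the slot names, and `StaticDimensionManager` are made up for illustration; only the two dataset-manager names and versions come from the comment above.

```python
import hashlib

def schema_digest(managers):
    """Hash the sorted (slot, class, version) triples into one hex digest.

    The digest identifies a full static schema but, unlike a version
    number, implies no ordering between schemas.
    """
    h = hashlib.sha256()
    for slot, (cls, version) in sorted(managers.items()):
        h.update(f"{slot}={cls}-{'.'.join(map(str, version))};".encode())
    return h.hexdigest()

# Hypothetical example: switching dataset managers changes the digest.
a = schema_digest({
    "datasets": ("FasterDatasetManager", (72, 3, 1)),
    "dimensions": ("StaticDimensionManager", (1, 0, 0)),
})
b = schema_digest({
    "datasets": ("NicerDatasetManager", (1, 5, 3)),
    "dimensions": ("StaticDimensionManager", (1, 0, 0)),
})
print(a != b)  # True: distinct schemas get distinct digests
```

Note the per-class version numbers (72.3.1 vs. 1.5.3) are never compared with each other; they only feed the hash, which matches the "no expectation that those numbers would be comparable" point.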
My mental model here (which I hope is compatible with Alembic) is that migrations are the elements of a huge square matrix where the rows and columns represent full-schema hashes (combinatorially large). I am envisioning a sparse matrix, where we could create (and label, and store) a migration for any combination (manually if necessary), but would in practice only create the much, much smaller fraction of migrations that people actually need.
I am somewhat willing to punt on this question right now, because I think it does depend more on operational workflows and e.g. how many different manager subclasses we ever actually have active for a particular slot. In other words, I'm not so much concerned with whether we guarantee a migration path between any possible pair of schemas, but only with whether we can rigorously label and save such a migration if someone made one, and have some hope that we would be able to use existing migrations to help create larger migration paths.
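The sparse-matrix picture, together with the hope of composing stored migrations into longer paths, can be sketched as a graph search over whatever migrations happen to exist. Everything here is a placeholder: the short hashes, the edge labels, and `find_path` are illustrative assumptions, not Alembic's actual revision machinery.

```python
from collections import deque

# Sparse "matrix" of migrations: only the entries someone actually wrote
# exist, keyed by (from-hash, to-hash) of the full-schema digests.
MIGRATIONS = {
    ("a1f3", "9c0e"): "upgrade datasets manager",
    ("9c0e", "77b2"): "switch datastore implementation",
    ("77b2", "9c0e"): "switch datastore back",
}

def find_path(src, dst):
    """Breadth-first search composing stored migrations into a path.

    Returns the list of (from, to) migration keys to apply in order,
    or None if no path exists among the migrations we have.
    """
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for frm, to in MIGRATIONS:
            if frm == node and to not in seen:
                seen.add(to)
                queue.append((to, path + [(frm, to)]))
    return None

print(find_path("a1f3", "77b2"))
# [('a1f3', '9c0e'), ('9c0e', '77b2')]
```

This captures both halves of the comment: migrations are labeled and saved individually (the dictionary entries), and longer paths are assembled from them only when a chain actually exists; `find_path` returning None is the honest answer when it does not.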