I'm happy to close this ticket now (I'll go make some comments on GitHub after this, so there may be some minor things worth waiting for there). I'll continue the big-picture conversation here in the meantime.
I think we might mean different things sometimes when we talk about schema versions (I may not even be consistent myself all the time). My mental model is that there are many human-meaningful identifiers (class names, version tuples, etc.) that together fully determine the static schema. There is no human-meaningful identifier for the full (static) schema - all we can ever have is a hash or other digest of all of the human-meaningful identifiers. That hash doesn't imply any kind of version ordering, and I figured that would be okay because Alembic also uses hashes to identify distinct schemas.
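To make that mental model concrete, here is a minimal sketch of digesting the human-meaningful identifiers into a single full-schema hash. The class names, version tuples, and function name are all hypothetical illustrations, not anything that exists in the codebase:

```python
import hashlib

def schema_digest(components: dict[str, tuple[int, int, int]]) -> str:
    """Combine human-meaningful identifiers (class names and their
    version tuples) into one digest for the full static schema.

    Hypothetical sketch; names and version scheme are illustrative.
    """
    message = hashlib.sha256()
    # Sort by class name so the digest is independent of insertion order.
    for class_name, version in sorted(components.items()):
        message.update(class_name.encode())
        message.update(".".join(str(v) for v in version).encode())
    return message.hexdigest()

# Two different manager choices yield distinct digests, with no
# implied ordering between them.
digest_a = schema_digest({"FasterDatasetManager": (72, 3, 1)})
digest_b = schema_digest({"NicerDatasetManager": (1, 5, 3)})
assert digest_a != digest_b
```

The sort before hashing is the important detail: it makes the digest a function of the set of identifiers, not of the order in which components happen to be registered.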
I was thinking about case (A), but I think it complicates things a lot. Suppose we have two implementations of the same extension (e.g. FasterDatasetManager and NicerDatasetManager) with different schemas, and we want to be able to switch between them. The questions I do not know how to answer:
- how do we allocate version numbers for those incompatible things that exist at the same time, when each of them evolves with potential schema updates in the future?
I was thinking that under (A), each class gets its own totally independent version number, so you could be switching from FasterDatasetManager 72.3.1 to NicerDatasetManager 1.5.3. There wouldn't be any expectation that those numbers would be comparable, or that they would totally control any part of the schema. All we require is that we can hash them with the class names to get a hash for the full schema.
- how do people who want to change their config from one extension to another know the schema version of that other extension? I think it means the schema/digest should be defined somewhere in the code instead of in the config file.
Yes, under this approach each class's version would be fixed in the code that defines that class, and only the class name would appear in the config.
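A minimal sketch of that arrangement, with hypothetical class names and a hypothetical config layout: each class pins its own version where it is defined, and the config carries only the class name.

```python
# Hypothetical sketch: each manager class fixes its own schema version
# in the code that defines it; configuration never mentions versions.
class FasterDatasetManager:
    VERSION = (72, 3, 1)

class NicerDatasetManager:
    VERSION = (1, 5, 3)

# A config file would carry only the class name, e.g.:
config = {"datasets": "NicerDatasetManager"}

# Resolving the configured name recovers the version from code.
_known = {cls.__name__: cls for cls in (FasterDatasetManager, NicerDatasetManager)}
manager_cls = _known[config["datasets"]]
assert manager_cls.VERSION == (1, 5, 3)
```

Switching extensions is then just editing the class name in the config; the version (and hence the full-schema digest) follows automatically from whatever code is installed.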
- how do we identify that mess to a schema migration tool like Alembic? Do we want to support migration from any combination of versions to any other combination? Probably not; that would be too hard to manage.
My mental model here (which I hope is compatible with Alembic) is that migrations are the elements of a huge square matrix whose rows and columns represent full-schema hashes (combinatorially large). I am envisioning a sparse matrix, where we could create (and label, and store) a migration for any combination (manually if necessary), but would in practice only create the much, much smaller fraction of migrations that people actually need.
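The sparse-matrix picture can be sketched as a mapping keyed by (from, to) pairs of full-schema digests, where only the combinations someone actually wrote are present. The digest strings and script names below are placeholders, not real migrations:

```python
# Hypothetical sketch of the sparse migration matrix: cells are keyed by
# (from_digest, to_digest) pairs of full-schema hashes, and almost all
# cells of the combinatorially large matrix are simply absent.
migrations: dict[tuple[str, str], str] = {}

def register_migration(from_digest: str, to_digest: str, script: str) -> None:
    """Label and store a migration for one specific schema pair."""
    migrations[(from_digest, to_digest)] = script

# Placeholder digests standing in for real full-schema hashes.
register_migration("a1b2c3", "d4e5f6", "upgrade_datasets.py")

# The reverse direction was never written, so that cell is empty.
assert ("d4e5f6", "a1b2c3") not in migrations
```

Looking up a migration is then an exact-match query on the pair, with no version ordering needed anywhere.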
My thinking now is that any non-linear migration path for the schema is a significant burden to manage; adding multiple dimensions (not Dimensions dimensions, but extension versions as dimensions) to the picture is probably going to make it incomprehensible.
I am somewhat willing to punt on this question right now, because I think it does depend more on operational workflows and e.g. how many different manager subclasses we actually ever have active for a particular slot. In other words I'm not so much concerned with whether we guarantee a migration path between any possible pair of schemas, but only with whether we can rigorously label and save such a migration if someone made one, and have some hope that we would be able to use migrations that exist to help create larger migration paths.