Fix Version/s: None
Team:Data Access and Database
As part of the plan for supporting schema migrations we would like to investigate the possibility of using Alembic. Christopher Stephens [X] has done a preliminary investigation and reports a positive experience. Now we need to see what is involved in modifying daf_butler to properly support this.
I think adding backwards compatibility and a deprecation period to moving-code-around changes should generally be pretty easy on the code side; it'll just mean we'll have to accept moving more slowly on finalizing those sorts of change than we are right now. I don't think we can safely capture those in any kind of daf_butler-based version, because the whole point is to not define them there, but I think they could be covered by tying migrations to configuration (while making that even more combinatorially explosive).
I believe moving/renaming issue is easy to avoid altogether if we don't store qualified class names in the configuration and instead use some smart factory method to hide actual location of the class. I don't think we need any complicated dynamic plugin-like mechanism, instead a simple static mapping should be sufficient. I guess all extension types will live either in daf_butler or other packages that we control, so it should not be an issue with managing that static map.
More general comments on other complicated issues:
- It is true that configuration (or some part of it) defines schema and we are going to have several components that may evolve more or less independently of each other
- When I think about schema version I really mean database schema definition and not metadata/configuration, metadata is more dynamic in nature and different configuration does not necessary mean different schema. Imagine that we have two different implementations of some component that can work on exactly the same schema, the configuration will be different but we do not need to do any migrations (especially if we want to switch back and forth between two implementations or if you can override configuration at run time without updating metadata).
- To me this means using combined component configuration as a version is not super-useful, and it also hard to tell which configurations are compatible and to what extent. Having one version for schema would be much easier to use (e.g. detect compatibility) and manage. Configuration would have to be consistent with that global version number, it may not be easy to verify compatibility at run time until something blows up if configuration is incorrect (but in general we should expect that migrations keep things consistent).
- We still have somewhat independent components like datastore which could evolve separately. I think some sort of versioning would be needed for that too so for consistency we should probably include it into a global version too or use sort of branching/forest (as a general concept, not necessarily its Alembic implementation). Also the fact that it is very separate today does not guarantee that it will always be like that, we could decide that some dependency may be needed later.
- Dimensions give all sorts of headache when I try to think how it can be migrated. One simplest example I could think of is adding new dimension to the system, and even such trivial change could result in avalanche of changes in schema for our current implementation. I don't think Alembic auto-generation could handle something like that and writing migration by hand would be a nightmare. OTOH I do not know whether we ever need to extend dimensions, maybe all modifications will be limited to adding dimensions metadata? I was also thinking that having dimension names of subset of dimensions configuration could be added to metadata to help with validation (unless we are already doing something like it in current system).
- I'm not sure yet whether I want to explore "forest" versioning where separate components (e.g. core+datastore+dimensions) have their separate versions end evolve independently (with occasional dependencies between trees). If we do not envision frequent schema updates then it may be an overkill, managing single version for everything is certainly easier when updates are rare.
I believe moving/renaming issue is easy to avoid altogether if we don't store qualified class names in the configuration and instead use some smart factory method to hide actual location of the class.
Yes, we very deliberately did not try to store a master list of every possible short name for instrument class and formatter class in daf_butler configuration. This was partly because we were trying to make it so that daf_butler could in theory be usable by other people outside of Rubin, and also because we wanted people to be allowed to experiment without having to make a PR on daf_butler. I suppose a butler subcommand approach could work where each package that defines a formatter also defines a yaml file that butler can read that specifies all the relevant mappings.
I agree with Tim's last comment - instrument names are definitely outside of our control and we need more dynamic way to specify those.
I have created couple of tickets for myself to start working on version support and metadata in general (linked to this ticket). I'm going to close this ticket if there are no objections.
I'm trying to imagine all kinds of use cases that could potentially exist w.r.t. combinations of schema versions and configuration and it looks sort of explosive. Maybe the best way forward is to produce a set of specific examples that we need to support and work from that baseline. I'll try to put together something more specific later on this ticket that we could discuss in detail.
Regariding Tim's use case - I'm not sure how this is going to work if you also want some sort of compatibility with previous release. If you change existing data stored in the database this could break all existing software versions (except the newest one). We should at least tell those poor guys they need to upgrade, which means that version has to be updated.