Now that I've gotten my head out of CALIBRATION collections, I'm going to try to write down the thoughts I had on this topic while prototyping, in the hopes that'll help Andy Salnikov and I coordinate as we both make rather big changes to the dimensions system (mine will be over on DM-26692, or subtask tickets thereof). My thoughts are not very well organized, and I'd be happy to have a live chat of some kind about them, but hopefully just writing them down now will help somewhat:
My vision of the end-state for dimensions (beyond just this ticket) is that DimensionUniverse would become mutable, and either merge with DimensionRecordStorageManager somehow (e.g. DimensionUniverse becomes an ABC, with DimensionRecordStorageManager an implementation) or have some kind of Observer pattern with it so DimensionRecordStorageManager could update the database when new dimensions are registered (or other kinds of "additive" changes are requested). For this ticket, I think we only care about saving the dimensions in the database instead of the configs (but keeping them fixed after repo creation), but I'm hoping we could do that in a way that is compatible (i.e. with no future schema changes) with the longer-term vision.
If we had the time, I would definitely start by making DimensionUniverse mutable (and I think we might want to anyway; read on). That's because DimensionUniverse construction is currently very tightly coupled to the config representation, and I think that needs to change in order to also support a database representation (yes, we could serialize the config to a blob column, but right now we can't even write out the config representation from an in-memory DimensionUniverse, and I think a blob-config database representation is a poor fit to the longer-term vision anyway). So a natural way to decouple DimensionUniverse construction from config would be to make DimensionUniverse default-constructable and mutable - we could then have separate from-config and to/from-database serialization code that would both operate on DimensionUniverse methods, delegating to them the harder work of making sure relationships make sense and invariants are preserved.
The problem with that is that DimensionUniverse is very fragile right now; adding a new Dimension or DimensionElement affects many others that do not seem like they should be related at all. That's basically because DimensionUniverse keeps big sorted collections of all elements, and indexes into those collections are used in the DimensionGraph.encode and DimensionGraph.decode methods that are used to save DimensionGraph objects into the database (e.g. in the dataset_type table). So adding a new Dimension totally breaks all existing DimensionGraph objects (and hence all DatasetTypes, etc.) unless it is only appended to the very end of those collections (which is obviously incompatible with maintaining a deterministic+topological sort order).
I think the right way to fix this is to remove DimensionGraph.encode and DimensionGraph.decode, and instead give DimensionRecordStorageManager methods to register and retrieve persistent DimensionGraphs to the database, mapping them to some kind of opaque, unique, persistent key (autoincrement or secure hash of names both seem like they would work, though secure hash is better if we want it to be meaningful across repos/universes). We'd use that key everywhere we use the encoding right now; DatasetRecordStorageManager would need to register a DimensionGraph every time it registers a new DatasetType, for example. I think we could then make the indexes into those sorted collections of dimensions totally private; I think we want to keep the sorted collections themselves to implement DimensionUniverse.sorted.
It's worth noting that DimensionGraph has its own sorted collections of dimensions whose indexes are used, but because adding a new Dimension to the universe doesn't affect those per-DimensionGraph indexes, this isn't a problem. I think it's also nice to have a notion of "registered/in-use" DimensionGraphs from the perspective of knowing whether a requested change to a DimensionUniverse should be allowed or not, even though that's a future concern.
So, all of the above suggests that one approach to this ticket (and the one I think I'd recommend) is:
- Implement registration/retrieval methods for DimensionGraphs on DimensionRecordStorageManager.
- Remove DimensionGraph.encode/decode in favor of those.
- Make DimensionUniverse default-constructable and mutable, and rewrite the DimensionUniverse-from-config code to use those methods. We'd probably want a runtime flag on DimensionUniverse to freeze it after we're done constructing it, for now.
- Add new code to use the same mutator methods to construct a DimensionUniverse from a database representation, as well as code to write the DimensionUniverse to that database representation.
- (On some other ticket, after Nov. 1), remove the runtime-freezing flag on DimensionUniverse and instead connect it to DimensionRecordStorageManager so new dimensions could be registered (etc.) after repo construction.
I like that order in part because I think steps (1) and (2) are less related to changes I want to make to the dimensions system for DM-26692, so I'm hoping I can get those done and we can talk again before you get to (3) and (4).
Is that at all compatible with what you had in mind for this ticket? Is there anything important that I've missed? Is there some other order that would get us the important-for-schema-stability-parts first? At first I thought we might want to start by leaving DimensionUniverse immutable and adding a Builder class for both the config and database serializiaton code to use, and then not have to worry about DimensionGraph's fragile encoding until after Nov. 1. But removing DimensionGraph encoding is a schema change, too, so if we're going to do that at all I don't think we want to put it off.
I have prototyped some of this on
DM-26336(the part about how to serialize the DimensionUniverse to database tables), and it seems quite doable. That got me thinking that we might also want to make DimensionUniverse mutable, and have Registry APIs for adding new dimensions and columns dynamically to a repo (analogous to DatasetType registration); the config file in daf_butler would then just represent the initial/default. I'm writing a Confluence page to document what that prototyping ticket came up with, and will have more details there.