Fix Version/s: None
Team:Data Access and Database
DM-35850 defined the main ideas for adding obscore table updates to the Registry code. This ticket is for implementing the actual thing; I'll add more comments as I go.
- is triggered by
DM-35850 Collect ideas and requirements for daf_butler obscore implementation
- is triggering
DM-36489 Implement spatial indexing for live obscore table.
- relates to
DM-37275 Establish ObsTAP service for use with a Postgres Butler registry
Thinking about what that cleanup script would need to do, how to implement it, and how it can affect the schema of the obscore table.
The current obscore manager implementation only supports the following registry operations:
- adding datasets to a monitored RUN collection (or a RUN collection that is in a monitored CHAINED collection) via Registry.insertDatasets() or Registry._importDatasets()
- associating/certifying datasets with a monitored TAGGED/CALIBRATION collection (or one that is in a monitored CHAINED collection) via Registry.associate() and Registry.certify()
- a dataset can appear only once in the live obscore table (its UUID is the primary key in that table)
- if the dataset's RUN collection is also monitored, the dataset is added to obscore naturally when it is inserted into that RUN collection; otherwise it is added when it is associated/certified
- if a dataset is associated/certified many times, with different validity timespans or in different collections, it still appears in obscore only once (obscore has no validity ranges or collection names to distinguish them anyway)
- the dataset UUID is also a foreign key into the dataset table, so when a dataset is removed from the registry (individually or with a whole RUN collection) it also disappears from obscore
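The behavior above can be sketched with a toy model (the collection names, the dict standing in for the obscore table, and the two hook functions are all hypothetical; the real manager hooks into the Registry operations listed above):

```python
import uuid

# Hypothetical monitored-collection names; real names come from the
# obscore manager configuration.
MONITORED = {"run/monitored", "tagged/monitored"}

# Stand-in for the live obscore table: the dataset UUID is the primary
# key, so each dataset appears at most once.
obscore: dict[uuid.UUID, str] = {}

def on_insert(run: str, dataset_id: uuid.UUID) -> None:
    # Registry.insertDatasets()/_importDatasets() path: only datasets
    # landing in a monitored RUN collection are added.
    if run in MONITORED:
        obscore.setdefault(dataset_id, run)

def on_associate(collection: str, dataset_id: uuid.UUID) -> None:
    # Registry.associate()/certify() path: repeated associations with
    # different timespans or collections still yield a single row.
    if collection in MONITORED:
        obscore.setdefault(dataset_id, collection)

ds = uuid.uuid4()
on_insert("run/monitored", ds)
on_associate("tagged/monitored", ds)  # already present: no second row
assert len(obscore) == 1
```

A dataset inserted into an unmonitored RUN is picked up later only if it gets associated/certified with a monitored TAGGED/CALIBRATION collection.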
What is not covered by the above cases and what should be handled by the cleanup script:
- disassociating/decertifying datasets, either individually or by removing a whole TAGGED/CALIBRATION collection, which should result in removal from obscore in some cases
- datasets that come from a monitored RUN collection or belong to other monitored TAGGED/CALIBRATION collections should of course be kept in obscore
- changes in the composition of monitored CHAINED collections:
- if a non-empty RUN collection is added to a CHAINED collection, all matching datasets need to be added to obscore; the same applies to TAGGED/CALIBRATION collections
- removal of a RUN collection from a monitored CHAINED collection should remove its datasets, except those associated with monitored TAGGED/CALIBRATION collections
- removal of a TAGGED/CALIBRATION collection removes its datasets, except those that belong to a monitored RUN collection or to other monitored TAGGED/CALIBRATION collections
Note that changes in the list of monitored collections that happen due to a configuration change are supposed to be handled by the schema migration script, though there is definitely an overlap: a change in a CHAINED collection's composition causes similar "migrations".
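All the removal cases above reduce to one invariant: obscore should equal the union of datasets reachable from the monitored collections. A minimal set-based sketch of what the cleanup script would compute (the `reconcile` function and the collection/dataset names are hypothetical):

```python
def reconcile(
    obscore: set[str], monitored: dict[str, set[str]]
) -> tuple[set[str], set[str]]:
    """Return (to_add, to_remove) so that obscore matches the union of
    datasets in all monitored collections."""
    wanted = set().union(*monitored.values()) if monitored else set()
    return wanted - obscore, obscore - wanted

monitored = {
    "run/a": {"d1", "d2"},
    "tagged/t": {"d2", "d3"},  # d2 is also tagged
}
obscore = {"d1", "d2", "d3", "d4"}  # d4's collection is no longer monitored
to_add, to_remove = reconcile(obscore, monitored)
assert to_remove == {"d4"}

# Removing run/a drops d1 but keeps d2, which is still tagged:
del monitored["run/a"]
to_add, to_remove = reconcile(obscore - {"d4"}, monitored)
assert to_remove == {"d1"}
```

Computing that union inside the database rather than in Python is exactly the scaling question raised below.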
Doing those removals/additions after the fact in the cleanup script is an interesting problem. Checking those conditions (whether a dataset belongs to some other monitored collection) will be non-trivial, and I'm afraid it is not going to scale if we try to do it all on the Python side. What could help is a way to quickly determine which datasets are NOT in any monitored collection. Something like that would be possible if we could have a foreign key to the dataset associations/certifications (and use CASCADE to remove things that disappear). That is not possible in the current schema, where we have multiple tags/calibs tables, so the foreign key cannot be defined (unless I repeat the same table structure used by tags/calibs). I need to think whether there is anything at all that can help us here.
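For illustration, here is the CASCADE idea in miniature, using an in-memory SQLite database; the single `membership` table is hypothetical (the real registry has separate tags/calibs tables per dataset type, which is precisely why this foreign key cannot be defined today):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite needs this enabled explicitly
con.executescript(
    """
    CREATE TABLE dataset (id TEXT PRIMARY KEY);
    -- Hypothetical single membership table with a foreign key that
    -- cascades deletes from the dataset table.
    CREATE TABLE membership (
        dataset_id TEXT REFERENCES dataset(id) ON DELETE CASCADE,
        collection TEXT
    );
    """
)
con.execute("INSERT INTO dataset VALUES ('d1')")
con.execute("INSERT INTO membership VALUES ('d1', 'tagged/monitored')")
con.execute("DELETE FROM dataset WHERE id = 'd1'")
remaining = con.execute("SELECT count(*) FROM membership").fetchone()[0]
assert remaining == 0  # the membership row cascaded away
```

With such a key, rows whose last monitored association disappears could be cleaned up by the database itself instead of a Python-side scan.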
Gregory Dubois-Felsmann, I have added a section to DMTN-236 which describes my latest idea on what to do with collections for the new obscore manager: https://dmtn-236.lsst.io/v/DM-35947/index.html#collection-issues-with-client-side-updates
In addition to what we discussed before (multiple run collections), I can implement support for a single tagged collection. That approach has one advantage over the run list (details are in the technote), but it shifts all the hard work to the clients, who will need to associate "worthy" datasets with a separate tagged collection. I'd like to get some input from you and anyone else on whether this could be useful before I start implementing all of that.
Jim Bosch, I think this ticket is ready for a "final" review. There were some big changes since you last looked at it and I did not want to squash commits, but it may be easier to look at the complete diff instead of individual commits. dax_obscore adds a new butler command which updates obscore records with missing regions (raws), which I think is also worth reviewing.
I have not implemented the migration script yet, which would be necessary to start deploying this. We still need to do something for spatial indexing, and we also need to agree on configuration before we can deploy it (changing configuration needs a migration, so I prefer to have a ~stable configuration as a starting point).
One issue with the new obscore manager living in a separate package is that anything that uses a butler or registry with an obscore-configured repo would need to set up that extra package. So running any butler command would require e.g. dax_obscore, which effectively makes dax_obscore a dependency of daf_butler (a circular one, too). I guess this means that the manager has to be in daf_butler, unless we want to play some dirty tricks for specific repositories which have obscore data.
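The dependency problem follows from how a configured manager would be loaded: the repo config names a class by its fully qualified string, so every client importing it must have the package set up. A simplified, hypothetical sketch of such dynamic loading (`load_manager` is not a real daf_butler API; a stdlib class stands in for the manager):

```python
import importlib

def load_manager(fq_name: str) -> type:
    """Load a class from a fully qualified name, the way a registry might
    instantiate a manager named in the repo configuration (hypothetical
    simplification)."""
    module_name, _, class_name = fq_name.rpartition(".")
    # Raises ImportError if the package (e.g. dax_obscore) is not set up
    # in the client environment, which is the circular-dependency problem.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# Any importable class works as a stand-in for an obscore manager class:
cls = load_manager("collections.OrderedDict")
```

So a repo whose config points at a class in dax_obscore forces that package into every client's environment, regardless of whether the client touches obscore.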
I can't get rid of the thought that implementing everything as a separate middleware HTTP service (on top of our middleware) would solve all of our numerous problems.