Thinking about what that cleanup script would need to do, how to implement it, and how it might affect the schema of the obscore table.
The current obscore manager implementation only supports the following registry operations:
- adding datasets to a monitored RUN collection (or a RUN collection that is part of a monitored CHAINED collection) via Registry.insertDatasets() or Registry._importDatasets()
- associating/certifying datasets with a monitored TAGGED/CALIBRATION collection (or one that is part of a monitored CHAINED collection) via Registry.associate() and Registry.certify()
- a dataset can appear only once in a live obscore table (its UUID is the primary key of that table)
- if a dataset's RUN collection is also monitored, the dataset is added to obscore naturally when it is inserted into that RUN collection; otherwise it is added when it is associated/certified
- if a dataset is associated/certified multiple times, with different validity timespans or in different collections, it still appears in obscore only once (obscore has no validity ranges or collection names to distinguish them anyway)
- the dataset UUID is also a foreign key into the dataset table, so when a dataset is removed from the registry (individually or with its whole RUN collection) it also disappears from obscore
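The last two points can be sketched with a toy sqlite schema (table and column names are hypothetical and much simpler than the real registry schema): the UUID primary key deduplicates repeated association/certification inserts, and the foreign key with ON DELETE CASCADE makes obscore rows disappear with their datasets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite needs this for FK enforcement

# Hypothetical, heavily simplified stand-ins for the real registry tables.
conn.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE obscore (
        dataset_id TEXT PRIMARY KEY
            REFERENCES dataset (id) ON DELETE CASCADE
    )
""")

conn.execute("INSERT INTO dataset VALUES ('uuid-1')")
# Repeated association/certification only inserts the UUID once.
conn.execute("INSERT OR IGNORE INTO obscore VALUES ('uuid-1')")
conn.execute("INSERT OR IGNORE INTO obscore VALUES ('uuid-1')")
n_after_inserts = conn.execute("SELECT COUNT(*) FROM obscore").fetchone()[0]
print(n_after_inserts)  # -> 1

# Removing the dataset from the registry cascades into obscore.
conn.execute("DELETE FROM dataset WHERE id = 'uuid-1'")
n_after_delete = conn.execute("SELECT COUNT(*) FROM obscore").fetchone()[0]
print(n_after_delete)  # -> 0
```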
What is not covered by the cases above, and what should be handled by the cleanup script:
- disassociating/decertifying datasets, either individually or due to removal of a whole TAGGED/CALIBRATION collection, which should in some cases result in removal from obscore
- datasets that come from a monitored RUN collection or belong to another monitored TAGGED/CALIBRATION collection must of course be kept in obscore
- changes in the composition of monitored CHAINED collections:
  - if a non-empty RUN collection is added to a CHAINED collection, all matching datasets need to be added to obscore; the same applies to TAGGED/CALIBRATION collections
  - removal of a RUN collection from a monitored CHAINED collection should remove its datasets, except when they are associated with monitored TAGGED/CALIBRATION collections
  - removal of a TAGGED/CALIBRATION collection removes its datasets, except when they belong to a monitored RUN collection or another monitored TAGGED/CALIBRATION collection
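All the removal cases above reduce to one rule: a dataset stays in obscore if and only if it still belongs to at least one monitored collection. A toy pure-Python sketch of that decision (the function and its inputs are hypothetical, not existing APIs):

```python
def datasets_to_remove(removed, memberships, monitored):
    """Of `removed` (UUIDs just disassociated/decertified), return those
    with no remaining membership in any monitored collection.

    memberships: dict mapping dataset UUID -> set of collection names
                 the dataset still belongs to
    monitored:   set of monitored collection names
    """
    return {
        uuid for uuid in removed
        if not (memberships.get(uuid, set()) & monitored)
    }


memberships = {
    "uuid-1": {"run/a"},         # still in a monitored RUN collection
    "uuid-2": {"tagged/other"},  # only in an unmonitored collection
}
monitored = {"run/a", "tagged/x"}
result = datasets_to_remove({"uuid-1", "uuid-2"}, memberships, monitored)
print(result)  # -> {'uuid-2'}
```

The hard part, as discussed below, is that building `memberships` for many datasets on the Python side is exactly what may not scale.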
Note that changes in the list of monitored collections that happen due to a configuration change are supposed to be handled by the schema migration script, though there is definitely an overlap: a change in CHAINED collection composition causes similar "migrations".
Doing those removals/additions after the fact in the cleanup script is an interesting problem. Checking those conditions (whether a dataset belongs to some other monitored collection) will be non-trivial, and I'm afraid it's not going to scale if we try to do it all on the Python side. What could help is a way to quickly determine which datasets are NOT in any monitored collection. I think something like that would be possible if obscore could have a foreign key to the dataset association/certification rows (and use CASCADE to remove things that disappear). That is not possible in the current schema, where we have multiple tags/calibs tables, so the foreign key cannot be defined (unless I repeat the same table structure used by tags/calibs). I need to think whether there is anything at all that can help us here.
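One way to push the "not in any monitored collection" check into the database is an anti-join: delete obscore rows that have no surviving membership row in a monitored collection. A toy sqlite sketch with a single simplified membership table (the real schema splits membership across per-dataset-type tags/calibs tables, which is precisely what makes the foreign-key approach hard):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE obscore (dataset_id TEXT PRIMARY KEY);
    -- Hypothetical single stand-in for run/tags/calibs membership tables.
    CREATE TABLE membership (dataset_id TEXT, collection TEXT);
    CREATE TABLE monitored (collection TEXT PRIMARY KEY);
""")
conn.executemany("INSERT INTO obscore VALUES (?)", [("u1",), ("u2",)])
conn.executemany("INSERT INTO membership VALUES (?, ?)",
                 [("u1", "run/a"), ("u2", "tagged/other")])
conn.execute("INSERT INTO monitored VALUES ('run/a')")

# Anti-join: drop obscore rows with no membership in a monitored collection.
conn.execute("""
    DELETE FROM obscore WHERE NOT EXISTS (
        SELECT 1 FROM membership m
        JOIN monitored mon ON mon.collection = m.collection
        WHERE m.dataset_id = obscore.dataset_id
    )
""")
remaining = [r[0] for r in conn.execute("SELECT dataset_id FROM obscore")]
print(remaining)  # -> ['u1']
```

With multiple real membership tables, the single NOT EXISTS would become one per table, which the database can still evaluate without pulling memberships into Python.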
One issue with the new obscore manager being in a separate package is that anything that uses butler or registry with an obscore-configured repo would need to set up that extra package. Running any butler command would then need e.g. dax_obscore, which effectively makes dax_obscore a dependency of daf_butler (a circular one, too). I guess this means the manager has to be in daf_butler, unless we want to play some dirty tricks for specific repositories that have obscore data.
I can't get rid of the thought that implementing everything as a separate middleware HTTP service (on top of our middleware) would solve all our numerous problems.