Fix Version/s: None
See https://confluence.lsstcorp.org/display/DM/Dataset+and+Collection+Table+Reorganization; this should move the schema from something like the "Partition datasets (only)" to the final design on that page.
Superseded by https://confluence.lsstcorp.org/display/DM/Architectural+Prototype+for+the+New+Gen3+Registry (which is not that different). Goal of this ticket is now to make runs a type of collection; this involves:
- requiring registration for new tagged collections as well as runs;
- giving the (recursive) tagged collection associated with a processing run a different name than the run-collection. This will probably be the responsibility of pipetask rather than Butler.
This will add the general framework that will let us add more types of collections later (calibration collections and maybe virtual collections), but we won't actually do that here.
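As a rough illustration of "runs are a type of collection, and every collection must be registered", here is a minimal sketch. It is not the daf_butler API: `CollectionType`, `ToyRegistry`, `register_collection`, `get_collection_type` and the collection names are all hypothetical, chosen only to show the shape of the idea.

```python
from enum import Enum, auto


class CollectionType(Enum):
    """Hypothetical collection kinds; further kinds could be added later."""

    RUN = auto()
    TAGGED = auto()


class ToyRegistry:
    """Illustrative stand-in for a registry; not the real daf_butler Registry."""

    def __init__(self):
        self._collections = {}  # collection name -> CollectionType

    def register_collection(self, name, type_):
        # Registering the same name with the same type again is a no-op;
        # registering it with a different type is an error.
        existing = self._collections.setdefault(name, type_)
        if existing is not type_:
            raise ValueError(f"{name!r} is already registered as {existing.name}")

    def get_collection_type(self, name):
        # Unregistered collections are rejected rather than created implicitly.
        try:
            return self._collections[name]
        except KeyError:
            raise LookupError(f"collection {name!r} has not been registered") from None


registry = ToyRegistry()
registry.register_collection("runs/example", CollectionType.RUN)     # hypothetical name
registry.register_collection("tags/example", CollectionType.TAGGED)  # hypothetical name
```

In this sketch a run is just a collection whose type is `RUN`, so supporting another kind of collection later would mean adding an enum member rather than a separate code path.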
DM-21766 Add per-dataset-type tables to Registry
- is contained by
DM-21231 Refactor Registry handling of dataset and associated tables
- is duplicated by
DM-19617 Investigate moving multiple-collection support down to Butler
- is triggered by
RFC-663 Collections in the Gen3 Butler
- is triggering
DM-25016 DM-21724 unpickling error appears again
DM-24414 Implement --prune-replaced option in ctrl_mpexec
- relates to
DM-22163 Add config writing to PipelineTask execution logic
- To Do
Tim Jenness, this isn't quite done (more on that below), but the vast majority of it is ready for review, and as requested I'll try to describe below what's left to do well enough that someone else could take it over (though I think reviewing what's in daf_butler at least should probably happen first, and that will take a while on its own).
There are four intertwined things happening in daf_butler here:
- Refactor the implementation of collections in the Registry to the Manager+Records objects model described on the prototype page.
- Fundamentally change what a "run" is, as described on RFC-663 (also mentioned on the prototype page, but RFC-663 is newer and fleshes out some details).
- Add a new CHAINED collection type, as described on
- Make it so Butler and Registry searches can handle multiple collections at once, moving that logic out of ctrl_mpexec/pipe_base (work originally planned for
From the perspective of daf_butler alone, each of these could have been done on a separate ticket, but they each represent a different disruptive change to downstream packages (especially ctrl_mpexec), and I didn't want to go through four of those. So while the commits in daf_butler are somewhat split into separate ranges (see PR), the changes to downstream packages were made to reflect only the final state of daf_butler. Those changes are trivial in most packages, but the changes in obs_base (gen2to3) and ctrl_mpexec (changes to command-line arguments involving collections, as per RFC-663) are not, though those are still < 10% of the daf_butler changes.
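To make the CHAINED and multi-collection search ideas in the list above concrete, here is a small sketch under stated assumptions. It does not reproduce the daf_butler implementation: the data model (a RUN as a plain dict, a CHAINED collection as an ordered list of child names), the functions `flatten` and `find_first`, and all collection names are hypothetical. It also assumes chains contain no cycles.

```python
from typing import Dict, Iterator, List, Optional, Union

# Assumed toy model: a RUN is {dataset key: value}; a CHAINED collection is
# an ordered list of child collection names.
Collection = Union[Dict[str, str], List[str]]


def flatten(name: str, collections: Dict[str, Collection]) -> Iterator[str]:
    """Yield the non-chained collections reachable from `name`, in search order."""
    entry = collections[name]
    if isinstance(entry, list):
        # CHAINED: expand children recursively, preserving their order.
        for child in entry:
            yield from flatten(child, collections)
    else:
        yield name


def find_first(
    key: str, search: List[str], collections: Dict[str, Collection]
) -> Optional[str]:
    """Return the first match for `key` across several collections, or None."""
    for top in search:
        for name in flatten(top, collections):
            run = collections[name]
            if key in run:
                return run[key]
    return None


collections: Dict[str, Collection] = {
    "run/parent": {"calexp": "calexp@parent", "raw": "raw@parent"},
    "run/child": {"calexp": "calexp@child"},
    # Child first, then parent, so the child shadows the parent.
    "chain/child": ["run/child", "run/parent"],
}
```

With this setup, `find_first("calexp", ["chain/child"], collections)` finds the child's dataset, while `find_first("raw", ["chain/child"], collections)` falls through to the parent. Because the caller passes a list of collections, the ordering logic lives in one place rather than in each downstream package.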
The final status, by package:
- daf_butler: probably done enough to merge once other packages are working with it. Ideally we'd add some unit tests specifically targeting the new code in wildcards.py - that gets a lot of free coverage from usage in higher-level, well-tested code, but it's entirely possible I missed some edge cases. But that's the kind of thing that wouldn't be terrible to defer to another ticket in order to move this one along. As noted above, the correspondence between commits and features will be documented on the PR.
- obs_base: will conflict with John Parejko's DM-22655, but I'm pretty confident I can resolve those conflicts without much trouble when the time comes, given that I know a fair bit about both. The big change here is to always ingest directly into a RUN collection (updating config names to reflect that), then define a CHAINED collection that better maps the parent-child repo linkage in Gen2, and avoid TAGGED collections entirely. These changes also mean that if DM-22655 lands first, this ticket will probably need an obs_decam branch similar to the obs_subaru one to update some config overrides.
- ctrl_mpexec: passes tests locally, but not done. See the message on the last git commit for more, but it mostly needs more tests and some API docs for new classes. Real testing will require running ci_hsc_gen3, which I have not tried to do, and which I expect to require (trivial) changes to the command-line arguments in the shell script that invokes pipetask.
- pipe_base: trivial changes to GraphBuilder to reflect the downstream changes in how we pass collections in when building QuantumGraphs.
- obs_subaru: trivial config changes to adapt to obs_base changes.
- ci_hsc_gen2: trivial changes to adapt to daf_butler and obs_base changes.
I've reviewed the daf_butler and obs_base changes.
I've rebased this onto newly-merged DM-22655 and tested it with obs_subaru and obs_decam. I've also fixed ci_hsc_gen3. I will take a look at ci_hsc_gen2.
I looked at pipe_base and ctrl_mpexec; they look OK, with some minor comments on the PRs.
Note to self: I've put a WIP commit on tickets/DM-19617 that may be relevant for this ticket, if we want to do them both on the same branch to avoid two API changes in close succession.