  DM-21849 (Data Management)

Make runs a type of collection

    Details

      Description

      See https://confluence.lsstcorp.org/display/DM/Dataset+and+Collection+Table+Reorganization; this should move the schema from something like "Partition datasets (only)" to the final design on that page.

      Superseded by https://confluence.lsstcorp.org/display/DM/Architectural+Prototype+for+the+New+Gen3+Registry (which is not that different). The goal of this ticket is now to make runs a type of collection; this involves:

      • requiring registration for new tagged collections as well as runs;
      • giving the (recursive) tagged collection associated with a processing run a different name than the run-collection. This will probably be the responsibility of pipetask rather than Butler.

      This will add the general framework that will let us add more types of collections later (calibration collections and maybe virtual collections), but we won't actually do that here.
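      To make the registration requirement concrete, here is a minimal in-memory sketch. It is not the daf_butler API: `ToyRegistry`, its methods, and the two-member `CollectionType` enum are hypothetical names invented for illustration, and the real design is the one on the linked prototype page. The only behaviour it shows is that runs and tagged collections alike must be registered, with a type, before datasets can be added to them.

```python
from enum import Enum, auto


class CollectionType(Enum):
    """Toy stand-in for the kinds of collection discussed in this ticket."""
    RUN = auto()
    TAGGED = auto()


class ToyRegistry:
    """Hypothetical in-memory registry; not the daf_butler API."""

    def __init__(self):
        self._collections = {}  # collection name -> CollectionType
        self._members = {}      # collection name -> set of dataset ids

    def register_collection(self, name, type_):
        # Re-registering with the same type is a no-op; a type clash is an error.
        existing = self._collections.setdefault(name, type_)
        if existing is not type_:
            raise ValueError(f"{name!r} is already registered as {existing.name}")
        self._members.setdefault(name, set())

    def add_dataset(self, collection, dataset_id):
        # Runs and tagged collections alike must be registered before use.
        if collection not in self._collections:
            raise KeyError(f"collection {collection!r} is not registered")
        self._members[collection].add(dataset_id)

    def datasets(self, collection):
        return frozenset(self._members[collection])
```

      In this sketch a processing run and its associated tagged collection would be registered under two distinct names (say `"run/1"` and `"tagged/best"`, both made up here), which mirrors the second bullet above.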

            Activity

            jbosch Jim Bosch added a comment -

            Note to self: I've put a WIP commit on tickets/DM-19617 that may be relevant for this ticket, if we want to do them both on the same branch to avoid two API changes in close succession.

            jbosch Jim Bosch added a comment -

            Tim Jenness, this isn't quite done (more on that below), but the vast majority of it is ready for review, and as requested I'll try to describe below what's left to do well enough that someone else could take it over (though I think reviewing what's in daf_butler at least should probably happen first, and that will take a while on its own).

            There are four intertwined things happening in daf_butler here:

            • Refactor the implementation of collections in the Registry to the Manager+Records objects model described on the prototype page.
            • Fundamentally change what a "run" is, as described on RFC-663 (also mentioned on the prototype page, but RFC-663 is newer and fleshes out some details).
            • Add a new CHAINED collection type, as described on RFC-663.
            • Make it so Butler and Registry searches can handle multiple collections at once, moving that logic out of ctrl_mpexec/pipe_base (work originally planned for DM-19617).
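            As a rough illustration of the last two items, the sketch below shows one way an ordered, multi-collection search with chained collections could behave. It is an assumption-laden toy, not the Registry implementation: `flatten`, `find_dataset`, and the plain-dict data model (`chains`, `contents`) are hypothetical, and RFC-663 remains the authoritative description. A chained collection is modelled as an ordered list of child collection names, and a search returns the first match in that order.

```python
def flatten(name, chains, _seen=None):
    """Expand a collection name into an ordered list of non-chained collections.

    `chains` maps a chained collection's name to its ordered list of children.
    Names already visited are skipped, which removes duplicates and breaks cycles.
    """
    seen = set() if _seen is None else _seen
    if name in seen:
        return []
    seen.add(name)
    if name not in chains:
        return [name]
    result = []
    for child in chains[name]:
        result.extend(flatten(child, chains, seen))
    return result


def find_dataset(key, collections, chains, contents):
    """Return (collection, dataset) for the first collection holding `key`.

    `collections` is searched in order; `contents` maps each non-chained
    collection name to a dict of its datasets. Returns None if nothing matches.
    """
    seen = set()
    for name in collections:
        for leaf in flatten(name, chains, seen):
            if key in contents.get(leaf, {}):
                return leaf, contents[leaf][key]
    return None
```

            With the search logic in one place like this, callers only pass an ordered list of collection names, which is the kind of simplification the fourth item is after for ctrl_mpexec and pipe_base.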

            From the perspective of daf_butler alone, each of these could have been done on a separate ticket, but each represents a different disruptive change to downstream packages (especially ctrl_mpexec), and I didn't want to go through four of those. So while the commits in daf_butler are somewhat split into separate ranges (see PR), the changes to downstream packages were made to reflect only the final state of daf_butler. Those changes are trivial in most packages, but the changes in obs_base (gen2to3) and ctrl_mpexec (changes to command-line arguments involving collections, as per RFC-663) are not, though those are still < 10% of the daf_butler changes.

            The final status, by package:

            • daf_butler: probably done enough to merge once other packages are working with it.  Ideally we'd add some unit tests specifically targeting the new code in wildcards.py - that gets a lot of free coverage from usage in higher-level, well-tested code, but it's entirely possible I missed some edge cases.  But that's the kind of thing that wouldn't be terrible to defer to another ticket in order to move this one along.  As noted above, the correspondence between commits and features will be documented on the PR.
            • obs_base: will conflict with John Parejko's DM-22655, but I'm pretty confident I can resolve those conflicts without much trouble when the time comes, given that I know a fair bit about both. The big change here is to always ingest directly into a RUN collection (updating config names to reflect that), and then define a CHAINED collection that better maps the parent-child repo linkage in Gen2, and avoid TAGGED collections entirely. These changes also mean that if DM-22655 lands first, this ticket will probably need an obs_decam branch similar to the obs_subaru one to update some config overrides.
            • ctrl_mpexec: passes tests locally, but not done.  See the message on the last git commit for more, but it mostly needs more tests and some API docs for new classes.  Real testing will require running ci_hsc_gen3, which I have not tried to do, and which I expect to require (trivial) changes to the command-line arguments in the shell script that invokes pipetask.
            • pipe_base: trivial changes to GraphBuilder to reflect the downstream changes in how we pass collections in when building QuantumGraphs.
            • obs_subaru: trivial config changes to adapt to obs_base changes.
            • ci_hsc_gen2: trivial changes to adapt to daf_butler and obs_base changes.
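            For the obs_base item, the idea of a CHAINED collection mirroring Gen2 parent-child repo linkage can be sketched as follows. This is only a guess at the shape of the mapping, not what gen2to3 does: `chain_for_repo`, the `parents` dict, and the `"<repo>/run"` naming scheme are all hypothetical. Each repo gets its own RUN collection, and the chain lists the repo's own run first, followed by those of its ancestors.

```python
def chain_for_repo(repo, parents):
    """Return an ordered list of RUN collection names for a Gen2-style repo.

    `parents` maps a repo name to the ordered list of its parent repos.
    The repo's own run comes first so its datasets shadow the parents',
    and ancestors follow depth-first without repeats.
    """
    order = []
    stack = [repo]
    while stack:
        current = stack.pop()
        run = f"{current}/run"  # hypothetical naming scheme for RUN collections
        if run in order:
            continue  # a shared ancestor is only listed once
        order.append(run)
        # Push parents reversed so the first-listed parent is visited first.
        stack.extend(reversed(parents.get(current, [])))
    return order
```

            Registering the returned list as the children of a single CHAINED collection would then let a search fall through from a rerun to its parents, without any TAGGED collection involved.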

             

            tjenness Tim Jenness added a comment -

            I've reviewed the daf_butler and obs_base changes.

            I've rebased this onto newly-merged DM-22655 and tested it with obs_subaru and obs_decam. I've also fixed ci_hsc_gen3. I will take a look at ci_hsc_gen2.

            salnikov Andy Salnikov added a comment -

            I looked at pipe_base and ctrl_mpexec; they look OK, with some minor comments on the PRs.

            tjenness Tim Jenness added a comment -

            Everything does seem to build fine and all tests pass.


              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch, Tim Jenness