Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-21877

Create "marker" Butler dataset for PPDB

    Details

      Description

      Discussing the Gen 3 migration of lsst.ap.pipe.ApPipeTask and lsst.verify.tasks.PpdbMetricTask on #dm-science-pipelines, Jim Bosch and I concluded that the cleanest trigger for running a PpdbMetricTask would be to provide a dataset that indicates that the database contains all information for a particular data ID. I believe this can also simplify the problem of providing Butler-like input to PpdbMetricTask.

      Prototype this sytem in a Gen 2 pipeline by doing the following:

      • Create a new dataset type (ppdbConfig?), with visit+ccd granularity (matching the units of work of ApPipeTask).
      • Make ApPipeTask dump its PPDB config at the end of DB-related processing for a particular data ID. Note that while the DB config is fixed, we still should have a dataset for each data ID so that Gen 3 code could know which units of processing are "DB complete".
      • Create a new Task for creating a PPDB from a ppdbConfig, then configure PpdbMetricTask to use it in ap_verify runs.

      These steps should test all aspects of the system except for tracking dependencies between pipeline and metric tasks, which is a Gen 3-only feature.

        Attachments

          Issue Links

            Activity

            Hide
            krzys Krzysztof Findeisen added a comment -

            This ticket should also review the dimensionality of lsst.verify.tasks.PpdbMetricConnections. The only extant concrete PpdbMetricTask, lsst.ap.association.metrics.TotalUnassociatedDiaObjectsMetricTask, runs at collection granularity and produces a collection-grained measurement, but finer-grained database metrics are possible. Adding fine-grained inputs, as proposed above, is a good chance to make sure the outputs do everything we want.

            Show
            krzys Krzysztof Findeisen added a comment - This ticket should also review the dimensionality of lsst.verify.tasks.PpdbMetricConnections . The only extant concrete PpdbMetricTask , lsst.ap.association.metrics.TotalUnassociatedDiaObjectsMetricTask , runs at collection granularity and produces a collection-grained measurement, but finer-grained database metrics are possible. Adding fine-grained inputs, as proposed above, is a good chance to make sure the outputs do everything we want.
            Hide
            krzys Krzysztof Findeisen added a comment -

            Jim Bosch, in the Slack discussion you said:

            I think Gen2 and Gen3 already both support get/put for nested dict <-> yaml, which seems like it'd be a good choice for that.

            If in the Gen 2 datasets.yaml I put

            persistable: Config
            storage: ConfigStorage
            

            will that use YAML or otherwise be forward-compatible with Gen 3? I'm a bit confused over whether or not we've migrated away from Python code as our config format.

            Show
            krzys Krzysztof Findeisen added a comment - Jim Bosch , in the Slack discussion you said: I think Gen2 and Gen3 already both support get / put for nested dict <-> yaml, which seems like it'd be a good choice for that. If in the Gen 2 datasets.yaml I put persistable: Config storage: ConfigStorage will that use YAML or otherwise be forward-compatible with Gen 3? I'm a bit confused over whether or not we've migrated away from Python code as our config format.
            Hide
            jbosch Jim Bosch added a comment -

            pex_config is still around, and is definitely still Python code rather than yaml.  And I'm pretty sure those Gen2 declarations correspond to pex_config.

            But we have started moving towards YAML for things other than pipeline/task configuration (starting with the Gen2 butler's policy files of the sort you're editing).  I assume the metadata datasets would provide an example of how to configure a YAML dataset in Gen2.  Gen3 configures file formats quite differently, so no need to worry about forward compatibility.

            Show
            jbosch Jim Bosch added a comment - pex_config is still around, and is definitely still Python code rather than yaml.  And I'm pretty sure those Gen2 declarations correspond to pex_config. But we have started moving towards YAML for things other than pipeline/task configuration (starting with the Gen2 butler's policy files of the sort you're editing).  I assume the metadata datasets would provide an example of how to configure a YAML dataset in Gen2.  Gen3 configures file formats quite differently, so no need to worry about forward compatibility.
            Hide
            krzys Krzysztof Findeisen added a comment - - edited

            Hi Gabor Kovacs, would you be willing to review this work? It's about 240 lines, most of them in verify and ap_pipe.

            Show
            krzys Krzysztof Findeisen added a comment - - edited Hi Gabor Kovacs , would you be willing to review this work? It's about 240 lines, most of them in verify and ap_pipe .
            Hide
            gkovacs Gabor Kovacs added a comment - - edited

            The obs_* packages look good. "apdb_marker" dataset definitions are just for Gen2 compatibility, right?

             

            /verify_DM-21877/tests/test_directPpdbLoader.py: 47: _dummyPpdbConfig -> static method; Do we usually call static methods via an instance self._dummyPpdbConfig() ?


            Ouch.... Perhaps due to my sluggish response to the review request, it seems that concurrent changes have happened here by the merging of DM-22039.
            Would you please rebase it to master? I think there will be places where a 3-way diff/merge is needed among the "common base", "this (DM-21877)" and "master".

            There are a few lines with concurrent changes and a few files that was changed here in DM-21877 while was renamed in the meantime (and also changed) in DM-22039.

            I spotted the following:

            • obs_base/policy/datasets.yaml -> now Apdb
            • ap_pipe.py: lines 212; 300-310 -> manual merge needed
            • ap_verify_DM-21877/config/default_dataset_metrics.py -> manual merge needed
            • verify_DM-21877/doc/lsst.verify/tasks/lsst.verify.tasks.PpdbMetricTask.rst : in the meantime this file was removed from master. Apparently moved to ApdbMetricTask.rst -> please check contents; e.g. changes in line 20-21
            • verify_DM-21877/python/lsst/verify/tasks/ppdbMetricTask.py Apparently moved to apdbMetricTask.py -> please check contents; e.g. changes in line 22 and changes in PpdbMetricConnections/ApdbMetricConnections.
            • /ssd/gkovacs/devel/active/verify_DM-21877/doc/lsst.verify/tasks/lsst.verify.tasks.DirectPpdbLoader.rst -> needs updating to the "Apdb" terminology

             

             

             

            Show
            gkovacs Gabor Kovacs added a comment - - edited The obs_* packages look good. "apdb_marker" dataset definitions are just for Gen2 compatibility, right?   /verify_ DM-21877 /tests/test_directPpdbLoader.py: 47: _dummyPpdbConfig -> static method; Do we usually call static methods via an instance self._dummyPpdbConfig() ? Ouch.... Perhaps due to my sluggish response to the review request, it seems that concurrent changes have happened here by the merging of DM-22039 . Would you please rebase it to master? I think there will be places where a 3-way diff/merge is needed among the "common base", "this ( DM-21877 )" and "master". There are a few lines with concurrent changes and a few files that was changed here in DM-21877 while was renamed in the meantime (and also changed) in DM-22039 . I spotted the following: obs_base/policy/datasets.yaml -> now Apdb ap_pipe.py: lines 212; 300-310 -> manual merge needed ap_verify_ DM-21877 /config/default_dataset_metrics.py -> manual merge needed verify_ DM-21877 /doc/lsst.verify/tasks/lsst.verify.tasks.PpdbMetricTask.rst : in the meantime this file was removed from master. Apparently moved to ApdbMetricTask.rst -> please check contents; e.g. changes in line 20-21 verify_ DM-21877 /python/lsst/verify/tasks/ppdbMetricTask.py Apparently moved to apdbMetricTask.py -> please check contents; e.g. changes in line 22 and changes in PpdbMetricConnections/ApdbMetricConnections. /ssd/gkovacs/devel/active/verify_ DM-21877 /doc/lsst.verify/tasks/lsst.verify.tasks.DirectPpdbLoader.rst -> needs updating to the "Apdb" terminology      
            Hide
            krzys Krzysztof Findeisen added a comment -

            See Slack for my response re: rebasing.

            In the future, please put all code comments on the GitHub pull request, per long-standing policy. Comments in Jira are much harder to track, discuss, or respond to.

            Show
            krzys Krzysztof Findeisen added a comment - See Slack for my response re: rebasing. In the future, please put all code comments on the GitHub pull request, per long-standing policy . Comments in Jira are much harder to track, discuss, or respond to.
            Hide
            gkovacs Gabor Kovacs added a comment -

            We've gone through the concepts behind this change, so everything looks good. Sorry for the bad timing leading to concurrent changes.

            Show
            gkovacs Gabor Kovacs added a comment - We've gone through the concepts behind this change, so everything looks good. Sorry for the bad timing leading to concurrent changes.

              People

              • Assignee:
                krzys Krzysztof Findeisen
                Reporter:
                krzys Krzysztof Findeisen
                Reviewers:
                Gabor Kovacs
                Watchers:
                Gabor Kovacs, Jim Bosch, Krzysztof Findeisen
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Summary Panel