Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-21877

Create "marker" Butler dataset for PPDB

    XMLWordPrintable

Details

    Description

      Discussing the Gen 3 migration of lsst.ap.pipe.ApPipeTask and lsst.verify.tasks.PpdbMetricTask on #dm-science-pipelines, jbosch and I concluded that the cleanest trigger for running a PpdbMetricTask would be to provide a dataset that indicates that the database contains all information for a particular data ID. I believe this can also simplify the problem of providing Butler-like input to PpdbMetricTask.

      Prototype this sytem in a Gen 2 pipeline by doing the following:

      • Create a new dataset type (ppdbConfig?), with visit+ccd granularity (matching the units of work of ApPipeTask).
      • Make ApPipeTask dump its PPDB config at the end of DB-related processing for a particular data ID. Note that while the DB config is fixed, we still should have a dataset for each data ID so that Gen 3 code could know which units of processing are "DB complete".
      • Create a new Task for creating a PPDB from a ppdbConfig, then configure PpdbMetricTask to use it in ap_verify runs.

      These steps should test all aspects of the system except for tracking dependencies between pipeline and metric tasks, which is a Gen 3-only feature.

      Attachments

        Issue Links

          Activity

            This ticket should also review the dimensionality of lsst.verify.tasks.PpdbMetricConnections. The only extant concrete PpdbMetricTask, lsst.ap.association.metrics.TotalUnassociatedDiaObjectsMetricTask, runs at collection granularity and produces a collection-grained measurement, but finer-grained database metrics are possible. Adding fine-grained inputs, as proposed above, is a good chance to make sure the outputs do everything we want.

            krzys Krzysztof Findeisen added a comment - This ticket should also review the dimensionality of lsst.verify.tasks.PpdbMetricConnections . The only extant concrete PpdbMetricTask , lsst.ap.association.metrics.TotalUnassociatedDiaObjectsMetricTask , runs at collection granularity and produces a collection-grained measurement, but finer-grained database metrics are possible. Adding fine-grained inputs, as proposed above, is a good chance to make sure the outputs do everything we want.

            jbosch, in the Slack discussion you said:

            I think Gen2 and Gen3 already both support get/put for nested dict <-> yaml, which seems like it'd be a good choice for that.

            If in the Gen 2 datasets.yaml I put

            persistable: Config
            storage: ConfigStorage
            

            will that use YAML or otherwise be forward-compatible with Gen 3? I'm a bit confused over whether or not we've migrated away from Python code as our config format.

            krzys Krzysztof Findeisen added a comment - jbosch , in the Slack discussion you said: I think Gen2 and Gen3 already both support get / put for nested dict <-> yaml, which seems like it'd be a good choice for that. If in the Gen 2 datasets.yaml I put persistable: Config storage: ConfigStorage will that use YAML or otherwise be forward-compatible with Gen 3? I'm a bit confused over whether or not we've migrated away from Python code as our config format.
            jbosch Jim Bosch added a comment -

            pex_config is still around, and is definitely still Python code rather than yaml.  And I'm pretty sure those Gen2 declarations correspond to pex_config.

            But we have started moving towards YAML for things other than pipeline/task configuration (starting with the Gen2 butler's policy files of the sort you're editing).  I assume the metadata datasets would provide an example of how to configure a YAML dataset in Gen2.  Gen3 configures file formats quite differently, so no need to worry about forward compatibility.

            jbosch Jim Bosch added a comment - pex_config is still around, and is definitely still Python code rather than yaml.  And I'm pretty sure those Gen2 declarations correspond to pex_config. But we have started moving towards YAML for things other than pipeline/task configuration (starting with the Gen2 butler's policy files of the sort you're editing).  I assume the metadata datasets would provide an example of how to configure a YAML dataset in Gen2.  Gen3 configures file formats quite differently, so no need to worry about forward compatibility.
            krzys Krzysztof Findeisen added a comment - - edited

            Hi gkovacs, would you be willing to review this work? It's about 240 lines, most of them in verify and ap_pipe.

            krzys Krzysztof Findeisen added a comment - - edited Hi gkovacs , would you be willing to review this work? It's about 240 lines, most of them in verify and ap_pipe .

            The obs_* packages look good. "apdb_marker" dataset definitions are just for Gen2 compatibility, right?

             

            /verify_DM-21877/tests/test_directPpdbLoader.py: 47: _dummyPpdbConfig -> static method; Do we usually call static methods via an instance self._dummyPpdbConfig() ?


            Ouch.... Perhaps due to my sluggish response to the review request, it seems that concurrent changes have happened here by the merging of DM-22039.
            Would you please rebase it to master? I think there will be places where a 3-way diff/merge is needed among the "common base", "this (DM-21877)" and "master".

            There are a few lines with concurrent changes and a few files that was changed here in DM-21877 while was renamed in the meantime (and also changed) in DM-22039.

            I spotted the following:

            • obs_base/policy/datasets.yaml -> now Apdb
            • ap_pipe.py: lines 212; 300-310 -> manual merge needed
            • ap_verify_DM-21877/config/default_dataset_metrics.py -> manual merge needed
            • verify_DM-21877/doc/lsst.verify/tasks/lsst.verify.tasks.PpdbMetricTask.rst : in the meantime this file was removed from master. Apparently moved to ApdbMetricTask.rst -> please check contents; e.g. changes in line 20-21
            • verify_DM-21877/python/lsst/verify/tasks/ppdbMetricTask.py Apparently moved to apdbMetricTask.py -> please check contents; e.g. changes in line 22 and changes in PpdbMetricConnections/ApdbMetricConnections.
            • /ssd/gkovacs/devel/active/verify_DM-21877/doc/lsst.verify/tasks/lsst.verify.tasks.DirectPpdbLoader.rst -> needs updating to the "Apdb" terminology

             

             

             

            gkovacs Gabor Kovacs [X] (Inactive) added a comment - - edited The obs_* packages look good. "apdb_marker" dataset definitions are just for Gen2 compatibility, right?   /verify_ DM-21877 /tests/test_directPpdbLoader.py: 47: _dummyPpdbConfig -> static method; Do we usually call static methods via an instance self._dummyPpdbConfig() ? Ouch.... Perhaps due to my sluggish response to the review request, it seems that concurrent changes have happened here by the merging of DM-22039 . Would you please rebase it to master? I think there will be places where a 3-way diff/merge is needed among the "common base", "this ( DM-21877 )" and "master". There are a few lines with concurrent changes and a few files that was changed here in DM-21877 while was renamed in the meantime (and also changed) in DM-22039 . I spotted the following: obs_base/policy/datasets.yaml -> now Apdb ap_pipe.py: lines 212; 300-310 -> manual merge needed ap_verify_ DM-21877 /config/default_dataset_metrics.py -> manual merge needed verify_ DM-21877 /doc/lsst.verify/tasks/lsst.verify.tasks.PpdbMetricTask.rst : in the meantime this file was removed from master. Apparently moved to ApdbMetricTask.rst -> please check contents; e.g. changes in line 20-21 verify_ DM-21877 /python/lsst/verify/tasks/ppdbMetricTask.py Apparently moved to apdbMetricTask.py -> please check contents; e.g. changes in line 22 and changes in PpdbMetricConnections/ApdbMetricConnections. /ssd/gkovacs/devel/active/verify_ DM-21877 /doc/lsst.verify/tasks/lsst.verify.tasks.DirectPpdbLoader.rst -> needs updating to the "Apdb" terminology      

            See Slack for my response re: rebasing.

            In the future, please put all code comments on the GitHub pull request, per long-standing policy. Comments in Jira are much harder to track, discuss, or respond to.

            krzys Krzysztof Findeisen added a comment - See Slack for my response re: rebasing. In the future, please put all code comments on the GitHub pull request, per long-standing policy . Comments in Jira are much harder to track, discuss, or respond to.

            We've gone through the concepts behind this change, so everything looks good. Sorry for the bad timing leading to concurrent changes.

            gkovacs Gabor Kovacs [X] (Inactive) added a comment - We've gone through the concepts behind this change, so everything looks good. Sorry for the bad timing leading to concurrent changes.

            People

              krzys Krzysztof Findeisen
              krzys Krzysztof Findeisen
              Gabor Kovacs [X] (Inactive)
              Gabor Kovacs [X] (Inactive), Jim Bosch, Krzysztof Findeisen
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Jenkins

                  No builds found.