Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-32034

Create MatchProbabilistic (Pipeline)Task

    XMLWordPrintable

    Details

    • Story Points:
      5
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      Following DM-31781, wrap the ProbabilisticMatcher as a (Pipeline)Task generating DataFrames.

        Attachments

          Issue Links

            Activity

            Hide
            dtaranu Dan Taranu added a comment - - edited

            I experimented with using pydantic for configuring the matcher, but this proved to have no real benefits over pex_config, so I settled on the following structure:

            meas_astrom:

            • matcher_probabilistic.py: MatchProbabilisticConfig and MatcherProbabilistic (a plain old class). The most generic matching functions with no stack dependencies, only pex_config, so this could be spun off at some point (I don't know if there's a good way now).
            • match_probabilistic.py: MatchProbabilisticTask, including all stack dependencies. Also has a single test case in tests/test_match_probabilistic_task.py. Could maybe use more to cover other config options.

            pipe_tasks:

            • match_tract_catalog.py: MatchTractCatalogConfig/Task pipeline task with configurable abstract MatchTractCatalogSubConfig/Task classes, in case someone wants to use/implement another matcher.
            • match_tract_catalog_probabilistic.py: MatchTractCatalogProbabilisticConfig/Task classes that inherit from both MatchTractCatalogSubConfig/Task and MatchProbabilisticConfig/Task.

            I haven't added any tests to pipe_tasks yet; I'm not sure if they're warranted. If I were to do so, I'd probably copypasta from the existing test_matchFakes.py and/or test_parquet.py since that is the intended use.

            To verify that this works, I ran:

            pipetask --long-log run -b /repo/dc2 --input 2.2i/runs/test-med-1/w_2021_40/DM-32024,2.2i/truth_summary --output "u/dtaranu/DM-32034/w_2021_40_match_0.5asec" -d "instrument = 'LSSTCam-imSim' and tract=3828 and skymap='DC2'" -p /project/dtaranu/dc2/match_catalog_pipe/match_tract_catalog_probabilistic.yaml --register-dataset-types

            The associated yaml/config is valid for w40 and probably w44/6 and will eventually go into DRP pipelines, either after objectTable_tract generation, or as part of faro (TBD). In principle this could beusedto match against an observed reference catalog, not just a truth catalog (e.g. an HST-derived COSMOS catalog for HSC UDeep), but I haven't implemented using uncertainties in the reference catalog yet. I imagine I'd just add them in quadrature to each target source's errors.

            Once the matcher is being run as part of DRP, I'll update the notebook from DM-31781 to use the generated datasets to make DC2 vs truth plots.

            Show
            dtaranu Dan Taranu added a comment - - edited I experimented with using pydantic for configuring the matcher, but this proved to have no real benefits over pex_config, so I settled on the following structure: meas_astrom: matcher_probabilistic.py: MatchProbabilisticConfig and MatcherProbabilistic (a plain old class). The most generic matching functions with no stack dependencies, only pex_config, so this could be spun off at some point (I don't know if there's a good way now). match_probabilistic.py: MatchProbabilisticTask, including all stack dependencies. Also has a single test case in tests/test_match_probabilistic_task.py. Could maybe use more to cover other config options. pipe_tasks: match_tract_catalog.py: MatchTractCatalogConfig/Task pipeline task with configurable abstract MatchTractCatalogSubConfig/Task classes, in case someone wants to use/implement another matcher. match_tract_catalog_probabilistic.py: MatchTractCatalogProbabilisticConfig/Task classes that inherit from both MatchTractCatalogSubConfig/Task and MatchProbabilisticConfig/Task. I haven't added any tests to pipe_tasks yet; I'm not sure if they're warranted. If I were to do so, I'd probably copypasta from the existing test_matchFakes.py and/or test_parquet.py since that is the intended use. To verify that this works, I ran: pipetask --long-log run -b /repo/dc2 --input 2.2i/runs/test-med-1/w_2021_40/ DM-32024 ,2.2i/truth_summary --output "u/dtaranu/ DM-32034 /w_2021_40_match_0.5asec" -d "instrument = 'LSSTCam-imSim' and tract=3828 and skymap='DC2'" -p /project/dtaranu/dc2/match_catalog_pipe/match_tract_catalog_probabilistic.yaml --register-dataset-types The associated yaml/config is valid for w40 and probably w44/6 and will eventually go into DRP pipelines, either after objectTable_tract generation, or as part of faro (TBD). In principle this could beusedto match against an observed reference catalog, not just a truth catalog (e.g. an HST-derived COSMOS catalog for HSC UDeep), but I haven't implemented using uncertainties in the reference catalog yet. I imagine I'd just add them in quadrature to each target source's errors. Once the matcher is being run as part of DRP, I'll update the notebook from DM-31781 to use the generated datasets to make DC2 vs truth plots.
            Hide
            cmorrison Chris Morrison [X] (Inactive) added a comment -

            Looks good mostly. My main suggest is breaking up some of the code in the match method into sub methods to make the calculations more clear. Also adding inline comments would help.

            Show
            cmorrison Chris Morrison [X] (Inactive) added a comment - Looks good mostly. My main suggest is breaking up some of the code in the match method into sub methods to make the calculations more clear. Also adding inline comments would help.
            Hide
            dtaranu Dan Taranu added a comment -

            Passed Jenkins after yet another pipe_tasks merge.

            Thanks for the feedback. I added more inline comments. I don't really see a good way to refactor the method, because most of it is setting up variables that need to stay in scope for the inner loop. See if you agree based on the comments I added.

            Show
            dtaranu Dan Taranu added a comment - Passed Jenkins after yet another pipe_tasks merge. Thanks for the feedback. I added more inline comments. I don't really see a good way to refactor the method, because most of it is setting up variables that need to stay in scope for the inner loop. See if you agree based on the comments I added.
            Hide
            cmorrison Chris Morrison [X] (Inactive) added a comment -

            Okay. The only comment I have is comparing to data with NaN columns. From experience, you could have it that one of your bands has a NaN for a given value while all the rest of them are fine. If you want to add this functionality (assuming I've read the code correctly.) I'm happy to have it as another ticket.

            Show
            cmorrison Chris Morrison [X] (Inactive) added a comment - Okay. The only comment I have is comparing to data with NaN columns. From experience, you could have it that one of your bands has a NaN for a given value while all the rest of them are fine. If you want to add this functionality (assuming I've read the code correctly.) I'm happy to have it as another ticket.
            Hide
            dtaranu Dan Taranu added a comment -

            That already exists with the match_n_finite_min field in MatchProbabilisticConfig. For example, if you're matching on two spatial coordinates and three fluxes, you can set it to 4 to allow one NaN chi value (presumably a flux). For that matter, you can also choose to compute chi without using spatial coordinates at all, although I'm not sure why anyone would want to.

            If someone wants more fine-grained control of which columns require measurements, like say requiring g + r fluxes but not u or izy, I think that would require more config fields, but I didn't see a need for that feature at the moment.

            Show
            dtaranu Dan Taranu added a comment - That already exists with the match_n_finite_min field in MatchProbabilisticConfig . For example, if you're matching on two spatial coordinates and three fluxes, you can set it to 4 to allow one NaN chi value (presumably a flux). For that matter, you can also choose to compute chi without using spatial coordinates at all, although I'm not sure why anyone would want to. If someone wants more fine-grained control of which columns require measurements, like say requiring g + r fluxes but not u or izy, I think that would require more config fields, but I didn't see a need for that feature at the moment.

              People

              Assignee:
              dtaranu Dan Taranu
              Reporter:
              dtaranu Dan Taranu
              Reviewers:
              Chris Morrison [X] (Inactive)
              Watchers:
              Chris Morrison [X] (Inactive), Dan Taranu, Nima Sedaghat
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.