Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-32376

ap_verify gen3 fails to find jointcal_photoCalib dataset in graph generation

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: pipe_tasks
    • Labels:
      None
    • Story Points:
      1
    • Epic Link:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      ap_verify gen3 failed last night with the following error:

      RuntimeError: 2 dataset(s) of type 'jointcal_photoCalib' was/were present in a previous query, but could not be found now.This is either a logic bug in QuantumGraph generation or the input collections have been modified since QuantumGraph generation began.
      

      It seems likely that this was caused by DM-31491, which just landed and at a minimum reordered the dimensions for this dataset. https://github.com/lsst/pipe_tasks/commit/e5427992456a875016af14c907ea1f8bd786fb6f#diff-d0c181586a7ef6ce433ff7baee9b2fa03ab7de10015e6cafa57741a90a424d2dL77-R94

      I believe that there is also a typo later in the same PR: https://github.com/lsst/pipe_tasks/commit/e5427992456a875016af14c907ea1f8bd786fb6f#diff-d0c181586a7ef6ce433ff7baee9b2fa03ab7de10015e6cafa57741a90a424d2dR369

      Fixing this before the weekly would be highly desirable.

        Attachments

          Issue Links

            Activity

            Hide
            erykoff Eli Rykoff added a comment -

            Looking at this, I notice that the dimensions of the new fake externalSkyWcsTractCatalog https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/processCcdWithFakes.py#L73-L79
            are different than the dimensions of the old externalSkyWcsTractCatalog
            https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/makeCoaddTempExp.py#L617-L623

            And I think this might be the problem. The original one is missing skymap as a dimension; does this mean that this is wrong?

            Show
            erykoff Eli Rykoff added a comment - Looking at this, I notice that the dimensions of the new fake externalSkyWcsTractCatalog https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/processCcdWithFakes.py#L73-L79 are different than the dimensions of the old externalSkyWcsTractCatalog https://github.com/lsst/pipe_tasks/blob/master/python/lsst/pipe/tasks/makeCoaddTempExp.py#L617-L623 And I think this might be the problem. The original one is missing skymap as a dimension; does this mean that this is wrong?
            Hide
            erykoff Eli Rykoff added a comment -

            Though Jim says this will be expanded and compare as equal so maybe a red herring.

            Show
            erykoff Eli Rykoff added a comment - Though Jim says this will be expanded and compare as equal so maybe a red herring.
            Hide
            jbosch Jim Bosch added a comment -

            This exception is not one that should happen from regular user error. The fact that it does was already reported in DM-32058, and was supposed to be fixed on DM-31769.

            Show
            jbosch Jim Bosch added a comment - This exception is not one that should happen from regular user error. The fact that it does was already reported in DM-32058 , and was supposed to be fixed on DM-31769 .
            Hide
            sophiereed Sophie Reed added a comment - - edited

            I think the dimensions were wrong before and didn't work. In lots of testing this seemed to be the best solution to make it work with jointcal and fgcm. At least the dimensions in processCcdWithFakes, if I didn't have skymap in there then I got an error about not having skymap and the dimensions not matching those in the registry.

            Show
            sophiereed Sophie Reed added a comment - - edited I think the dimensions were wrong before and didn't work. In lots of testing this seemed to be the best solution to make it work with jointcal and fgcm. At least the dimensions in processCcdWithFakes, if I didn't have skymap in there then I got an error about not having skymap and the dimensions not matching those in the registry.
            Hide
            sophiereed Sophie Reed added a comment - - edited

            I found the bug and have tested ap_verify. I pulled ap_verify and ap_verify_ci_cosmos_pdr2 and then ran:

            ap_verify.py --dataset ap_verify_ci_cosmos_pdr2 --gen3 --output /project/sreed/apVerifyPdr2/ -p pipelines/ApVerifyWithFakes.yaml

            from the ap_verify_ci_cosmos_pdr2 repo.

            I also have a jenkins run going: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/35271/pipeline

            I ran: 

            pipetask run -b /repo/main/butler.yaml -i HSC/runs/RC2/w_2021_42/DM-32248 -p /home/sr525/repos/obs_subaru/pipelines/DRPFakes.yaml -d "instrument='HSC' and skymap='hsc_rings_v1' and (band='i') and (detector=82 or detector=81 or detector=80) and (visit=1228 or visit=1230)" -i u/sr525/ssiCats -o u/sr525/fakesTest6 --register-dataset-types

            which completed.

            These were all run with: 

            source /project/yusra/lsst_stack/loadLSST.bash
            setup lsst_distrib -t d_2021_10_25

            Show
            sophiereed Sophie Reed added a comment - - edited I found the bug and have tested ap_verify. I pulled ap_verify and ap_verify_ci_cosmos_pdr2 and then ran: ap_verify.py --dataset ap_verify_ci_cosmos_pdr2 --gen3 --output /project/sreed/apVerifyPdr2/ -p pipelines/ApVerifyWithFakes.yaml from the ap_verify_ci_cosmos_pdr2 repo. I also have a jenkins run going: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/35271/pipeline I ran:  pipetask run -b /repo/main/butler.yaml -i HSC/runs/RC2/w_2021_42/ DM-32248 -p /home/sr525/repos/obs_subaru/pipelines/DRPFakes.yaml -d "instrument='HSC' and skymap='hsc_rings_v1' and (band='i') and (detector=82 or detector=81 or detector=80) and (visit=1228 or visit=1230)" -i u/sr525/ssiCats -o u/sr525/fakesTest6 --register-dataset-types which completed. These were all run with:  source /project/yusra/lsst_stack/loadLSST.bash setup lsst_distrib -t d_2021_10_25
            Hide
            lskelvin Lee Kelvin added a comment -

            Thanks Sophie, and thanks Chris for your help too. This ticket removes unnecessary wcs and photoCalib code from processCcdWithFakes.py, and corrects a problematic typo that crops up further down the line. I can confirm that I'm able to successfully run both ap_verify and DRPFakes with this ticket.

            For testing, I first set up:

            setup lsst_distrib -t w_2021_43
            setup -j -r ap_verify
            setup -j -r ap_verify_ci_cosmos_pdr2
            setup -j -r pipe_tasks
            setup -j -r obs_subaru
            setup -j -r ci_hsc_gen3
            

            (all on their master branches). I believe these are all the repos which were touched since DM-31491 dropped, plus the required ap_verify repos.

            Next, I confirm that running on 'master':

            ap_verify.py -j 8 \
            --dataset ap_verify_ci_cosmos_pdr2 \
            --output /project/lskelvin/apverify/run/ \
            --gen3 \
            --data-query "visit in (59150, 59160) and band='g'" \
            -p $AP_VERIFY_CI_COSMOS_PDR2_DIR/pipelines/ApVerifyWithFakes.yaml \
            2>&1 | tee log-ApVerifyWithFakes.log
            

            fails with the above joincal error message.

            Switching pipe_tasks from 'master' to 'tickets/DM-32376' and then re-running now allows for ap_verify to complete successfully, with outputs as expected.

            I also ran on this ticket branch:

            pipetask --long-log run --register-dataset-types \
            -b /repo/main \
            -i HSC/runs/RC2/w_2021_42/DM-32248,u/sr525/ssiCats \
            -o u/lskelvin/ap_verify_fakes_test_ticket \
            -p $OBS_SUBARU_DIR/pipelines/DRPFakes.yaml \
            -d "instrument='HSC' AND skymap='hsc_rings_v1' AND band='i' AND detector=1 AND visit=1228" \
            2>&1 | tee log-DRPFakes-ticket.log
            

            which also appears to complete successfully, producing outputs as expected.

            Assuming Jenkins doesn't produce any red flags, then I think this is good to merge. Thanks!

            Show
            lskelvin Lee Kelvin added a comment - Thanks Sophie, and thanks Chris for your help too. This ticket removes unnecessary wcs and photoCalib code from processCcdWithFakes.py , and corrects a problematic typo that crops up further down the line. I can confirm that I'm able to successfully run both ap_verify and DRPFakes with this ticket. For testing, I first set up: setup lsst_distrib -t w_2021_43 setup -j -r ap_verify setup -j -r ap_verify_ci_cosmos_pdr2 setup -j -r pipe_tasks setup -j -r obs_subaru setup -j -r ci_hsc_gen3 (all on their master branches). I believe these are all the repos which were touched since DM-31491 dropped, plus the required ap_verify repos. Next, I confirm that running on 'master': ap_verify.py -j 8 \ --dataset ap_verify_ci_cosmos_pdr2 \ --output /project/lskelvin/apverify/run/ \ --gen3 \ --data-query "visit in (59150, 59160) and band='g'" \ -p $AP_VERIFY_CI_COSMOS_PDR2_DIR/pipelines/ApVerifyWithFakes.yaml \ 2>&1 | tee log-ApVerifyWithFakes.log fails with the above joincal error message. Switching pipe_tasks from 'master' to 'tickets/ DM-32376 ' and then re-running now allows for ap_verify to complete successfully, with outputs as expected. I also ran on this ticket branch: pipetask --long-log run --register-dataset-types \ -b /repo/main \ -i HSC/runs/RC2/w_2021_42/DM-32248,u/sr525/ssiCats \ -o u/lskelvin/ap_verify_fakes_test_ticket \ -p $OBS_SUBARU_DIR/pipelines/DRPFakes.yaml \ -d "instrument='HSC' AND skymap='hsc_rings_v1' AND band='i' AND detector=1 AND visit=1228" \ 2>&1 | tee log-DRPFakes-ticket.log which also appears to complete successfully, producing outputs as expected. Assuming Jenkins doesn't produce any red flags, then I think this is good to merge. Thanks!
            Hide
            yusra Yusra AlSayyad added a comment -

            I see a backport was requested from Tim. This ticket is a bugfix for DM-31491 which was merged the night before and broke ap_verify. Backporting this without DM-31491 does not make sense.

            Show
            yusra Yusra AlSayyad added a comment - I see a backport was requested from Tim. This ticket is a bugfix for DM-31491 which was merged the night before and broke ap_verify. Backporting this without DM-31491 does not make sense.
            Hide
            tjenness Tim Jenness added a comment -

            I only added the request when we thought that ci_hsc_gen3 was broken and this branch had the fix in it. Now that we know that ci_hsc_gen3 had the tag in the wrong place it is no longer important.

            Show
            tjenness Tim Jenness added a comment - I only added the request when we thought that ci_hsc_gen3 was broken and this branch had the fix in it. Now that we know that ci_hsc_gen3 had the tag in the wrong place it is no longer important.
            Hide
            nlust Nate Lust added a comment - - edited

            Tim is correct that ci_hsc_gen3 is now able to run after the pin move, however a problem still remains. If someone uses v23 obs_subaru fakes pipeline in the way that ci_hsc_gen3 (on the newer ref) was trying to,  the system will fail. Moving the pin on ci_hsc_gen3 back only works because it reverts it to a state where it is not using the obs_subaru fakes pipeline that exists in v23, but was instead using it's own fakes pipeline.

             

            Specifically we need this line from this change on pipe_tasks: https://github.com/lsst/pipe_tasks/commit/80aa66dc92c83c1e284b116ae07675d6559376d3#diff-7daf2a8d292dc80b5c277b573b18346f30955635071479a011e78be5e76531b3R315

            There is some confusion if this was the right ticket because of a past rebase issue, but that does not matter so much as getting the change set onto v23.

            Show
            nlust Nate Lust added a comment - - edited Tim is correct that ci_hsc_gen3 is now able to run after the pin move, however a problem still remains. If someone uses v23 obs_subaru fakes pipeline in the way that ci_hsc_gen3 (on the newer ref) was trying to,  the system will fail. Moving the pin on ci_hsc_gen3 back only works because it reverts it to a state where it is not using the obs_subaru fakes pipeline that exists in v23, but was instead using it's own fakes pipeline.   Specifically we need this line from this change on pipe_tasks: https://github.com/lsst/pipe_tasks/commit/80aa66dc92c83c1e284b116ae07675d6559376d3#diff-7daf2a8d292dc80b5c277b573b18346f30955635071479a011e78be5e76531b3R315 There is some confusion if this was the right ticket because of a past rebase issue, but that does not matter so much as getting the change set onto v23.

              People

              Assignee:
              sophiereed Sophie Reed
              Reporter:
              ktl Kian-Tat Lim
              Reviewers:
              Lee Kelvin
              Watchers:
              Eli Rykoff, Jim Bosch, Kian-Tat Lim, Lee Kelvin, Nate Lust, Sophie Reed, Tim Jenness, Yusra AlSayyad
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.