  Data Management / DM-32248

Gen3 RC2 reprocessing with bps and w_2021_42


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Submit directory: /scratch/brendal4/bps-gen3-rc2/w_2021_42/submit/HSC/runs/RC2/w_2021_42

            Activity

            Brock Brendal added a comment -

            There was just a single error in step1, in characterizeImage:

            jobs/characterizeImage/36234/58733_characterizeImage_36234_24.5886370.err:    raise RuntimeError("Unable to measure aperture correction for required algorithm '%s': "
            jobs/characterizeImage/36234/58733_characterizeImage_36234_24.5886370.err:RuntimeError: Unable to measure aperture correction for required algorithm 'modelfit_CModel': only 1 sources, but require at least 2.
            

            I haven't seen this kind of error since DC2 w28.

            For step4, there were 32 failures, all in imageDifference. Some of the errors/exceptions were related to KernelCandidacy and other kernel-related problems; examples include:

            24538/68380_imageDifference_24538_103.6176959.err:    raise RuntimeError("Cannot find any objects suitable for KernelCandidacy")
            24538/68380_imageDifference_24538_103.6176959.err:RuntimeError: Cannot find any objects suitable for KernelCandidacy

            jobs/imageDifference/11734/64869_imageDifference_11734_93.6174150.err
            lsst::pex::exceptions::Exception: 'Original kernel does not exist {0}; Visiting candidate {1}'

            jobs/imageDifference/24536/66398_imageDifference_24536_102.6164279.err lsst::pex::exceptions::Exception: 'Unable to determine kernel sum; 0 candidates'
            

            I've run into the above errors in almost every RC2 weekly since w26 (w26, w30, w34, w38). I was under the impression that these might be expected failures, but they show up quite often.
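
            For quick triage, a rough way to tally these failures by exception type (a sketch that assumes the usual per-job .err files under jobs/imageDifference/ in the submit directory; the paths are illustrative):

            # Rough tally of exception messages across the step4 .err files
            grep -rhoE "(RuntimeError|lsst::pex::exceptions::[A-Za-z]+): .*" jobs/imageDifference/ | sort | uniq -c | sort -rn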

             
            The rest of the errors were like:

            1206/63914_imageDifference_1206_50.6178953.err:lsst.pex.exceptions.wrappers.InvalidParameterError:
            1206/63914_imageDifference_1206_50.6178953.err:lsst::pex::exceptions::InvalidParameterError: 'Cannot compute CoaddPsf at point (21420.8, 17809.5); no input images at that point.'
            

            which are already covered by a known issue: DM-31777

            Brock Brendal added a comment - edited

            Step 5 had an error that was new to me: just one failure, in consolidateForcedSourceOnDiaObjectTable on tract 9813 (out of 3 tracts total):

            9813/86805_consolidateForcedSourceOnDiaObjectTable_9813_.6368577.err:numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.61 GiB for an array with shape (2, 510445768) and data type int64
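
            (As a sanity check, the requested allocation matches the reported shape: 2 x 510445768 int64 values at 8 bytes each is about 7.61 GiB; a purely illustrative one-liner to confirm:)

            # 2 * 510445768 elements * 8 bytes (int64) / 2**30 bytes per GiB ~= 7.61
            python -c "print(2 * 510445768 * 8 / 2**30)"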
            

            There was also one failed job in consolidateForcedSourceTable, likewise on tract 9813:

            9813/731_consolidateForcedSourceTable_9813_.6368574.log: (0) Abnormal termination (signal 9)
            

            Usually the second error can be solved by simply increasing the memory request for that job, but after checking the memory usage for both of these jobs and increasing the memory request for each accordingly, resubmitting with the rescue file proved fruitless.
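
            For reference, a minimal sketch of that kind of per-task memory bump in the bps submission yaml, assuming the usual ctrl_bps per-task override keys (the values below are placeholders, not the actual requests used here):

            # Per-task overrides in the bps submit yaml; requestMemory is in MB.
            # Placeholder values for illustration only.
            pipetask:
              consolidateForcedSourceOnDiaObjectTable:
                requestMemory: 16000
              consolidateForcedSourceTable:
                requestMemory: 120000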

            Here's the path: /scratch/brendal4/bps-gen3-rc2/w_2021_42/submit/HSC/runs/RC2/w_2021_42/DM-32248/20211025T202954Z

            Is it okay to leave those two jobs behind and proceed? Or is it worth it to try and submit them manually on the prod submit node with a much higher memory request?

            P.S. consolidateForcedSourceTable only needed 208 MB and 598 MB for tracts 9615 and 9697, respectively, whereas the failed tract wanted ~85,000 MB on the first try and then ~111,000 MB on the second, which I found a bit strange.
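
            (One way to cross-check requested vs. measured memory for jobs like these, assuming HTCondor's standard RequestMemory/MemoryUsage job attributes and a hypothetical cluster id:)

            # Requested vs. peak memory (MB) for the jobs in a hypothetical cluster 12345
            condor_history 12345 -af RequestMemory MemoryUsage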

            Brock Brendal added a comment - edited

            In faro, the 3 wPerp jobs failed due to a recurring (and known) issue with too-large memory requests. I ran those 3 jobs by hand with "pipetask run", which was successful:

            pipetask --long-log run -b /repo/main/butler.yaml -i "HSC/runs/RC2/w_2021_42/DM-32248" --output "HSC/runs/RC2/w_2021_42/DM-32248" -p $FARO_DIR/pipelines/metrics_pipeline.yaml#wPerp -d "skymap='hsc_rings_v1' AND tract IN (9615, 9697, 9813) AND band!='N921'" --extend-run --skip-init-writes --skip-existing

            My mistake was twofold: not only did I use the wrong inCollection (it should have been HSC/RC2/defaults, as it had been for the rest of the run), but I also ran those three jobs while 21 other jobs from the same step were still held in the queue from the original attempt in which the 3 wPerp jobs initially failed.

            The fix for this SNAFU was to rerun the faro step with

            extraQgraphOptions: "--skip-existing-in HSC/runs/RC2/w_2021_42/DM-32248"

            in the submission yaml.
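
            For context, a sketch of roughly where that option sits in the submit yaml, assuming the usual ctrl_bps key layout (only the relevant keys are shown, and the payload values are illustrative apart from the collections named above):

            # Sketch of the relevant part of the faro-step submit yaml
            payload:
              butlerConfig: /repo/main/butler.yaml
              inCollection: HSC/RC2/defaults
              dataQuery: "skymap='hsc_rings_v1' AND tract IN (9615, 9697, 9813)"
            output: HSC/runs/RC2/w_2021_42/DM-32248
            extraQgraphOptions: "--skip-existing-in HSC/runs/RC2/w_2021_42/DM-32248"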

            This seems to have cleared things up, and it appears that all of the data which should be there are there. If someone discovers that something is missing, it is most likely due to this issue (Yusra AlSayyad).

            Otherwise, I think I can wrap this weekly up at this point, and dispatch metrics for ingestion.

            Yusra AlSayyad added a comment -

            Looking at the gen3 metrics ingested, https://chronograf-demo.lsst.codes/sources/2/dashboards/75?refresh=Paused&tempVars%5Btract%5D=9697&tempVars%5Bbutler_gen%5D=Gen3&tempVars%5Bband%5D=r&tempVars%5Bdataset%5D=HSC_RC2&tempVars%5Btime_start%5D=Use%20date%20picker&lower=2021-06-13T20%3A25%3A00.000Z&upper=2021-11-22T21%3A25%3A00.000Z# I only see bands g, r, i for AM1, AM2, PA1, TE1, and TE2. Would you check what happened to z and y?
            Brock Brendal added a comment -

            Yusra AlSayyad, I see what you mean... investigating now.

            Brock Brendal added a comment -

            Yusra AlSayyad, with the help of Michelle Gower and Simon Krughoff, we discovered that the missing data were in my repo all along; they just weren't ingested, due to an error on my part (I had two dispatches going at once, which confused things). The missing data are now where they should be in Chronograf!


              People

              Assignee:
              Brock Brendal
              Reporter:
              Brock Brendal
              Reviewers:
              Yusra AlSayyad
              Watchers:
              Brock Brendal, Yusra AlSayyad

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.