# Gen3 RC2 reprocessing with bps and w_2021_42


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels: None
• Team: Data Facility
• Urgent?: No

#### Description

Submit directory: /scratch/brendal4/bps-gen3-rc2/w_2021_42/submit/HSC/runs/RC2/w_2021_42

#### Activity

Brock Brendal [X] (Inactive) added a comment -

There was just a single error in step1, in characterizeImage:

```
jobs/characterizeImage/36234/58733_characterizeImage_36234_24.5886370.err:    raise RuntimeError("Unable to measure aperture correction for required algorithm '%s': "
jobs/characterizeImage/36234/58733_characterizeImage_36234_24.5886370.err:RuntimeError: Unable to measure aperture correction for required algorithm 'modelfit_CModel': only 1 sources, but require at least 2.
```

I haven't seen this kind of error since DC2 w28.
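For context on the failure mode: the aperture-correction fit needs more usable sources than the minimum it requires, and characterizeImage turns a shortfall into the RuntimeError seen above. A hedged sketch of that check (illustrative only, not the actual LSST meas_algorithms code; the function name and signature here are made up):

```python
def fit_aperture_correction(sources, algorithm, min_sources=2):
    """Pretend fit that fails exactly the way the log message describes."""
    if len(sources) < min_sources:
        raise RuntimeError(
            f"Unable to measure aperture correction for required algorithm "
            f"'{algorithm}': only {len(sources)} sources, "
            f"but require at least {min_sources}."
        )
    # Stand-in for the real fit over the measured sources.
    return sum(sources) / len(sources)

# A detector with a single usable source reproduces the error:
try:
    fit_aperture_correction([1.0], "modelfit_CModel")
except RuntimeError as exc:
    print(exc)
```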

For step4, there were 32 failures, all in imageDifference. Some of the errors/exceptions had to do with KernelCandidacy and various other kernel issues, for example:

```
24538/68380_imageDifference_24538_103.6176959.err:    raise RuntimeError("Cannot find any objects suitable for KernelCandidacy")
24538/68380_imageDifference_24538_103.6176959.err:RuntimeError: Cannot find any objects suitable for KernelCandidacy

jobs/imageDifference/11734/64869_imageDifference_11734_93.6174150.err: lsst::pex::exceptions::Exception: 'Original kernel does not exist {0}; Visiting candidate {1}'

jobs/imageDifference/24536/66398_imageDifference_24536_102.6164279.err: lsst::pex::exceptions::Exception: 'Unable to determine kernel sum; 0 candidates'
```

I've run into the above errors in almost every RC2 weekly since w26 (w26, w30, w34, and w38). I was under the impression that these might be expected failures, but they show up quite often.

The rest of the errors were of the form:

```
1206/63914_imageDifference_1206_50.6178953.err:lsst.pex.exceptions.wrappers.InvalidParameterError:
1206/63914_imageDifference_1206_50.6178953.err:lsst::pex::exceptions::InvalidParameterError: 'Cannot compute CoaddPsf at point (21420.8, 17809.5); no input images at that point.'
```

which are already known: DM-31777

Brock Brendal [X] (Inactive) added a comment - edited

Step 5 had an error that was new to me: just one failure, in consolidateForcedSourceOnDiaObjectTable on tract 9813 (out of 3 tracts total):

```
9813/86805_consolidateForcedSourceOnDiaObjectTable_9813_.6368577.err:numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.61 GiB for an array with shape (2, 510445768) and data type int64
```
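The requested size checks out against the shape in the message: a (2, 510445768) array of int64 is 2 × 510,445,768 elements at 8 bytes each, which comes to about 7.61 GiB:

```python
# Verify the allocation size numpy reported for the failed job:
# shape (2, 510445768), dtype int64 (8 bytes per element).
rows, cols = 2, 510_445_768
nbytes = rows * cols * 8
gib = nbytes / 2**30
print(f"{gib:.2f} GiB")  # matches the 7.61 GiB in the error message
```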

There was also 1 failed job in consolidateForcedSourceTable also on tract 9813:

```
9813/731_consolidateForcedSourceTable_9813_.6368574.log: (0) Abnormal termination (signal 9)
```

Usually, the second error is solved by a simple increase to the memory request for that job. However, after checking the memory usage for both of these jobs and increasing each request accordingly, resubmission with the rescue file proved fruitless.

Here's the path: /scratch/brendal4/bps-gen3-rc2/w_2021_42/submit/HSC/runs/RC2/w_2021_42/DM-32248/20211025T202954Z

Is it okay to leave those two jobs behind and proceed? Or is it worth it to try and submit them manually on the prod submit node with a much higher memory request?

P.S. consolidateForcedSourceTable only needed 208 MB and 598 MB for tracts 9615 and 9697, respectively, whereas the failed tract wanted ~85000 MB on the first try and then ~111000 MB on the second, which I found a bit strange.

Brock Brendal [X] (Inactive) added a comment - edited

In faro, the 3 wPerp jobs failed due to a recurring (and known) too-large memory request issue. I attempted to run those 3 jobs by hand with `pipetask run`, which was successful:

```
pipetask --long-log run -b /repo/main/butler.yaml -i "HSC/runs/RC2/w_2021_42/DM-32248" --output "HSC/runs/RC2/w_2021_42/DM-32248" -p $FARO_DIR/pipelines/metrics_pipeline.yaml#wPerp -d "skymap='hsc_rings_v1' AND tract IN (9615, 9697, 9813) AND band!='N921'" --extend-run --skip-init-writes --skip-existing
```

My mistake was not only having the wrong input collection (it should have been HSC/RC2/defaults, as it had been for the rest of the run), but also running those three jobs while 21 other jobs from the same step were still held in the queue from the original attempt when the 3 wPerp jobs initially failed.

The fix for this SNAFU was to rerun the faro step with `extraQgraphOptions: "--skip-existing-in HSC/runs/RC2/w_2021_42/DM-32248"` in the submission yaml. This seems to have cleared things up, and all of the data that should be there appear to be there. If someone discovers that something is missing, it is most likely due to this issue (Yusra AlSayyad). Otherwise, I think I can wrap up this weekly and dispatch metrics for ingestion.
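Conceptually, `--skip-existing-in` prunes quanta whose outputs already exist in the named collection at quantum-graph generation time, so only unfinished work gets rerun. A toy sketch of that pruning logic (illustrative only, not the actual pipe_base implementation; the data layout here is made up):

```python
def prune_quanta(quanta, existing_outputs):
    """Keep only quanta with at least one output missing from the collection."""
    return [q for q in quanta
            if not set(q["outputs"]).issubset(existing_outputs)]

# Toy example: two quanta finished in the original attempt, one did not.
quanta = [
    {"task": "wPerp", "outputs": {"metric_g"}},
    {"task": "wPerp", "outputs": {"metric_r"}},
    {"task": "wPerp", "outputs": {"metric_i"}},
]
done = {"metric_g", "metric_r"}
print([q["outputs"] for q in prune_quanta(quanta, done)])  # [{'metric_i'}]
```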
Yusra AlSayyad added a comment -

Looking at the gen3 metrics ingested (https://chronograf-demo.lsst.codes/sources/2/dashboards/75?refresh=Paused&tempVars%5Btract%5D=9697&tempVars%5Bbutler_gen%5D=Gen3&tempVars%5Bband%5D=r&tempVars%5Bdataset%5D=HSC_RC2&tempVars%5Btime_start%5D=Use%20date%20picker&lower=2021-06-13T20%3A25%3A00.000Z&upper=2021-11-22T21%3A25%3A00.000Z#), I only see bands g, r, i for AM1, AM2, PA1, TE1, and TE2. Would you check what happened to z and y?
Brock Brendal [X] (Inactive) added a comment -

Yusra AlSayyad I see what you mean... investigating now.

Brock Brendal [X] (Inactive) added a comment -

Yusra AlSayyad With the help of Michelle Gower and Simon Krughoff, we discovered that the missing data were in my repo; they had just missed getting ingested due to an error on my part (I had two dispatches going at once, which confused things). The missing data are now where they should be in Chronograf!


#### People

Assignee:
Brock Brendal [X] (Inactive)
Reporter:
Brock Brendal [X] (Inactive)
Reviewers:
Yusra AlSayyad
Watchers:
Brock Brendal [X] (Inactive), Yusra AlSayyad


#### Jenkins

No builds found.