Data Management / DM-29899

DC2 Reprocessing with w_2021_16 (Gen3)


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Story Points:
      5
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      Reprocess test-med-1 with w_2021_16 (and necessary fixes) to test gen3 pipeline and compare with gen2 conversion, following up on DM-29396.

Activity

Dan Taranu added a comment (edited)

            I ran pipeline_bps.yaml in /project/dtaranu/dc2_gen3/w_2021_16/ with w_2021_16 and two additional fixes:

/home/dtaranu/tix/DM-29899/meas_extensions_shapeHSM: fixes to shapeHSM in w17 (DM-29863); these should not have affected anything besides the measure tasks, but without them most (or all) of the measure quanta would have failed.
/home/dtaranu/tix/DM-29830/obs_lsst: config overrides in $OBS_LSST_DIR/pipelines/imsim/DRP.yaml, which are in for w18.

            See bps report --id /project/dtaranu/dc2_gen3/w_2021_16/submit/2.2i/runs/test-med-1/w_2021_16/DM-29899/20210427T212511Z for the initial run. Several failures were reported:

            mergeMeasurements, forcedPhotCcd, forcedPhotCoadd, forcedPhotDiffim: all transient database failures; I think the filesystem was having issues at the time. I am re-running all of the failures (see forcedPhot_resume_bps.yaml, forcedPhotDiffim_resume_bps.yaml, mergeMeasurements_resume_bps.yaml and corresponding log files) and they should finish later today.

            writeObjectTable: all failed due to DM-29943, and of course transformObjectTable and consolidateObjectTable didn't run as a result. Although that ticket isn't merged, the root cause was fixed in DM-29907 for w18, so these should all work for w20.

            imageDifference: 1523/11104 quanta failed for the following reasons:
            178 raise RuntimeError("Cannot find any objects suitable for KernelCandidacy")
            13 raise RuntimeError("No coadd PhotoCalib found!")
            1169 raise RuntimeError("Templates constructed from multiple Tracts not yet supported")
            163 lsst::pex::exceptions::Exception: 'Unable to determine kernel sum; 0 candidates'

            I checked a few random coadds and compared with the gen2 run (DM-29770) and found that they differed significantly. Then checking a few of the input calexps, I found those also differed, both at the pixel level and in the masks (perhaps the latter caused the former?). I would have compared postISRCCD outputs but gen2 doesn't output those by default; we may want to enable that for the w20 run if that's the most obvious line of inquiry. Perhaps Lauren MacArthur can advise as she found mask-related gen2/3 parity issues in HSC recently.

Lauren MacArthur added a comment

The gen2 vs. gen3 coadds are almost certainly going to have differences at this stage (I’m still working through a few SFM kinks seen in HSC RC2 runs on DM-29819, and already know there are fgcm & jointcal differences, DM-29820 & DM-29821... I haven’t looked beyond that yet...). I’d be happy to have a look into the SFM differences you’re seeing for DC2 data. Let me know if I should spin off a separate ticket for that or if you’d prefer I post updates here.

Dan Taranu added a comment (edited)

We're not running fgcm or jointcal, so it can't be that. I think it's definitely worth a separate ticket. For the record:

I picked a calexp from testdata_ci_imsim (ci_imsim is working and will be reviewed soon-ish, but the necessary data is in /repo/dc2 anyway), and the gen2 and gen3 versions are far from equivalent:

            import lsst.daf.butler as dafButler
            import lsst.daf.persistence as dafPersist
            import numpy as np
            butler_dc2_gen3 = dafButler.Butler('/repo/dc2', collections='2.2i/runs/test-med-1/w_2021_16/DM-29899')
            butler_dc2_gen2 = dafPersist.Butler('/datasets/DC2/repoRun2.2i/rerun/w_2021_16/DM-29770/sfm')
            calexp_2 = butler_dc2_gen2.get('calexp', visit=257768, detector=161)
            calexp_3 = butler_dc2_gen3.get('calexp', visit=257768, detector=161)
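# Count mask pixels that differ and compute the median absolute image
# difference between the gen2 and gen3 versions of the same calexp.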
            np.sum(calexp_2.getMask().getArray() != calexp_3.getMask().getArray())
            np.median(np.abs(calexp_2.getImage().getArray() - calexp_3.getImage().getArray()))
            

... shows 530 mask pixels differing and a median absolute image difference of 0.00053, which isn't huge, but it seems to propagate to larger differences in the coadds (sky subtraction differences?).

Lauren MacArthur added a comment

            Thanks Dan.  I'm on it: DM-30048.

Dan Taranu added a comment (edited)

            The issue should be resolved in DM-30076 for w20, so gen2/3 should have bitwise parity in singleFrame.
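
Once the w20 outputs exist, parity could be spot-checked along the lines of the earlier calexp comparison. This is a sketch only; the gen3 collection and gen2 rerun path below are placeholders, not actual run locations:

import lsst.daf.butler as dafButler
import lsst.daf.persistence as dafPersist
import numpy as np

# Placeholder w20 repositories/collections; substitute the actual run locations.
butler_gen3 = dafButler.Butler('/repo/dc2', collections='2.2i/runs/test-med-1/w_2021_20/PLACEHOLDER')
butler_gen2 = dafPersist.Butler('/datasets/DC2/repoRun2.2i/rerun/w_2021_20/PLACEHOLDER/sfm')
calexp_2 = butler_gen2.get('calexp', visit=257768, detector=161)
calexp_3 = butler_gen3.get('calexp', visit=257768, detector=161)
# Bitwise parity in singleFrame means the image, mask, and variance planes match exactly.
for plane_2, plane_3 in [
    (calexp_2.getImage(), calexp_3.getImage()),
    (calexp_2.getMask(), calexp_3.getMask()),
    (calexp_2.getVariance(), calexp_3.getVariance()),
]:
    assert np.array_equal(plane_2.getArray(), plane_3.getArray())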

            I checked the max memory usage for each task as follows:

            ls -A1 submit/2.2i/runs/test-med-1/w_2021_16/DM-29899/20210427T212511Z/jobs/ | awk '{print "grep \"Memory (\" submit/2.2i/runs/test-med-1/w_2021_16/DM-29899/20210427T212511Z/jobs/" $1 "/*.log | awk -F\"[[:blank:]]*[:]*[[:blank:]]*\" '\''{print \""$1"\",$4}'\'' | sort -rg -k 2,2 | head -n 1 >> memory.txt"}' > memory.bash
            

The contents of memory.txt (per-task peak memory, in MB) from running memory.bash:

            assembleCoadd 1350
            calibrate 615
            characterizeImage 567
            consolidateSourceTable 255
            consolidateVisitSummary 175
            deblend 7856
            detection 686
            forcedPhotCcd 2778
            forcedPhotCoadd 1805
            forcedPhotDiffim 2769
            imageDifference 2460
            init 578
            isr 690
            makeWarp 1422
            measure 2487
            mergeDetections 209
            mergeMeasurements 986
            selectGoodSeeingVisits 163
            templateGen 1311
            transformSourceTable 191
            writeObjectTable 887
            writeSourceTable 197
            

I put a BPS yaml with these memory requests plus some extra headroom in /project/dtaranu/dc2_gen3/w_2021_20/pipeline_bps.yaml. Unfortunately, some testing suggests these are not actually good estimates of peak memory usage: I tried re-running with only 1024 MB for all of the singleFrame tasks and many of them were held. So I gave quite a lot more headroom, keeping in mind that with DM-29918 it's not impossible for jobs to get killed for exceeding their request rather than just getting held.
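
For the record, a rough Python equivalent of the memory.bash extraction above (a sketch only, not what was actually run; it assumes the HTCondor job event logs sit directly under jobs/<task>/ and report usage on lines like "Memory (MB)  :  360  2048  2048"):

import glob
import os
import re

# BPS submit directory for the initial run (see the bps report command above).
submit_dir = "submit/2.2i/runs/test-med-1/w_2021_16/DM-29899/20210427T212511Z"
# The first number after the colon on the HTCondor "Memory (MB)" line is the usage.
memory_line = re.compile(r"Memory \(MB\)\s*:\s*(\d+)")

peak_by_task = {}
for task in sorted(os.listdir(os.path.join(submit_dir, "jobs"))):
    peak = 0
    for log_path in glob.glob(os.path.join(submit_dir, "jobs", task, "*.log")):
        with open(log_path) as log_file:
            for line in log_file:
                match = memory_line.search(line)
                if match:
                    peak = max(peak, int(match.group(1)))
    peak_by_task[task] = peak

for task, peak in sorted(peak_by_task.items()):
    print(task, peak)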

Monika Adamow added a comment

            Everything looks good. I copied your yaml file for processing and will start it with w20. 


People

Assignee:
Dan Taranu
Reporter:
Dan Taranu
Reviewers:
Monika Adamow
Watchers:
Dan Taranu, Huan Lin, Lauren MacArthur, Monika Adamow

