Data Management / DM-33402

Gen3 RC2 reprocessing with bps and w_2022_04


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Submit path: /scratch/brendal4/bps-gen3-rc2/w_2022_04/submit/HSC/runs/RC2/w_2022_04/DM-33402
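
      The submit command itself isn't recorded in the ticket; a minimal sketch of the bps invocation that would produce a submit directory like the one above (the yaml filename is a placeholder):

      cd /scratch/brendal4/bps-gen3-rc2/w_2022_04
      bps submit rc2_step1.yaml   # placeholder yaml; bps writes under ./submit/<output collection>/<timestamp>/, per the Submit path above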


            Activity

            brendal4 Brock Brendal [X] (Inactive) added a comment -

            The run was restarted with a lower maximum number of jobs in order to put less strain on GPFS. Step 4 is now complete, with 36 of the same "Cannot compute CoaddPsf" errors that have been seen in previous RC2 runs (see RC2 w46):

            1206/62931_imageDifference_1206_50.8894067.err:lsst::pex::exceptions::InvalidParameterError: 'Cannot compute CoaddPsf at point (21420.8, 17809.5); no input images at that point.'
            

            Beginning step 5 now.
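
            For reference, that count of 36 can be reproduced with a plain grep over the step 4 job logs (the directory layout is assumed from the sample error line above, not shown in the ticket):

            grep -rl "Cannot compute CoaddPsf" . --include="*.err" | wc -l   # assumes one .err file per failed quantum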

            brendal4 Brock Brendal [X] (Inactive) added a comment -

            Step 5 submitted. The submission process took ~3.2 hrs.

            brendal4 Brock Brendal [X] (Inactive) added a comment -

            Step 5 had many jobs held for memory; I increased their RequestMemory and released them, after which they all ran to completion. That is, except for our usual culprits: the wPerp jobs. All 3 wPerp jobs failed, so I'm running them by hand with:

            pipetask --long-log --log-level=VERBOSE run -b /repo/main/butler.yaml -i HSC/RC2/defaults -o HSC/runs/RC2/w_2022_04/DM-33402 --qgraph /scratch/brendal4/bps-gen3-rc2/w_2022_04/submit/HSC/runs/RC2/w_2022_04/DM-33402/20220212T182516Z/HSC_runs_RC2_w_2022_04_DM-33402_20220212T182516Z.qgraph --qgraph-id 1644698236.5695796-35426 --qgraph-node-id 1311f4d2-464b-4f5e-8f2e-a1ff6d8071f0,48c7626a-a57c-4ca1-8f89-acb0275acf0e,30933a2b-a5ab-48ca-8f08-46ff705aab06 --clobber-outputs --skip-init-writes --extend-run --no-versions
            

            This worked for the task on tract 9697, but the tract 9813 task failed as it did for RC2 w50. I decided to let that one stay dead, and tried running the task for tract 9615 like this:

            pipetask --long-log --log-level=VERBOSE run -b /repo/main/butler.yaml -i HSC/RC2/defaults -o HSC/runs/RC2/w_2022_04/DM-33402 --qgraph /scratch/brendal4/bps-gen3-rc2/w_2022_04/submit/HSC/runs/RC2/w_2022_04/DM-33402/20220212T182516Z/HSC_runs_RC2_w_2022_04_DM-33402_20220212T182516Z.qgraph --qgraph-id 1644698236.5695796-35426 --qgraph-node-id 30933a2b-a5ab-48ca-8f08-46ff705aab06 --clobber-outputs --skip-init-writes --extend-run --no-versions
            

            which led to the error:

            ERROR 2022-02-16T13:04:15.156-06:00 lsst.daf.butler.cli.utils ()(utils.py:897) - Caught an exception, details are in traceback:
            Traceback (most recent call last):
              File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cli/cmd/commands.py", line 104, in run
                qgraph = script.qgraph(pipelineObj=pipeline, **kwargs)
              File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cli/script/qgraph.py", line 183, in qgraph
                qgraph = f.makeGraph(pipelineObj, args)
              File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 561, in makeGraph
                registry, collections, run = _ButlerFactory.makeRegistryAndCollections(args)
              File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 350, in makeRegistryAndCollections
                butler, inputs, self = cls._makeReadParts(args)
              File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 288, in _makeReadParts
                self.check(args)
              File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 237, in check
                raise ValueError(
            ValueError: Output CHAINED collection 'HSC/runs/RC2/w_2022_04/DM-33402' exists, but it ends with a different sequence of input collections than those given: 'HSC/calib/DM-33629' != 'HSC/raw/RC2/9813' in inputs=('HSC/raw/RC2/9615', 'HSC/raw/RC2/9697', 'HSC/raw/RC2/9813', 'HSC/calib/DM-33629', 'HSC/calib/gen2/20180117', 'HSC/calib/DM-28636', 'HSC/calib/gen2/20180117/unbounded', 'HSC/calib/DM-28636/unbounded', 'HSC/masks/s18a', 'HSC/fgcmcal/lut/RC2/DM-28636', 'refcats/DM-28636', 'skymaps') vs HSC/runs/RC2/w_2022_04/DM-33402=('HSC/runs/RC2/w_2022_04/DM-33402/20220212T182516Z', 'HSC/runs/RC2/w_2022_04/DM-33402/20220131T153208Z', 'HSC/runs/RC2/w_2022_04/DM-33402/20220128T212035Z', 'HSC/runs/RC2/w_2022_04/DM-33402/20220128T163843Z', 'HSC/runs/RC2/w_2022_04/DM-33402/20220126T162918Z', 'HSC/raw/RC2/9615', 'HSC/raw/RC2/9697', 'HSC/raw/RC2/9813', 'HSC/calib/gen2/20180117', 'HSC/calib/DM-28636', 'HSC/calib/gen2/20180117/unbounded', 'HSC/calib/DM-28636/unbounded', 'HSC/masks/s18a', 'HSC/fgcmcal/lut/RC2/DM-28636', 'refcats/DM-28636', 'skymaps').
            

            which I believe was caused by the inputs of the original output collection being changed halfway through this pipetask run (discussed in a Slack thread here). To get around this, I changed my pipetask command to use a different input collection and a different output collection:

            pipetask --long-log --log-level=VERBOSE run -b /repo/main/butler.yaml -i HSC/runs/RC2/w_2022_04/DM-33402 -o HSC/runs/RC2/w_2022_04/DM-33402-b --qgraph /scratch/brendal4/bps-gen3-rc2/w_2022_04/submit/HSC/runs/RC2/w_2022_04/DM-33402/20220212T182516Z/HSC_runs_RC2_w_2022_04_DM-33402_20220212T182516Z.qgraph --qgraph-id 1644698236.5695796-35426 --qgraph-node-id 30933a2b-a5ab-48ca-8f08-46ff705aab06 --clobber-outputs --skip-init-writes --no-versions
            
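            For anyone hitting the same ValueError: the chain's recorded input sequence can be listed with the daf_butler CLI and compared against the inputs being passed. A minimal sketch, not part of what was actually run here:

            butler query-collections /repo/main "HSC/runs/RC2/w_2022_04/DM-33402*"
            # If your daf_butler version supports it, adding --chains TREE expands each
            # CHAINED collection so the two input sequences can be compared directly.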

            For steps 6 and 7, I will update their submission yamls with the new output collection.
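
            The yaml edit itself isn't shown in this ticket. Assuming a standard ctrl_bps config, the output collection lives in the payload section, so the step 6/7 change would look roughly like this (the yaml filename is a placeholder):

            # In each step's submit yaml, point the payload at the new chain:
            #
            #   payload:
            #     output: "HSC/runs/RC2/w_2022_04/DM-33402-b"
            #
            bps submit rc2_step6.yaml   # placeholder filename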

            brendal4 Brock Brendal [X] (Inactive) added a comment -

            Steps 6 and 7 submitted and complete, wrapping up this weekly.

            brendal4 Brock Brendal [X] (Inactive) added a comment -

            This run has the same issue mentioned in w08 where the wPerp jobs are missing from the dashboard.
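
            Even with the dashboard gap, the wPerp outputs can be checked in the repo directly; a sketch, where the dataset-type glob is a guess rather than anything recorded in this run:

            butler query-datasets /repo/main --collections "HSC/runs/RC2/w_2022_04/DM-33402-b" "*wPerp*"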


              People

              Assignee:
              brendal4 Brock Brendal [X] (Inactive)
              Reporter:
              brendal4 Brock Brendal [X] (Inactive)
              Reviewers:
              Yusra AlSayyad
              Watchers:
              Brock Brendal [X] (Inactive), Yusra AlSayyad

