added a comment - Step 5 had many jobs held for memory; I increased their RequestMemory and released them, after which they all ran to completion. That is, except for our usual culprits: the wPerp jobs. All 3 wPerp jobs failed, so I'm rerunning them by hand with
pipetask --long-log --log-level=VERBOSE run -b /repo/main/butler.yaml -i HSC/RC2/defaults -o HSC/runs/RC2/w_2022_04/DM-33402 --qgraph /scratch/brendal4/bps-gen3-rc2/w_2022_04/submit/HSC/runs/RC2/w_2022_04/DM-33402/20220212T182516Z/HSC_runs_RC2_w_2022_04_DM-33402_20220212T182516Z.qgraph --qgraph-id 1644698236.5695796-35426 --qgraph-node-id 1311f4d2-464b-4f5e-8f2e-a1ff6d8071f0,48c7626a-a57c-4ca1-8f89-acb0275acf0e,30933a2b-a5ab-48ca-8f08-46ff705aab06 --clobber-outputs --skip-init-writes --extend-run --no-versions
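For reference, bumping the memory request on a held HTCondor job and then releasing it typically looks something like the following (a sketch only; the job ID and memory value are placeholders, not the actual ones used here):
condor_qedit 12345.0 RequestMemory 16384   # raise the memory request (in MB) on the held job
condor_release 12345.0                     # release the job back to the queue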
The manual pipetask run worked for the task on tract 9697, but tract 9813 failed as it did for RC2 w50. I decided to let that one stay dead, and tried running the task for tract 9615 like this:
pipetask --long-log --log-level=VERBOSE run -b /repo/main/butler.yaml -i HSC/RC2/defaults -o HSC/runs/RC2/w_2022_04/DM-33402 --qgraph /scratch/brendal4/bps-gen3-rc2/w_2022_04/submit/HSC/runs/RC2/w_2022_04/DM-33402/20220212T182516Z/HSC_runs_RC2_w_2022_04_DM-33402_20220212T182516Z.qgraph --qgraph-id 1644698236.5695796-35426 --qgraph-node-id 30933a2b-a5ab-48ca-8f08-46ff705aab06 --clobber-outputs --skip-init-writes --extend-run --no-versions
which led to the error:
ERROR 2022-02-16T13:04:15.156-06:00 lsst.daf.butler.cli.utils ()(utils.py:897) - Caught an exception, details are in traceback:
Traceback (most recent call last):
  File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cli/cmd/commands.py", line 104, in run
    qgraph = script.qgraph(pipelineObj=pipeline, **kwargs)
  File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cli/script/qgraph.py", line 183, in qgraph
    qgraph = f.makeGraph(pipelineObj, args)
  File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 561, in makeGraph
    registry, collections, run = _ButlerFactory.makeRegistryAndCollections(args)
  File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 350, in makeRegistryAndCollections
    butler, inputs, self = cls._makeReadParts(args)
  File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 288, in _makeReadParts
    self.check(args)
  File "/software/lsstsw/stack_20220125/stack/miniconda3-py38_4.9.2-1.0.0/Linux64/ctrl_mpexec/g80d878e12a+0b7fd0f2a5/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 237, in check
    raise ValueError(
ValueError: Output CHAINED collection 'HSC/runs/RC2/w_2022_04/DM-33402' exists, but it ends with a different sequence of input collections than those given: 'HSC/calib/DM-33629' != 'HSC/raw/RC2/9813' in inputs=('HSC/raw/RC2/9615', 'HSC/raw/RC2/9697', 'HSC/raw/RC2/9813', 'HSC/calib/DM-33629', 'HSC/calib/gen2/20180117', 'HSC/calib/DM-28636', 'HSC/calib/gen2/20180117/unbounded', 'HSC/calib/DM-28636/unbounded', 'HSC/masks/s18a', 'HSC/fgcmcal/lut/RC2/DM-28636', 'refcats/DM-28636', 'skymaps') vs HSC/runs/RC2/w_2022_04/DM-33402=('HSC/runs/RC2/w_2022_04/DM-33402/20220212T182516Z', 'HSC/runs/RC2/w_2022_04/DM-33402/20220131T153208Z', 'HSC/runs/RC2/w_2022_04/DM-33402/20220128T212035Z', 'HSC/runs/RC2/w_2022_04/DM-33402/20220128T163843Z', 'HSC/runs/RC2/w_2022_04/DM-33402/20220126T162918Z', 'HSC/raw/RC2/9615', 'HSC/raw/RC2/9697', 'HSC/raw/RC2/9813', 'HSC/calib/gen2/20180117', 'HSC/calib/DM-28636', 'HSC/calib/gen2/20180117/unbounded', 'HSC/calib/DM-28636/unbounded', 'HSC/masks/s18a', 'HSC/fgcmcal/lut/RC2/DM-28636', 'refcats/DM-28636', 'skymaps').
which I believe was caused by the original input collection being changed halfway through this pipetask run (discussed in a Slack thread here). To get around this, I changed my pipetask command to use a different input collection and a different output collection:
pipetask --long-log --log-level=VERBOSE run -b /repo/main/butler.yaml -i HSC/runs/RC2/w_2022_04/DM-33402 -o HSC/runs/RC2/w_2022_04/DM-33402-b --qgraph /scratch/brendal4/bps-gen3-rc2/w_2022_04/submit/HSC/runs/RC2/w_2022_04/DM-33402/20220212T182516Z/HSC_runs_RC2_w_2022_04_DM-33402_20220212T182516Z.qgraph --qgraph-id 1644698236.5695796-35426 --qgraph-node-id 30933a2b-a5ab-48ca-8f08-46ff705aab06 --clobber-outputs --skip-init-writes --no-versions
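As a sanity check, the current definition of the CHAINED output collection can be inspected with the butler CLI, along these lines (a sketch; the exact --chains option spelling may vary between stack versions):
butler query-collections /repo/main HSC/runs/RC2/w_2022_04/DM-33402 --chains TREE   # show the chain and its child collections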
For steps 6 and 7, I will update their submission yamls with the new output collection.
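Concretely, the edit to the step 6/7 submission yamls would look roughly like this (a sketch of the payload section only; key names follow the usual ctrl_bps conventions, and the values are illustrative rather than copied from the actual yamls):
payload:
  butlerConfig: /repo/main/butler.yaml
  inCollection: HSC/runs/RC2/w_2022_04/DM-33402
  output: HSC/runs/RC2/w_2022_04/DM-33402-b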
The run was restarted with a lower maximum number of jobs in order to put less strain on the GPFS. Step 4 is now complete, with 36 of the same "Cannot compute CoaddPsf" errors that have been seen in previous RC2 runs (see RC2 w46).
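For context, with a bps-over-HTCondor workflow a concurrency cap can be expressed through DAGMan, for example like this (a generic sketch, not necessarily the mechanism used for this restart; the DAG file name and the limit are placeholders):
condor_submit_dag -maxjobs 1000 <workflow>.dag   # keep at most 1000 jobs submitted to the queue at once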
Beginning step 5 now.