Details
Type: Bug
Status: Invalid
Resolution: Done
Fix Version/s: None
Component/s: None
Labels:
Epic Link:
Team: Data Release Production
Urgent?: No
Description
forcedPhotCcd is stalling (failing to produce more output) for most (but not all) RC2 jobs.
The relevant logs are in:
/datasets/hsc/repo/rerun/RC/w_2021_18/DM-29946/logs/forcedPhotCcd
9615-HSC-G
9615-HSC-I
9697-HSC-G
These three worked, but all other jobs timed out after 48 hours. The timed-out jobs were rerun and stalled on the same line as before (for 3.5 days).
One example, frCcd-9813-HSC-G-51939.log, ends with:
forcedPhotCcd.references INFO: Getting references in
forcedPhotCcd.references INFO: Getting references in {'tract': 9813, 'patch': '8,2'}
forcedPhotCcd INFO: Performing forced measurement on {'ccd': 101, 'tract': 9813, 'visit': 11692, 'pointing': 1052, 'filter': 'HSC-G', 'field': 'SSP_UDEEP_COSMOS', 'dateObs': '2014-11-18', 'taiObs': '2014-11-18', 'expTime': 300.0}
forcedPhotCcd.measurement INFO: Performing forced measurement on 11638 sources
forcedPhotCcd.applyApCorr INFO: Applying aperture corrections to 3 instFlux fields
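To check which other jobs stopped at this same point, a small script along the following lines could scan the log directory and list every log whose last non-blank line is the aperture-correction message. This is only a sketch: the helper names, the `*.log` glob pattern, and the assumption that a stalled job always ends on this exact message are illustrative and not part of the pipeline.

```python
from pathlib import Path

# Assumption: a stalled job's log ends with this message, as in the example above.
STALL_MARKER = "forcedPhotCcd.applyApCorr INFO: Applying aperture corrections"


def last_nonempty_line(path):
    """Return the last non-blank line of a text file, or '' if there is none."""
    last = ""
    with open(path, errors="replace") as f:
        for line in f:
            if line.strip():
                last = line.strip()
    return last


def find_stalled_logs(log_dir, marker=STALL_MARKER, pattern="*.log"):
    """Return sorted names of logs in log_dir whose last non-blank line starts with marker."""
    return sorted(
        p.name
        for p in Path(log_dir).glob(pattern)
        if last_nonempty_line(p).startswith(marker)
    )
```

For example, `find_stalled_logs("/datasets/hsc/repo/rerun/RC/w_2021_18/DM-29946/logs/forcedPhotCcd")` would return the names of the logs that end on that line, which can then be compared against the list of jobs that timed out.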
Logs for the previous steps are available in:
/datasets/hsc/repo/rerun/RC/w_2021_18/DM-29946/logs/
Caveat:
Due to SLURM weirdness, a previous run failed (specifically, FGCM could not finish after being run once on an incomplete set). Everything in /datasets/hsc/repo/rerun/RC/w_2021_18 was removed with rm -f. The directory was confirmed to be empty. Everything was rerun and went fairly smoothly. But maybe there was some mysterious FGCM output someplace else?
Things seem to have gotten worse from w22 to w26; see the comment on DM-30864. (Most tract/band combinations timed out after 48h when running forcedPhotCcd with w26, while they ran through fine with w22; DM-30424.)