  1. Data Management
  2. DM-30271

forcedPhotCcd stalling for most RC2 jobs

    Details

    • Type: Bug
    • Status: Invalid
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      forcedPhotCcd is stalling (failing to produce more output) for most (but not all) RC2 jobs.

      The relevant logs are in:
      /datasets/hsc/repo/rerun/RC/w_2021_18/DM-29946/logs/forcedPhotCcd

      9615-HSC-G
      9615-HSC-I
      9697-HSC-G

      Worked, but all other jobs timed out after 48 hours. They were rerun and stalled again on the same line (for 3.5 days).

      One example, frCcd-9813-HSC-G-51939.log, ends with:

      forcedPhotCcd.references INFO: Getting references in {'tract': 9813, 'patch': '8,1'}
      forcedPhotCcd.references INFO: Getting references in {'tract': 9813, 'patch': '8,2'}
      forcedPhotCcd INFO: Performing forced measurement on {'ccd': 101, 'tract': 9813, 'visit': 11692, 'pointing': 1052, 'filter': 'HSC-G', 'field': 'SSP_UDEEP_COSMOS', 'dateObs': '2014-11-18', 'taiObs': '2014-11-18', 'expTime': 300.0}
      forcedPhotCcd.measurement INFO: Performing forced measurement on 11638 sources
      forcedPhotCcd.applyApCorr INFO: Applying aperture corrections to 3 instFlux fields
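
To see how many jobs stopped at this same point, the log directory can be scanned for files whose last non-empty line is the applyApCorr message. The following is a minimal sketch, not part of the pipeline. It assumes that ending on that message indicates a stall (a job that is still running, or one that completed without logging anything further, would look the same) and that all logs follow the `frCcd-*.log` naming of the one example above. The helper names are hypothetical.

```python
from pathlib import Path

# Assumption: a log that ends on this message is treated as stalled.
STALL_MARKER = "forcedPhotCcd.applyApCorr INFO: Applying aperture corrections"


def last_nonempty_line(text):
    """Return the last non-blank line of a log's text, or '' if there is none."""
    for line in reversed(text.splitlines()):
        if line.strip():
            return line.strip()
    return ""


def looks_stalled(text, marker=STALL_MARKER):
    """True if the final non-blank line starts with the marker (hypothetical criterion)."""
    return last_nonempty_line(text).startswith(marker)


def find_stalled_logs(log_dir, pattern="frCcd-*.log"):
    """List names of logs under log_dir whose final line matches the marker.

    The glob pattern is an assumption based on a single example file name.
    """
    stalled = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if looks_stalled(path.read_text(errors="replace")):
            stalled.append(path.name)
    return stalled
```

Calling `find_stalled_logs` on the forcedPhotCcd log directory given above would return the candidate file names, which can then be compared against the list of jobs that timed out.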

      Logs for the previous step are available in:
      /datasets/hsc/repo/rerun/RC/w_2021_18/DM-29946/logs/

      Caveat:

      Due to SLURM weirdness, a previous run failed (specifically, FGCM could not finish after being run once on an incomplete set). Everything in /datasets/hsc/repo/rerun/RC/w_2021_18 was rm -f. The directory was confirmed to be empty. Everything was rerun and went fairly smoothly. But maybe there was some mysterious FGCM output someplace else?
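
One way to look for leftover FGCM output elsewhere is to walk a candidate directory tree and list entries whose names mention FGCM. This is only a sketch: it assumes stray outputs would have "fgcm" somewhere in a file or directory name, which this ticket does not confirm, and the function name is hypothetical.

```python
import os


def find_named_entries(root, needle="fgcm"):
    """Return sorted paths under root whose file or directory name contains needle.

    Matching is case-insensitive and applies to the entry name only, not to
    the full path. The default needle is an assumption about how stray FGCM
    output would be named.
    """
    needle = needle.lower()
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if needle in name.lower():
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

An empty result only rules out entries named this way under the chosen root. It does not show that no stale calibration products exist.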

            Activity

            mschmitz Morgan Schmitz added a comment -

            Things seem to have gotten worse from w22 to w26; see comment on DM-30864.

            (Most tract/band combinations timed out after 48h when running forcedPhotCcd with w26, while they ran fine with w22; see DM-30424.)

            erykoff Eli Rykoff added a comment -

            This is a stale ticket, but I believe it is fixed by DM-41012, which addresses the location in the code that would have caused this problem, so I'm marking it invalid.


              People

              Assignee:
              Unassigned
              Reporter:
              emorganson Eric Morganson [X] (Inactive)
              Watchers:
              Eli Rykoff, Eric Morganson [X] (Inactive), Morgan Schmitz