  1. Data Management
  2. DM-13667

singleFrameDriver of the HSC PDR1 dataset

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Run singleFrameDriver/ProcessCcd on the entire HSC PDR1 dataset.

        Attachments

        1. fsfm-deep-136273.out
          46 kB
        2. fsfm-udeep-136275.out
          82 kB
        3. fsfm-wide-136933.out
          1.02 MB

          Issue Links

            Activity

            hchiang2 Hsin-Fang Chiang created issue -
            hchiang2 Hsin-Fang Chiang made changes -
            Field Original Value New Value
            Link This issue is child task of DM-13666 [ DM-13666 ]
            hchiang2 Hsin-Fang Chiang made changes -
            Epic Link DM-13926 [ 39699 ]
            hchiang2 Hsin-Fang Chiang made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            hchiang2 Hsin-Fang Chiang made changes -
            Risk Score 0
            hchiang2 Hsin-Fang Chiang made changes -
            Link This issue relates to DM-14181 [ DM-14181 ]
            hchiang2 Hsin-Fang Chiang made changes -
            Comment [ Roughly 30% of jobs failed because Slurm could not launch them. In each case {{singleFrameDriver.py}} had successfully submitted the job to Slurm and the job had waited in the queue for its turn; the failure occurred once the job's turn came and Slurm tried to start it. The failures were not restricted to one specific worker node.

            Log records like the following are seen:
            {noformat}
            srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
            srun: error: Task launch for 136725.0 failed on node lsst-verify-worker03: Socket timed out on send/recv operation
            srun: error: Application launch failed: Socket timed out on send/recv operation
            srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
            slurmstepd: error: *** STEP 136725.0 ON lsst-verify-worker02 CANCELLED AT 2018-04-22T20:53:48 ***
            srun: error: lsst-verify-worker02: task 0: Killed
            srun: error: lsst-verify-worker04: task 2: Killed
            [mpiexec@lsst-verify-worker02] control_cb (pm/pmiserv/pmiserv_cb.c:208): assert (!closed) failed
            [mpiexec@lsst-verify-worker02] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
            [mpiexec@lsst-verify-worker02] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
            [mpiexec@lsst-verify-worker02] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
            {noformat}
             
            All failed jobs were resubmitted iteratively. Many failed again on resubmission, but all were eventually pushed through.
              ]
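To decide which jobs need resubmitting, the launch-failure signature in the log excerpt above can be searched for mechanically. The following is a minimal sketch, not the procedure actually used for this ticket: the function names `find_launch_failures` and `failures_per_node` are hypothetical, and the pattern is assumed from the single `srun` "Task launch ... failed on node" line quoted in the comment, so it may need adjusting for other Slurm versions or message variants.

```python
import re
from collections import Counter

# Pattern assumed from the srun error line quoted in the comment above.
LAUNCH_FAILURE = re.compile(
    r"srun: error: Task launch for (?P<step>\d+\.\d+) failed on node "
    r"(?P<node>\S+): Socket timed out on send/recv operation"
)


def find_launch_failures(log_text):
    """Return a list of (job_step, node) pairs, one per failed task launch."""
    return [
        (match.group("step"), match.group("node"))
        for match in LAUNCH_FAILURE.finditer(log_text)
    ]


def failures_per_node(failures):
    """Count launch failures per node, to check whether one node dominates."""
    return Counter(node for _step, node in failures)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python find_failures.py job1.out job2.out ...
    all_failures = []
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            all_failures.extend(find_launch_failures(handle.read()))
    for step, node in all_failures:
        print(step, node)
    print(dict(failures_per_node(all_failures)))
```

The job step IDs printed by the script identify candidates for resubmission, and the per-node counts give a quick check on whether the failures cluster on a particular worker.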
            hchiang2 Hsin-Fang Chiang made changes -
            Attachment fsfm-deep-136273.out [ 32547 ]
            Attachment fsfm-udeep-136275.out [ 32548 ]
            Attachment fsfm-wide-136933.out [ 32549 ]
            hchiang2 Hsin-Fang Chiang made changes -
            Link This issue relates to IHS-1009 [ IHS-1009 ]
            hchiang2 Hsin-Fang Chiang made changes -
            Story Points 7
            hchiang2 Hsin-Fang Chiang made changes -
            Resolution Done [ 10000 ]
            Status In Progress [ 3 ] Done [ 10002 ]

              People

              Assignee:
              hchiang2 Hsin-Fang Chiang
              Reporter:
              hchiang2 Hsin-Fang Chiang
              Watchers:
              Hsin-Fang Chiang
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:
