  1. Data Management
  2. DM-13619

Modify usage.py for NODE_FAIL/COMPLETED failure case


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      While completing DM-13578, it was recognized that sacct reported the following errors for some of the SLURM job IDs, due to the GPFS outage on 24 Oct. 2017:

      JobID        JobName    NNodes   Elapsed    State      ExitCode
      ------------ ---------- -------- ---------- ---------- --------
      Conflicting JOB_STEP record for jobstep 95210.0 at line 263284 -- ignoring it
      Conflicting JOB_STEP record for jobstep 95211.0 at line 263288 -- ignoring it
      Conflicting JOB_TERMINATED record (COMPLETED) for job 95210 at line 263355 -- ignoring it
      Conflicting JOB_TERMINATED record (COMPLETED) for job 95211 at line 263359 -- ignoring it
      95210        mtWide     3        00:04:28   NODE_FAIL  127:0
      95210.0      hydra_pmi+ 3        00:04:27   FAILED     7:0
      95210.1      hydra_pmi+ 3        07:44:48   COMPLETED  0:0
      95211        mtCosmos   4        00:04:04   NODE_FAIL  127:0
      95211.0      hydra_pmi+ 4        00:04:04   FAILED     7:0
      95211.1      hydra_pmi+ 4        10:10:13   COMPLETED  0:0
      

      So although each job initially failed (NODE_FAIL), it was later rerun successfully under the same JobID. Modify usage.py so that such jobs are included.
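
      The ticket does not show the code in usage.py, so the following is only a minimal sketch of one way to detect such jobs. The function names parse_sacct and rerun_completed are hypothetical, and the sketch assumes the whitespace-separated six-column sacct output quoted above. It skips the "Conflicting ..." warnings and reports job IDs whose job-level record is NODE_FAIL while one of their steps is COMPLETED.

```python
# Hypothetical sketch; not the actual usage.py implementation.
SAMPLE = """\
JobID JobName NNodes Elapsed State ExitCode
------------ ---------- -------- ---------- ---------- --------
Conflicting JOB_STEP record for jobstep 95210.0 at line 263284 -- ignoring it
Conflicting JOB_TERMINATED record (COMPLETED) for job 95210 at line 263355 -- ignoring it
95210 mtWide 3 00:04:28 NODE_FAIL 127:0
95210.0 hydra_pmi+ 3 00:04:27 FAILED 7:0
95210.1 hydra_pmi+ 3 07:44:48 COMPLETED 0:0
95211 mtCosmos 4 00:04:04 NODE_FAIL 127:0
95211.0 hydra_pmi+ 4 00:04:04 FAILED 7:0
95211.1 hydra_pmi+ 4 10:10:13 COMPLETED 0:0
"""

FIELDS = ("JobID", "JobName", "NNodes", "Elapsed", "State", "ExitCode")


def parse_sacct(text):
    """Parse sacct output into a list of dicts, one per job or job-step row."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header, the dashed rule, and "Conflicting ..."
        # warnings: none of them has exactly six fields starting with a digit.
        if len(parts) != len(FIELDS) or not parts[0][0].isdigit():
            continue
        rows.append(dict(zip(FIELDS, parts)))
    return rows


def rerun_completed(rows):
    """Return IDs of jobs marked NODE_FAIL that also have a COMPLETED step."""
    node_failed = {
        r["JobID"] for r in rows
        if "." not in r["JobID"] and r["State"] == "NODE_FAIL"
    }
    step_completed = {
        r["JobID"].split(".")[0] for r in rows
        if "." in r["JobID"] and r["State"] == "COMPLETED"
    }
    return sorted(node_failed & step_completed)


if __name__ == "__main__":
    print(rerun_completed(parse_sacct(SAMPLE)))
```

      Matching on the step records rather than on the job-level State is a design choice of this sketch: the job-level row keeps the NODE_FAIL state from the first attempt, so only the steps reveal that the rerun succeeded.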

            Activity

            sthrush Samantha Thrush added a comment -

            The code has been completed and the review should be fairly quick.


              People

              Assignee:
              sthrush Samantha Thrush
              Reporter:
              sthrush Samantha Thrush
              Reviewers:
              Mikolaj Kowalik
              Watchers:
              Hsin-Fang Chiang, Mikolaj Kowalik, Samantha Thrush
              Votes:
              0 Vote for this issue
