  Data Management / DM-32066

BPS jobs with memory autoscaling enabled remain idle after the first run attempt


    Details

      Description

      Eli Rykoff reported that BPS jobs with automatic memory scaling enabled remain idle in the job queue on the verification cluster at NCSA if their first run attempt fails due to insufficient memory.

      Preliminary investigation suggests that the ClassAd expression governing memory scaling is at fault: RequestMemory evaluates to error, which prevents HTCondor from finding a matching resource. The relevant output of condor_q -better-analyze -reverse 1836647.0:

      Job 1836647.0 has the following attributes:
       
          TARGET.JobUniverse = 5
          TARGET.Nodeset = "NORMAL"
          TARGET.NumCkpts = 0
          TARGET.RequestCpus = 1
          TARGET.RequestDisk = 3750
          TARGET.RequestMemory = error
          TARGET.Walltime = 259200
      

      For future reference, I'm attaching the full output (1836647.0-analyze.out) as well as the job ClassAd (1836647.0-classad.out).
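
The `error` value reported for TARGET.RequestMemory above is the detail that points at the memory scaling expression. As a rough aid for spotting such attributes in longer analyzer output, here is a minimal Python sketch. It assumes only the `TARGET.<Name> = <value>` line format visible in the excerpt; the helper name `find_unevaluated_attributes` is hypothetical and not part of BPS or HTCondor.

```python
import re

# Matches lines such as "    TARGET.RequestMemory = error"
# (format assumed from the condor_q excerpt in this ticket).
ATTR_LINE = re.compile(r"^\s*TARGET\.(\w+)\s*=\s*(.+?)\s*$")


def find_unevaluated_attributes(analyze_output):
    """Return attributes whose value is the ClassAd literal error or undefined."""
    bad = {}
    for line in analyze_output.splitlines():
        match = ATTR_LINE.match(line)
        # Quoted strings like "error" keep their quotes, so they are not flagged.
        if match and match.group(2).lower() in ("error", "undefined"):
            bad[match.group(1)] = match.group(2)
    return bad


# Attribute lines copied from the output quoted in this ticket.
sample = """\
Job 1836647.0 has the following attributes:

    TARGET.JobUniverse = 5
    TARGET.Nodeset = "NORMAL"
    TARGET.NumCkpts = 0
    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 3750
    TARGET.RequestMemory = error
    TARGET.Walltime = 259200
"""

print(find_unevaluated_attributes(sample))
```

An attribute flagged this way only says that the expression failed to evaluate. The reason has to be read from the expression itself in the job ClassAd.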

        Attachments

          Activity

          mgower Michelle Gower added a comment -

          Changes look good. I tested them with a very low request_memory on pipelines_check on the DAC cluster: the job was retried twice, with the memory increased on each retry, and then finished successfully. Changes approved for merging.


            People

            Assignee:
            mkowalik Mikolaj Kowalik
            Reporter:
            mkowalik Mikolaj Kowalik
            Reviewers:
            Michelle Gower
            Watchers:
            Michelle Gower, Mikolaj Kowalik

