Details
- Type: Bug
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: ctrl_bps
- Epic Link:
- Team: Data Facility
- Urgent?: No
Description
Eli Rykoff reported that the BPS jobs for which automatic memory scaling was enabled remain idle in the job queue on the verification cluster at NCSA if the first run attempt failed due to insufficient memory.
Preliminary investigation suggests that an issue with the ClassAd expression governing memory scaling prevents HTCondor from finding a matching resource. From the output of condor_q -better-analyze -reverse 1836647.0:
Job 1836647.0 has the following attributes:

    TARGET.JobUniverse = 5
    TARGET.Nodeset = "NORMAL"
    TARGET.NumCkpts = 0
    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 3750
    TARGET.RequestMemory = error
    TARGET.Walltime = 259200
For future reference, I'm attaching the full output (1836647.0-analyze.out) as well as the job ClassAd (1836647.0-classad.out).
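As background on why RequestMemory = error leaves the job idle (a minimal sketch, not the actual ClassAd evaluator): ClassAd expressions use three-valued evaluation, so an operand of the wrong type anywhere in the memory-scaling expression turns the whole result into ERROR, and a job whose RequestMemory evaluates to ERROR cannot match any slot. A small Python model of that behavior, with purely illustrative values:

```python
# Sentinel values standing in for the ClassAd UNDEFINED and ERROR states.
UNDEFINED, ERROR = object(), object()

def classad_mul(a, b):
    """Multiply two ClassAd-style values: ERROR dominates, UNDEFINED
    propagates, and a non-numeric operand (e.g. a string) yields ERROR."""
    if a is ERROR or b is ERROR:
        return ERROR
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
        return ERROR
    return a * b

# A string-valued attribute in the scaling expression poisons the result:
print(classad_mul("2048", 2) is ERROR)        # True
# A merely missing attribute would propagate UNDEFINED instead:
print(classad_mul(UNDEFINED, 2) is UNDEFINED)  # True
# Well-typed operands scale as expected:
print(classad_mul(2048, 2))                    # 4096
```

This is why the symptom is a silently idle job rather than a hold or an error message: matchmaking simply never succeeds.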
Attachments
Activity
Field | Original Value | New Value
---|---|---
Labels | | ctr
Labels | ctr |
Component/s | | ctrl_bps [ 18701 ]
Status | To Do [ 10001 ] | In Progress [ 3 ]
Reviewers | | Michelle Gower [ mgower ]
Status | In Progress [ 3 ] | In Review [ 10004 ]
Status | In Review [ 10004 ] | Reviewed [ 10101 ]
Resolution | | Done [ 10000 ]
Status | Reviewed [ 10101 ] | Done [ 10002 ]
Labels | | backport-v23
Labels | backport-v23 | backport-v23 gen3-middleware
Labels | backport-v23 gen3-middleware | backport-approved backport-v23 gen3-middleware
Labels | backport-approved backport-v23 gen3-middleware | backport-approved backport-done backport-v23 gen3-middleware
Changes look good. Tested with a deliberately low request_memory on pipelines_check on the DAC cluster; the job retried twice, increasing the memory each time, and then finished successfully. Changes approved for merging.
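For anyone reproducing this test, a sketch of the relevant submit-config knobs for automatic memory scaling (option names follow the ctrl_bps documentation; the values are illustrative, not the ones used in the review):

```yaml
# Hypothetical excerpt of a BPS submit YAML exercising memory scaling.
requestMemory: 128      # deliberately low initial request (MiB) to force an OOM
memoryMultiplier: 2.0   # scale the request up on each retry after an OOM failure
numberOfRetries: 3      # allow a few scaled retries before giving up
```

With a multiplier above 1 and a low starting request, a short pipeline such as pipelines_check should fail, retry with a larger request, and eventually complete, matching the behavior described above.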