Details
Type: Bug
Status: Done
Resolution: Done
Fix Version/s: None
Component/s: ctrl_bps
Epic Link:
Team: Data Facility
Urgent?: No
Description
Eli Rykoff reported that BPS jobs with automatic memory scaling enabled remain idle in the job queue on the verification cluster at NCSA if their first run attempt failed due to insufficient memory.
Preliminary investigation suggests a problem with the ClassAd expression governing memory scaling: it prevents HTCondor from finding a matching resource. In the output of condor_q -better-analyze -reverse 1836647.0, RequestMemory evaluates to error:
Job 1836647.0 has the following attributes:
TARGET.JobUniverse = 5
TARGET.Nodeset = "NORMAL"
TARGET.NumCkpts = 0
TARGET.RequestCpus = 1
TARGET.RequestDisk = 3750
TARGET.RequestMemory = error
TARGET.Walltime = 259200
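When triaging similar reports, it can help to scan the analyzer output for attributes that evaluate to error. The following is a minimal sketch only: the function name find_error_attributes is hypothetical, and it assumes the attribute lines have the "TARGET.Name = value" shape shown above.

```python
import re
from typing import List

# Matches attribute lines such as: TARGET.RequestMemory = error
# (the "TARGET.<name> = <value>" shape is assumed from the excerpt in this ticket)
ATTR_LINE = re.compile(r"^\s*TARGET\.(\w+)\s*=\s*(.+?)\s*$")


def find_error_attributes(analyze_output: str) -> List[str]:
    """Return the names of attributes whose reported value is 'error'."""
    bad = []
    for line in analyze_output.splitlines():
        match = ATTR_LINE.match(line)
        if match and match.group(2).lower() == "error":
            bad.append(match.group(1))
    return bad


# Attribute lines copied from the output quoted in this ticket.
sample = """\
TARGET.JobUniverse = 5
TARGET.Nodeset = "NORMAL"
TARGET.NumCkpts = 0
TARGET.RequestCpus = 1
TARGET.RequestDisk = 3750
TARGET.RequestMemory = error
TARGET.Walltime = 259200
"""

print(find_error_attributes(sample))  # ['RequestMemory']
```

For this job the only flagged attribute is RequestMemory, which is consistent with the suspicion that the memory scaling expression is at fault.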
For future reference, I'm attaching the full output (1836647.0-analyze.out) as well as the job ClassAd (1836647.0-classad.out).
Changes look good. I tested them with a very low request_memory on pipelines_check on the DAC cluster: the job retried twice, increasing the memory each time, and then finished successfully. Changes approved for merging.
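For readers unfamiliar with memory scaling on retry, the behaviour observed in the test above (two retries, each with more memory) can be illustrated with a small sketch. This is an assumption-laden illustration, not the expression used by ctrl_bps: the geometric rule, the function name scaled_memory, and all numeric values are hypothetical, since the ticket states neither the actual rule nor the values used.

```python
def scaled_memory(initial_mb: int, multiplier: float, attempt: int, max_mb: int) -> int:
    """Memory request in MB for a given attempt (0 = first run), capped at max_mb.

    Assumes geometric growth: initial_mb * multiplier ** attempt.
    """
    if initial_mb <= 0 or multiplier < 1 or attempt < 0 or max_mb < initial_mb:
        raise ValueError("invalid memory scaling parameters")
    return min(int(initial_mb * multiplier ** attempt), max_mb)


# First run plus two retries, with illustrative values only.
requests = [scaled_memory(256, 2.0, attempt, 4096) for attempt in range(3)]
print(requests)  # [256, 512, 1024]
```

The cap matters: without an upper bound, a scaled request can exceed what any machine in the pool offers, and the job would then stay idle for lack of a matching resource.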