Details
Type: Story
Status: Done
Resolution: Done
Fix Version/s: None
Labels:
Epic Link:
Team: Data Facility
Urgent?: No
Description
The "fgcmFitCycle" pipeline task currently (as of w_2021_17) uses a lot of memory with the configuration present in obs_subaru. Reducing that memory use is covered by DM-29916. A separate issue is that, when running on BPS/HTCondor, the job appears to hang rather than ending up as the usual out-of-memory held job.
For more information, see this Slack thread: https://lsstc.slack.com/archives/C01FBUGM2CV/p1619448184478600
I will attach a BPS config file that can be used to reproduce the problem after creating the ticket.
Greg Daues sent cgroup and HTCondor logs to the HTCondor folks. It looks like HTCondor may not be including all of the memory usage reported by cgroups.
DM-28653 included an attempt to tell HTCondor to put jobs that die with signal 7 on hold (so they behave like other over-memory jobs). We are waiting to see whether this works correctly in real runs. This, of course, doesn't help with the jobs that hang.
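
For illustration only, here is a rough sketch of what a "hold on signal 7" policy could look like. This is not the actual DM-28653 change, which is not shown in this ticket. It assumes the standard HTCondor submit commands `on_exit_hold` and `on_exit_hold_reason` and the job ClassAd attributes `ExitBySignal` and `ExitSignal`. The helper names `submit_hold_commands` and `should_hold` are hypothetical. On x86-64 Linux, signal 7 is SIGBUS.

```python
# Hypothetical sketch of a "hold jobs killed by signal 7" policy.
# The real DM-28653 implementation may differ; names here are illustrative.

HOLD_SIGNAL = 7  # SIGBUS on x86-64 Linux


def submit_hold_commands(signal=HOLD_SIGNAL):
    """Return HTCondor submit commands that hold jobs killed by `signal`.

    The values are ClassAd expressions, written the way they would
    appear on the right-hand side of a submit description file.
    """
    return {
        "on_exit_hold": f"(ExitBySignal == true) && (ExitSignal == {signal})",
        "on_exit_hold_reason": f'"Job died with signal {signal}; possible memory overrun"',
    }


def should_hold(exit_by_signal, exit_signal, signal=HOLD_SIGNAL):
    """Mirror the on_exit_hold expression in plain Python."""
    return bool(exit_by_signal) and exit_signal == signal


if __name__ == "__main__":
    # Print the commands in submit-file form.
    for key, value in submit_hold_commands().items():
        print(f"{key} = {value}")
```

A policy like this only fires when the job actually exits, so it cannot catch the jobs that hang, which is the problem described in this ticket.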