Fix Version/s: None
The "fgcmFitCycle" pipeline task currently (as of w_2021_17) uses a lot of memory in the configuration that's present in obs_subaru. Fixing the memory usage is tracked in
DM-29916, but a separate issue is that the problem shows up as a hang when running on BPS/HTCondor, rather than as the usual out-of-memory held job.
For more information, see this Slack thread: https://lsstc.slack.com/archives/C01FBUGM2CV/p1619448184478600
I will attach a BPS config file that can be used to reproduce the problem after creating the ticket.
HTCondor 9.0.5 reportedly fixes the cgroups + /dev/shm bug: https://htcondor.readthedocs.io/en/v9_0/version-history/stable-release-series-90.html#version-9-0-5 It isn't clear if or when NCSA will transition to this version of HTCondor.
NCSA is upgrading only to 8.8.15 during the September maintenance. Conversations are still ongoing about whether and when to upgrade to 9.0.5. (I don't know what state to put this ticket into, because it mostly depends on getting a much newer version of HTCondor, so I'm leaving it "In Progress" for now.)
I have heard that NCSA is setting up a 9.0.<latest> test system in December.
Because of the timeline of moving everything to SLAC plus issues with the newer HTCondor API and munge authentication, there was a management-level decision to not update the HTCondor services at NCSA. Closing this ticket as done. If someone runs into this problem running elsewhere using newer HTCondor services, please open a new ticket.
Greg Daues sent cgroup and HTCondor logs to the HTCondor folks. It looks like HTCondor may not be including all the memory usage reported by cgroups.
DM-28653 included an attempt to tell HTCondor to put jobs that die with signal 7 on hold (so they behave like other over-memory jobs). We are waiting to see whether this works correctly in real runs. This, of course, doesn't help with the jobs that hang.
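For reference, here is a minimal sketch of what a "hold on signal 7" rule might look like as HTCondor submit commands. It is not the actual change made in DM-28653, which is not reproduced in this ticket. It assumes the standard `on_exit_hold` and `on_exit_hold_reason` submit commands and the `ExitBySignal` / `ExitSignal` job ClassAd attributes. The helper names `hold_on_signal_fragment` and `render_submit` are hypothetical, and the hold-reason wording is made up for illustration.

```python
def hold_on_signal_fragment(signum: int) -> dict:
    """Build submit commands that hold a job killed by signal `signum`.

    Hypothetical helper: it only assembles text and does not talk to HTCondor.
    """
    # ExitBySignal is true when the job was terminated by a signal;
    # ExitSignal then carries the signal number.
    expr = f"ExitBySignal && ExitSignal == {signum}"
    # The hold reason is a ClassAd string expression, hence the inner quotes.
    reason = f'"Job died with signal {signum}; possibly over memory"'
    return {"on_exit_hold": expr, "on_exit_hold_reason": reason}


def render_submit(commands: dict) -> str:
    """Render the commands as 'key = value' lines of a submit description."""
    return "\n".join(f"{key} = {value}" for key, value in commands.items())


# Signal 7 is the signal number mentioned in this ticket.
fragment = hold_on_signal_fragment(7)
submit_text = render_submit(fragment)
print(submit_text)
```

How such lines would be passed through BPS depends on the BPS configuration in use and is not shown here. As noted above, a rule like this only affects jobs that actually exit on the signal, not jobs that hang.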