Data Management / DM-29918

Investigate bad cgroup/HTCondor memory analysis in BPS vs. fgcmcal

    Details

      Description

      The "fgcmFitCycle" pipeline task currently (as of w_2021_17) uses a lot of memory with the configuration present in obs_subaru. Fixing that is tracked in DM-29916. The separate issue here is that, when running on BPS/HTCondor, the excess memory use appears as a hang rather than the usual out-of-memory held job.

      For more information, see this Slack thread: https://lsstc.slack.com/archives/C01FBUGM2CV/p1619448184478600

      After creating the ticket, I will attach a BPS config file that can be used to reproduce the problem.

            Activity

            mgower Michelle Gower added a comment -

            Greg Daues sent cgroup and HTCondor logs to the HTCondor folks. It looks like HTCondor may not be including all of the memory usage reported by cgroups.

            DM-28653 included an attempt to tell HTCondor to put jobs that die with signal 7 on hold (so they behave like other over-memory jobs). We are waiting to see whether this works correctly in real runs. This, of course, doesn't help with the jobs that hang.
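
            As an illustration only (this is not necessarily how DM-28653 implemented it), a hold-on-signal-7 policy can be expressed with the standard `on_exit_hold` submit command and the `ExitBySignal` / `ExitSignal` job attributes. The sketch below just renders a submit description as text. The helper `render_submit`, the executable name, and the memory request are hypothetical placeholders.

```python
from typing import Mapping

# Hold the job if it was killed by signal 7, instead of letting it leave the queue.
HOLD_ON_SIGNAL_7 = "(ExitBySignal == True) && (ExitSignal == 7)"


def render_submit(commands: Mapping[str, str]) -> str:
    """Render key/value submit commands as an HTCondor submit description.

    Hypothetical helper for illustration; it only formats text.
    """
    lines = [f"{key} = {value}" for key, value in commands.items()]
    return "\n".join(lines) + "\nqueue\n"


submit_text = render_submit(
    {
        "executable": "run_quantum.sh",  # placeholder, not from the ticket
        "request_memory": "4096",  # placeholder value, in MB
        "on_exit_hold": HOLD_ON_SIGNAL_7,
        "on_exit_hold_reason": '"Job died with signal 7 (possible memory overrun)"',
    }
)

if __name__ == "__main__":
    print(submit_text)
```

            A policy like this only applies when the job actually exits, so, as noted above, it does nothing for jobs that hang.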

            mgower Michelle Gower added a comment -

            The HTCondor 9.0.5 release notes report that the cgroups + /dev/shm bug is fixed: https://htcondor.readthedocs.io/en/v9_0/version-history/stable-release-series-90.html#version-9-0-5 It isn't clear if or when NCSA will transition to running this version of HTCondor.
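
            For anyone checking this by hand, here is a rough sketch of how to see whether shared memory is a large part of a job's cgroup usage. It assumes cgroup v1, where tmpfs pages such as files under /dev/shm are charged to the cgroup as page cache (and as `shmem` on kernels that report that field) rather than as `rss`. The sample numbers are made up for illustration, and the function names are hypothetical.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse the 'name value' lines of a cgroup v1 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def summarize(stats: dict) -> dict:
    """Compare rss alone against rss plus page cache (which includes tmpfs)."""
    # Prefer the hierarchical total_* counters when they are present.
    rss = stats.get("total_rss", stats.get("rss", 0))
    cache = stats.get("total_cache", stats.get("cache", 0))
    shmem = stats.get("total_shmem", stats.get("shmem", 0))
    return {"rss": rss, "cache": cache, "shmem": shmem, "rss_plus_cache": rss + cache}


# Made-up sample: 1 GiB of rss and 3 GiB of tmpfs-backed cache.
SAMPLE = """cache 3221225472
rss 1073741824
shmem 3221225472
"""

if __name__ == "__main__":
    for name, value in summarize(parse_memory_stat(SAMPLE)).items():
        print(f"{name}: {value / 2**30:.1f} GiB")
```

            On a real execute node the input would be the contents of the job's memory.stat file under the cgroup v1 memory hierarchy. The exact cgroup path is site-specific. A large gap between `rss` and `rss_plus_cache` would be consistent with usage that a monitor reading only `rss` would miss.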

            mgower Michelle Gower added a comment -

            NCSA is upgrading only to 8.8.15 during the September maintenance. Conversations are still ongoing about whether and when to upgrade to 9.0.5. (I don't know what state to put this ticket into, because it mostly depends on getting a much newer version of HTCondor, so I am just leaving it "In Progress" for now.)

            mgower Michelle Gower added a comment -

            Have heard that NCSA is setting up a 9.0.<latest> test system in December.

            mgower Michelle Gower added a comment -

            Because of the timeline for moving everything to SLAC, plus issues with the newer HTCondor API and munge authentication, there was a management-level decision not to update the HTCondor services at NCSA. Closing this ticket as done. If someone runs into this problem elsewhere with newer HTCondor services, please open a new ticket.


              People

              Assignee:
              mgower Michelle Gower
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Dan Taranu, Greg Daues, Jim Bosch, John Parejko, Michelle Gower
