Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-38772

PSF models in ExposureCatalogs leak memory.

    XMLWordPrintable

Details

    • Bug
    • Status: In Progress
    • Resolution: Unresolved
    • None
    • afw
    • None
    • Data Release Production
    • No

    Description

      Test processing on v24 was blowing up memory when creating a visit table, which didn't make sense. I have tracked it down to a memory leak when PIFF PSFs are stored in ExposureCatalog objects like the finalVisitSummary tables.

      from lsst.daf.butler import Butler
      from fgcm.fgcmUtilities import getMemoryString
       
      butler = Butler("/repo/main", instrument="HSC", collections=["u/mccarthy/cm/HSC_RC2/v24.1.0.rc1_test2/step2cde/group0/w00_000"])
       
      refs = list(butler.registry.queryDatasets("finalVisitSummary"))
       
      print(len(refs))
       
      print(getMemoryString("Starting to read..."))
       
      for i, ref in enumerate(refs[0: 9]):
          visitSummary = butler.get(ref)
          print(getMemoryString("Have read in %d summaries." % (i + 1)))
      

      Memory usage at Done with imports...: 229 MB current; 18182 MB peak.
      431
      Memory usage at Starting to read...: 330 MB current; 18311 MB peak.
      Memory usage at Have read in 1 summaries.: 810 MB current; 18987 MB peak.
      Memory usage at Have read in 2 summaries.: 948 MB current; 19126 MB peak.
      Memory usage at Have read in 3 summaries.: 1199 MB current; 19377 MB peak.
      Memory usage at Have read in 4 summaries.: 1270 MB current; 19448 MB peak.
      Memory usage at Have read in 5 summaries.: 1536 MB current; 19715 MB peak.
      Memory usage at Have read in 6 summaries.: 1702 MB current; 19882 MB peak.
      Memory usage at Have read in 7 summaries.: 1749 MB current; 19929 MB peak.
      Memory usage at Have read in 8 summaries.: 1877 MB current; 20057 MB peak.
      Memory usage at Have read in 9 summaries.: 2094 MB current; 20275 MB peak.
      

      Even with new PSFs that don't have the heavy footprints from HSC/runs/RC2/w_2023_11/DM-38360 there's still a leak:

      Memory usage at Starting to read...: 496 MB current; 18817 MB peak.
      Memory usage at Have read in 1 summaries.: 527 MB current; 18817 MB peak.
      Memory usage at Have read in 2 summaries.: 559 MB current; 18817 MB peak.
      Memory usage at Have read in 3 summaries.: 584 MB current; 18841 MB peak.
      Memory usage at Have read in 4 summaries.: 610 MB current; 18866 MB peak.
      Memory usage at Have read in 5 summaries.: 635 MB current; 18891 MB peak.
      Memory usage at Have read in 6 summaries.: 660 MB current; 18917 MB peak.
      Memory usage at Have read in 7 summaries.: 685 MB current; 18941 MB peak.
      Memory usage at Have read in 8 summaries.: 710 MB current; 18966 MB peak.
      Memory usage at Have read in 9 summaries.: 736 MB current; 18991 MB peak.
      Memory usage at Have read in 10 summaries.: 761 MB current; 19017 MB peak.
      

      I don't know if this is a problem with the PSFEx model as well as the PIFF model.

      Attachments

        Issue Links

          Activity

            erykoff Eli Rykoff added a comment -

            I've attached a comprehensive leak test program. This doesn't even touch PIFF, it's the trampoline code that seems to be leaking. Any time we read a PSF via the ExposureFitsReader we never release the PSF memory, even if it's stored in in a calexp (ugh).

            Meanwhile, the same problem is there for the SummaryStats which uses the same trampoline. It just doesn't leak as much memory because it's small!

            erykoff Eli Rykoff added a comment - I've attached a comprehensive leak test program. This doesn't even touch PIFF, it's the trampoline code that seems to be leaking. Any time we read a PSF via the ExposureFitsReader we never release the PSF memory, even if it's stored in in a calexp (ugh). Meanwhile, the same problem is there for the SummaryStats which uses the same trampoline. It just doesn't leak as much memory because it's small!
            jbosch Jim Bosch added a comment -

            My preferred approach to this is now to switch us to the long-lived "smart holder" branch of pybind11, details here: https://lsstc.slack.com/archives/C2K9RHYTV/p1682362907354499

             

            jbosch Jim Bosch added a comment - My preferred approach to this is now to switch us to the long-lived "smart holder" branch of pybind11, details here: https://lsstc.slack.com/archives/C2K9RHYTV/p1682362907354499  
            jbosch Jim Bosch added a comment -

            Status report, first the good news:

            • I've got a fork of pybind11 with an lsst-dev branch that at least works when set up in-place (this is more convenient for my workflow; I just haven't checked to see if install works out-of-the-box via eupspkg recognizing the CMake scripts).  This is in the "lsst" GitHub repo.  The "lsst-dm" pybind11 appears to be a very old fork of the master branch only, done by Pim back when we were evaluating pybind11.  We should probably remove that.  There is also a legacy package with the pre-conda tarball wrappers.
            • I've got the stack building up to afw with the new branch, including having the cpputils tests that previously requires the leaky PySharedPtr hack pass, without PySharedPtr.  I have not tested for memory leaks with the new approach, but I'm not that worried about them.
            • I'm doing this in "progressive" mode, which means we do have to remove all explicit shared_ptr (or unique_ptr) holders in our class_ declarations.  That's been pretty easy to grep for, and even better, when you miss one, you get a compile error (which is moderately readable, at least by C++ compiler-error standards).

            And then the bad news:

            • Converting afw is currently stalled by some wrappers not working as expected.  For afw::table::Catalog types only, methods that return objects by value are failing at runtime to convert those return values, because apparently some deep-within-pybind11 logic that's supposed to check whether there's a move constructor is not working on just these types, in just these contexts. I have not been able to reproduce what's going on in more simple contexts, and there's a somewhat worrying "// SMART_HOLDER_WIP: Needs refactoring of existing pybind11 code." comment next to some of this logic.  I'm still working on this, but it's at least an indication that the pybind11 smart_holder branch may not be solid as we'd hoped.
            • It appears that Google (whose engineers were responsible for the smart_holder branch) have decided to fork pybind11 as well, into pywrapcc.   The history there may explain why smart_holder hasn't been merged back to the pybind11 master branch, and I don't know what it means for the future of smart_holder updates - though as of right now it's only one commit behind master (as well as many commits ahead, of course).  pywrapcc emphatically advertises itself as unstable right now, so depending on it instead seems a bit dicy, but it's not like depending on a pybind11 branch is a great state to be in, either.
            jbosch Jim Bosch added a comment - Status report, first the good news: I've got a fork of pybind11 with an lsst-dev branch that at least works when set up in-place (this is more convenient for my workflow; I just haven't checked to see if install works out-of-the-box via eupspkg recognizing the CMake scripts).  This is in the "lsst" GitHub repo.  The "lsst-dm" pybind11 appears to be a very old fork of the master branch only, done by Pim back when we were evaluating pybind11.  We should probably remove that.  There is also a legacy package with the pre-conda tarball wrappers. I've got the stack building up to afw with the new branch, including having the cpputils tests that previously requires the leaky PySharedPtr hack pass, without PySharedPtr.  I have not tested for memory leaks with the new approach, but I'm not that worried about them. I'm doing this in "progressive" mode, which means we do have to remove all explicit shared_ptr (or unique_ptr) holders in our  class_ declarations.  That's been pretty easy to grep for, and even better, when you miss one, you get a compile error (which is moderately readable, at least by C++ compiler-error standards). And then the bad news: Converting afw is currently stalled by some wrappers not working as expected.  For  afw::table::Catalog types only , methods that return objects by value are failing at runtime to convert those return values, because apparently some deep-within-pybind11 logic that's supposed to check whether there's a move constructor is not working on just these types , in just these contexts . I have not been able to reproduce what's going on in more simple contexts, and there's a somewhat worrying "// SMART_HOLDER_WIP: Needs refactoring of existing pybind11 code." comment next to some of this logic.  I'm still working on this, but it's at least an indication that the pybind11 smart_holder branch may not be solid as we'd hoped. It appears that Google (whose engineers were responsible for the smart_holder branch) have decided to fork pybind11 as well, into pywrapcc .   The history there may explain why smart_holder hasn't been merged back to the pybind11 master branch, and I don't know what it means for the future of smart_holder updates - though as of right now it's only one commit behind master (as well as many commits ahead, of course).  pywrapcc emphatically advertises itself as unstable right now, so depending on it instead seems a bit dicy, but it's not like depending on a pybind11 branch is a great state to be in, either.
            jbosch Jim Bosch added a comment -

            I've managed to patch the new pybind11 branch to get afw compiling.  Good news, but not enough to make me thrilled to be on this path.

            jbosch Jim Bosch added a comment - I've managed to patch the new pybind11 branch to get afw compiling.  Good news, but not enough to make me thrilled to be on this path.

            People

              jbosch Jim Bosch
              erykoff Eli Rykoff
              Eli Rykoff, Fabio Hernandez, Jim Bosch, Joshua Meyers, Krzysztof Findeisen, Quentin Le Boulc'h, Tim Jenness, Yusra AlSayyad
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:

                Jenkins

                  No builds found.