  1. Data Management
  2. DM-20024

BackgroundList.readFits doesn't close fits files


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Story Points:
      3
    • Sprint:
      DRP S19-6b
    • Team:
      Data Release Production

      Description

      Both Hiroyuki Ikeda and I have encountered some hard-to-reproduce errors in the background application stage of coaddDriver.py, which look like:

      [31] Traceback (most recent call last):
      [31]   File "/ana/products.7.4/stack/miniconda3-4.5.12-1172c30/Linux64/ctrl_pool/7.0-hsc/python/lsst/ctrl/pool/parallel.py", line 509, in logOperation
      [31]     yield
      [31]   File "/ana/products.7.4/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_drivers/7.4-hsc/python/lsst/pipe/drivers/coaddDriver.py", line 262, in warp
      [31]     self.makeCoaddTempExp.runDataRef(patchRef, selectDataList)
      [31]   File "/ana/products.7.4/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_base/7.0-hsc/python/lsst/pipe/base/timer.py", line 150, in wrapper
      [31]     res = func(self, *args, **keyArgs)
      [31]   File "/ana/products.7.4/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_tasks/7.0-hsc/python/lsst/pipe/tasks/makeCoaddTempExp.py", line 345, in runDataRef
      [31]     self.applySkyCorr(calExpRef, calExp)
      [31]   File "/ana/products.7.4/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_tasks/7.0-hsc/python/lsst/pipe/tasks/makeCoaddTempExp.py", line 555, in applySkyCorr
      [31]     calexp -= bg.getImage()
      [31] TypeError: __isub__(): incompatible function arguments. The following argument types are supported:
      [31]     1. (self: lsst.afw.image.maskedImage.maskedImage.MaskedImageF, arg0: float) -> lsst.afw.image.maskedImage.maskedImage.MaskedImageF
      [31]     2. (self: lsst.afw.image.maskedImage.maskedImage.MaskedImageF, arg0: lsst.afw.image.maskedImage.maskedImage.MaskedImageF) -> lsst.afw.image.maskedImage.maskedImage.MaskedImageF
      [31]     3. (self: lsst.afw.image.maskedImage.maskedImage.MaskedImageF, arg0: lsst.afw.image.image.image.ImageF) -> lsst.afw.image.maskedImage.maskedImage.MaskedImageF
      [31]     4. (self: lsst.afw.image.maskedImage.maskedImage.MaskedImageF, arg0: lsst::afw::math::Function2<double>) -> lsst.afw.image.maskedImage.maskedImage.MaskedImageF
      [31]
      

      Whenever I isolated the patch that failed and reran it, it would infuriatingly succeed. At first I thought these were transient GPFS errors, but they only appear when reading backgrounds.

      Jim Bosch pointed me to the line that eats the Fits error: https://github.com/lsst/afw/blob/master/python/lsst/afw/math/backgroundList.py#L185
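
For illustration, the failure mode can be sketched as below. This is a minimal stand-in, not the afw code: `EndOfFile`, `read_hdu`, `read_all_swallowing`, and `read_all_reraising` are hypothetical names, and a plain `OSError` stands in for `FitsError`. A reader that treats every exception as "no more HDUs" returns an empty result when the very first open fails, so the real error only shows up later as an unrelated failure downstream.

```python
class EndOfFile(Exception):
    """Hypothetical marker for 'no more HDUs', distinct from real I/O errors."""


def read_hdu(index, n_hdus, fail_on_open=False):
    # Stand-in for reading one HDU; this is not an afw or cfitsio call.
    if fail_on_open:
        raise OSError("attempt to open too many files")
    if index >= n_hdus:
        raise EndOfFile(index)
    return "hdu-%d" % index


def read_all_swallowing(n_hdus, fail_on_open=False):
    # Treats *any* exception as end-of-file, so open failures vanish silently.
    layers = []
    index = 0
    while True:
        try:
            layers.append(read_hdu(index, n_hdus, fail_on_open))
        except Exception:
            break
        index += 1
    return layers


def read_all_reraising(n_hdus, fail_on_open=False):
    # Only the expected end-of-file condition ends the loop;
    # every other error propagates to the caller.
    layers = []
    index = 0
    while True:
        try:
            layers.append(read_hdu(index, n_hdus, fail_on_open))
        except EndOfFile:
            break
        index += 1
    return layers
```

With `fail_on_open=True`, the swallowing variant returns an empty list while the re-raising variant surfaces the `OSError` at the point of failure.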

      Setting up a loop to read background files and re-raising the FitsError eventually yielded:

      > /home/yusra/lsst_devel/LSST/DMS/afw/python/lsst/afw/math/backgroundList.py(191)readFits()
      -> break
      (Pdb) e
      FitsError('cfitsio error: attempt to open too many files (103) : Opening file '/datasets/hsc/repo/rerun/DM-13666/WIDE/01052/HSC-G/corr/BKGD-0011602-073.fits' with mode 'r'
      cfitsio error stack:
        failed to find or open the following file: (ffopen)
        /datasets/hsc/repo/rerun/DM-13666/WIDE/01052/HSC-G/corr/BKGD-0011602-073.fits
      ')
      

      Bingo.

      BackgroundList needs to close its FITS files after reading them and constructing the BackgroundList.
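
As a generic illustration of the pattern (plain Python files rather than afw `Fits` objects; `read_first_bytes` and `read_many` are hypothetical helpers), scoping each handle to a `with` block guarantees it is closed before the next file is opened, so only one file is open at a time no matter how many are read. This shows the general idea only and does not claim to be the fix for `BackgroundList`.

```python
import os
import tempfile


def read_first_bytes(path, n=4):
    # The handle is closed when the with-block exits, even if read() raises.
    with open(path, "rb") as handle:
        return handle.read(n)


def read_many(paths):
    # At most one file is open at any time, regardless of len(paths).
    return [read_first_bytes(p) for p in paths]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(2000):
            path = os.path.join(tmp, "bkgd-%04d.bin" % i)
            with open(path, "wb") as out:
                out.write(b"DATA")
            paths.append(path)
        # Reads more files than a typical per-process open-file limit allows at once.
        print(len(read_many(paths)))
```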

      (SPs include not only the time to fix but also the time spent scratching my head today and during the deblender sprint.)

            Activity

            yusra Yusra AlSayyad added a comment -

            Explicitly closing the Fits objects didn't help. `tickets/yusra/DM-20024` reraises the FitsError. But I'm stumped.

            Snippet to reproduce on lsst_dev. It gets to N=999 and fails on N=1000.

             
            import pdb

            import numpy as np
            import pandas
            import lsst.daf.persistence as dafPersist

            root = "/datasets/hsc/repo/rerun/DM-13666/WIDE"
            butler = dafPersist.Butler(root)

            # DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0
            # keeps the same column labels, so df[1] selects the same column as before.
            df = pandas.read_csv('/project/yusra/background-HSC-Y/per_ccd/HSC-G/g_visit.csv',
                                 header=None, index_col=0)
            visitList = df[1].values.tolist()

            j = 0  # number of backgrounds read successfully so far
            for visit in visitList:
                # CCDs 0-8 and 10-103 (CCD 9 is skipped).
                for ccd in np.concatenate((np.arange(0, 9), np.arange(10, 104))):
                    try:
                        bg = butler.get("calexpBackground", visit=int(visit), ccd=int(ccd), immediate=True)
                        if bg.getImage() is None:
                            # An empty background means the read failed silently.
                            pdb.set_trace()
                        print(j, visit, ccd)
                        j += 1
                    except Exception as e:
                        print(e)
                        if ccd == 0:
                            # if the first ccd of the visit isn't there,
                            # the whole visit probably isn't.
                            break
                        continue
             

            Jim is working on a solution on `u/jbosch/DM-20024`

            jbosch Jim Bosch added a comment -

            PR is https://github.com/lsst/afw/pull/469.
            jbosch Jim Bosch added a comment -

            I've created and linked DM-20027 to follow up on the original file handle leak at lower priority.

            yusra Yusra AlSayyad added a comment -

            Works. Looks good. So much more readable than before. Thank you.

            jbosch Jim Bosch added a comment -

            Merged to master after successful Jenkins ci_hsc run.


              People

              Assignee:
              yusra Yusra AlSayyad
              Reporter:
              yusra Yusra AlSayyad
              Reviewers:
              Yusra AlSayyad
              Watchers:
              Jim Bosch, Yusra AlSayyad