Details
Type: Story
Status: Done
Resolution: Done
Fix Version/s: None
Component/s: None
Labels: None
Story Points: 3
Epic Link:
Sprint: DRP S19-6b
Team: Data Release Production
Description
Both Hiroyuki Ikeda and I have encountered some difficult-to-reproduce errors in the background application stage of coaddDriver.py, which look like:
[31] Traceback (most recent call last):
[31]   File "/ana/products.7.4/stack/miniconda3-4.5.12-1172c30/Linux64/ctrl_pool/7.0-hsc/python/lsst/ctrl/pool/parallel.py", line 509, in logOperation
[31]     yield
[31]   File "/ana/products.7.4/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_drivers/7.4-hsc/python/lsst/pipe/drivers/coaddDriver.py", line 262, in warp
[31]     self.makeCoaddTempExp.runDataRef(patchRef, selectDataList)
[31]   File "/ana/products.7.4/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_base/7.0-hsc/python/lsst/pipe/base/timer.py", line 150, in wrapper
[31]     res = func(self, *args, **keyArgs)
[31]   File "/ana/products.7.4/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_tasks/7.0-hsc/python/lsst/pipe/tasks/makeCoaddTempExp.py", line 345, in runDataRef
[31]     self.applySkyCorr(calExpRef, calExp)
[31]   File "/ana/products.7.4/stack/miniconda3-4.5.12-1172c30/Linux64/pipe_tasks/7.0-hsc/python/lsst/pipe/tasks/makeCoaddTempExp.py", line 555, in applySkyCorr
[31]     calexp -= bg.getImage()
[31] TypeError: __isub__(): incompatible function arguments. The following argument types are supported:
[31]     1. (self: lsst.afw.image.maskedImage.maskedImage.MaskedImageF, arg0: float) -> lsst.afw.image.maskedImage.maskedImage.MaskedImageF
[31]     2. (self: lsst.afw.image.maskedImage.maskedImage.MaskedImageF, arg0: lsst.afw.image.maskedImage.maskedImage.MaskedImageF) -> lsst.afw.image.maskedImage.maskedImage.MaskedImageF
[31]     3. (self: lsst.afw.image.maskedImage.maskedImage.MaskedImageF, arg0: lsst.afw.image.image.image.ImageF) -> lsst.afw.image.maskedImage.maskedImage.MaskedImageF
[31]     4. (self: lsst.afw.image.maskedImage.maskedImage.MaskedImageF, arg0: lsst::afw::math::Function2<double>) -> lsst.afw.image.maskedImage.maskedImage.MaskedImageF
[31]
Whenever I isolated the patch that failed and reran it, it would infuriatingly succeed. At first I thought these were transient GPFS errors, but the failure only appears when reading backgrounds.
Jim Bosch pointed me to the line that eats the Fits error: https://github.com/lsst/afw/blob/master/python/lsst/afw/math/backgroundList.py#L185
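For anyone following along, here is a minimal pure-Python sketch of the pattern as I understand it. This is not the afw code: `FakeFitsError`, `read_hdu`, `read_swallowing` and `read_strict` are all hypothetical stand-ins, and the assumption is that the reader keeps reading HDUs until an error is raised and treats any error as end-of-file. Under that assumption, a failure to open the file at all looks the same as a file with no backgrounds in it, which would plausibly explain why the problem only surfaces later as a TypeError.

```python
class FakeFitsError(Exception):
    """Hypothetical stand-in for the real FitsError."""


def read_hdu(hdus, index, fail_on_open=False):
    """Pretend to read one HDU from a file represented by a plain list."""
    if fail_on_open:
        # Simulates the file failing to open (e.g. too many open files).
        raise FakeFitsError("attempt to open too many files")
    if index >= len(hdus):
        # Simulates running off the end of the file.
        raise FakeFitsError("tried to move past end of file")
    return hdus[index]


def read_swallowing(hdus, fail_on_open=False):
    """Read until any error occurs and silently treat it as end-of-file."""
    result = []
    index = 0
    while True:
        try:
            result.append(read_hdu(hdus, index, fail_on_open))
        except FakeFitsError:
            break  # the real cause of the error is lost here
        index += 1
    return result


def read_strict(hdus, fail_on_open=False):
    """Read a known number of HDUs and let any error propagate."""
    return [read_hdu(hdus, i, fail_on_open) for i in range(len(hdus))]
```

With `fail_on_open=True`, `read_swallowing` returns an empty list without complaint, while `read_strict` raises at the point of failure.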
Setting a loop to read background files and re-raising the FitsError eventually yielded:
> /home/yusra/lsst_devel/LSST/DMS/afw/python/lsst/afw/math/backgroundList.py(191)readFits()
-> break
(Pdb) e
FitsError('cfitsio error: attempt to open too many files (103) : Opening file '/datasets/hsc/repo/rerun/DM-13666/WIDE/01052/HSC-G/corr/BKGD-0011602-073.fits' with mode 'r'
cfitsio error stack:
failed to find or open the following file: (ffopen)
/datasets/hsc/repo/rerun/DM-13666/WIDE/01052/HSC-G/corr/BKGD-0011602-073.fits
')
Bingo.
BackgroundList needs to close its FITS files once it has finished reading them and constructing itself.
(SPs include not only the time to fix this but also the time spent scratching my head today and during the deblender sprint.)
Issue Links
- is triggering: DM-20027 Apparent file handle leaks in Image FITS reading (Won't Fix)
Explicitly closing the Fits objects didn't help. `tickets/yusra/DM-20024` re-raises the FitsError. But I'm stumped.
Snippet to reproduce on lsst_dev. It'll get to N=999 and fail on N=1000.
pdb.set_trace()
Jim is working on a solution on `u/jbosch/DM-20024`