Data Management / DM-10183

Investigate why maxtasksperchild=1 causes mosaic.py to hang on pybind11 stack

    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: meas_mosaic
    • Labels: None

      Description

      mosaic.py uses multiprocessing.Pool to read catalogs on multiple cores. When this pool is initialized with maxtasksperchild=1, mosaic.py hangs indefinitely, and the hang is reproducible: repeated runs with the same arguments freeze at the same point. This affects only the pybind11 version of the stack; it does not occur in the HSC stack, which is currently still wrapped with swig. The underlying cause should be investigated to rule out a deeper issue that could cause problems with parallelization elsewhere.
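
For readers unfamiliar with the option, here is a minimal sketch of what maxtasksperchild=1 changes, assuming Python 3. It is not taken from mosaic.py; the names task_pid and pids_for_tasks are hypothetical. Each task reports the PID of the worker that ran it, so one can see that workers are replaced after every task when the option is set.

```python
import multiprocessing
import os


def task_pid(_):
    """Return the PID of the worker process that ran this task."""
    return os.getpid()


def pids_for_tasks(num_tasks, processes=2, maxtasksperchild=None, timeout=60):
    """Run num_tasks trivial tasks and return the worker PID seen by each.

    Hypothetical helper for illustration only; it is not part of meas_mosaic.
    """
    # The context manager terminates the pool on exit, so a timeout in get()
    # does not leave worker processes behind.
    with multiprocessing.Pool(processes=processes,
                              maxtasksperchild=maxtasksperchild) as pool:
        # chunksize=1 makes every item its own task; the pool counts
        # chunks, not items, against maxtasksperchild.
        result = pool.map_async(task_pid, range(num_tasks), chunksize=1)
        return result.get(timeout)
```

Without the option, the PIDs returned should come from at most `processes` distinct workers; with maxtasksperchild=1, a worker exits after one task and a fresh process is started for the next one, which is the start-up and shutdown path this ticket reports hanging under pybind11.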

            Activity

            swinbank John Swinbank added a comment -

            Almost certain this is a duplicate of DM-10834. Let's leave it open just for the purposes of confirming that; i.e., when DM-10834 is done, we should try re-running mosaic.py and check that it's fixed.

            price Paul Price added a comment -

            As John Swinbank suspected, with DM-10834 fixed and DM-10161 reverted, this works just fine:

            pprice@tiger-sumire:/tigress/pprice/dm-10183 $ mosaic.py /tigress/HSC/users/price/bell/DATA/rerun/20170630/m81 --output /tigress/pprice/dm-10183/DATA -c allowMixedFilters=True --numCoresForRead=12 --diagnostics --diagDir=/tigress/pprice/dm-10183/diag --id tract=0 field=M81_1^M81_2^M81_F1^M81_F2^M81_F3^M81_F4 filter=HSC-I^HSC-I2 ccd=0..8^10..103 --no-versions
            [...]
            solveMosaic_CCD: 3th iteration calcChi2: 1.896429e-04 1.899754e-04
            solveMosaic_CCD: 3th iteration matched: 0.063 (arcsec) sources: 0.011 (arcsec)
            nreject = 699
            nreject = 571
            Mosaic INFO: Write New WCS ...
            Mosaic INFO: Output WCS Diagnostic Figures...
            [...]
            fluxFitAbsolute CCD 100: 1.00563
            fluxFitAbsolute CCD 101: 1.00029
            fluxFitAbsolute CCD 102: 1.00601
            fluxFitAbsolute CCD 103: 1.00048
            Mosaic INFO: Write Fcr ...
            Mosaic INFO: Output Flux Diagnostic Figures...
            /tigress/pprice/dm-10183/meas_mosaic/python/lsst/meas/mosaic/utils.py:827: RuntimeWarning: invalid value encountered in divide
              ra /= numbers
            /tigress/pprice/dm-10183/meas_mosaic/python/lsst/meas/mosaic/utils.py:828: RuntimeWarning: invalid value encountered in divide
              dec /= numbers
            /tigress/pprice/dm-10183/meas_mosaic/python/lsst/meas/mosaic/utils.py:829: RuntimeWarning: invalid value encountered in divide
              mag /= err
            /tigress/pprice/dm-10183/meas_mosaic/python/lsst/meas/mosaic/utils.py:830: RuntimeWarning: invalid value encountered in sqrt
              err = numpy.sqrt((var - mag*mag*err)/err)
            pprice@tiger-sumire:/tigress/pprice/dm-10183 $ 
            

            The code on the ticket branch of meas_mosaic is a revert of DM-10161. Tim Morton, could you please sign off on this?

            pprice@tiger-sumire:/tigress/pprice/dm-10183/meas_mosaic (tickets/DM-10183=) $ git sub-patch
            commit 1270130835c593a220a1d8a825b764222a4159a2
            Author: Paul Price <price@astro.princeton.edu>
            Date:   Tue Jul 18 17:33:44 2017 -0400
             
                Revert "Remove maxtasksperchild=1 from pool initialization"
                
                This reverts commit 6aa1fbc9c1675aa82044fa92094be8d83f9eca46.
                
                With DM-10834 fixed, this workaround isn't necessary any more.
             
            diff --git a/python/lsst/meas/mosaic/mosaicTask.py b/python/lsst/meas/mosaic/mosaicTask.py
            index cc7b662..1f4b30c 100644
            --- a/python/lsst/meas/mosaic/mosaicTask.py
            +++ b/python/lsst/meas/mosaic/mosaicTask.py
            @@ -547,7 +547,7 @@ class MosaicTask(pipeBase.CmdLineTask):
                         params.append((sourceReader, dataRef))
             
                     if numCoresForReadSource > 1:
            -            pool = multiprocessing.Pool(processes=numCoresForReadSource)
            +            pool = multiprocessing.Pool(processes=numCoresForReadSource, maxtasksperchild=1)
                         worker = Worker()
                         resultList = pool.map_async(worker, params).get(readTimeout)
                         pool.close()
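
The pattern in the diff above (a callable worker object passed to map_async, then get() with a timeout) can be sketched as follows, assuming Python 3. SleepWorker and run_with_timeout are hypothetical stand-ins, not the Worker class or code from mosaicTask.py. The point of the timeout is that a hung worker surfaces as multiprocessing.TimeoutError rather than blocking forever.

```python
import multiprocessing
import time


class SleepWorker(object):
    """Hypothetical stand-in for the Worker class referenced in the diff.

    Instances are picklable because the class is defined at module level.
    """

    def __call__(self, args):
        delay, value = args          # each params entry is a tuple
        time.sleep(delay)            # simulate reading a catalog
        return value * 2


def run_with_timeout(params, processes=2, timeout=30.0):
    """Map SleepWorker over params, giving up after `timeout` seconds."""
    # Leaving the with-block terminates the pool, so workers that are
    # still running after a timeout are killed rather than orphaned.
    with multiprocessing.Pool(processes=processes,
                              maxtasksperchild=1) as pool:
        # get(timeout) raises multiprocessing.TimeoutError if the
        # results are not all available in time.
        return pool.map_async(SleepWorker(), params).get(timeout)
```

A caller would wrap run_with_timeout in a try/except for multiprocessing.TimeoutError and decide whether to retry or abort.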
            

            tmorton Tim Morton added a comment -

            Signing off on this, finally. Sorry for not realizing this was still waiting on me.

            price Paul Price added a comment -

            Just discovered that I hadn't merged this, and did so.


              People

              • Assignee: Paul Price (price)
              • Reporter: Tim Morton (tmorton)
              • Reviewers: Tim Morton
              • Watchers: John Swinbank, Paul Price, Tim Morton