Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-13854

Investigate PSF cache misses

    Details

    • Team:
      External

      Description

      Profiling reveals that Psf::computeKernelImage is the tall pole in measureCoaddSources.py, taking 47.5% of the runtime (much of this is due to inefficiencies in AST; DM-13847). The calling graphs for Psf::computeKernelImage and Psf::computeShape (which calls Psf::computeKernelImage) show multiple callers with significant contributions, which suggests that the caching functionality in Psf is not as effective as we desire. Investigate these cache misses and devise a plan for improving the cache hits.

                  0.0  ..........        0.01 / 0.36          lsst::afw::detection::Psf::getLocalKernel(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color) const [2074]
                  0.0  ..........        0.13 / 32.09         lsst::meas::modelfit::DoubleShapeletPsfApproxAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [347]
                  7.1  ..........      745.86 / 752.92        lsst::meas::extensions::shapeHSM::HsmPsfMomentsAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [160]
                  7.3  ..........      772.18 / 775.05        lsst::afw::detection::Psf::doComputeImage(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [156]
                 33.1  ..........    3'482.05 / 3'495.95      lsst::meas::algorithms::ImagePsf::doComputeShape(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [73]
      [67]       47.5    5'000.21        0.07 / 5'000.14    lsst::afw::detection::Psf::computeKernelImage(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color, lsst::afw::detection::Psf::ImageOwnerEnum) const
                 47.5  ..........    4'999.65 / 4'999.65      lsst::meas::algorithms::CoaddPsf::doComputeKernelImage(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [68]
                  0.0  ..........        0.38 / 21.81         lsst::afw::image::Image<double>::Image(lsst::afw::image::Image<double> const&, bool) [411]
                  0.0  ..........        0.10 / 3.91          lsst::afw::image::Image<double>::~Image() [759]
                  0.0  ..........        0.01 / 136.99        operator new(unsigned long) [229]
                  0.0  ..........        0.01 / 0.05          std::_Sp_counted_ptr_inplace<lsst::afw::image::Image<double>, std::allocator<lsst::afw::image::Image<double> >, (__gnu_cxx::_Lock_policy)2>::_M_destroy() [3651]
       
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
       
                  2.5  ..........      263.99 / 877.79        lsst::meas::extensions::photometryKron::KronFluxAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [143]
                  5.6  ..........      587.63 / 587.69        lsst::meas::extensions::photometryKron::calculatePsfKronRadius(std::shared_ptr<lsst::afw::detection::Psf const> const&, lsst::afw::geom::Point<double, 2> const&, double) [171]
                  7.2  ..........      757.15 / 778.02        lsst::meas::base::SdssShapeAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [154]
                  7.2  ..........      760.64 / 763.68        lsst::meas::base::LocalBackgroundAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [158]
                  7.4  ..........      774.28 / 799.98        lsst::meas::base::SdssCentroidAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [151]
      [72]       33.2    3'495.98        0.03 / 3'495.95    lsst::afw::detection::Psf::computeShape(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color) const
                 33.2  ..........    3'495.95 / 3'495.95      lsst::meas::algorithms::ImagePsf::doComputeShape(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [73]
      

        Attachments

          Issue Links

            Activity

            price Paul Price created issue -
            price Paul Price made changes -
            Field Original Value New Value
            Description Profiling reveals that {{Psf::computeKernelImage}} is the tall pole in {{measureCoaddSources.py}}, taking 47.5% of the runtime (much of this is due to inefficiencies in AST; DM-13847). The calling graphs for {{Psf::computeKernelImage}} and {{Psf::computeShape}} (which calls {{Psf::computeKernelImage}}) show multiple callers with significant contributions, which suggests that the caching functionality in {{Psf}} is not as effective as we desire. Investigate these cache misses and devise a plan for improving the cache hits.

            {code}
                        0.0 .......... 0.01 / 0.36 lsst::afw::detection::Psf::getLocalKernel(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color) const [2074]
                        0.0 .......... 0.13 / 32.09 lsst::meas::modelfit::DoubleShapeletPsfApproxAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [347]
                        7.1 .......... 745.86 / 752.92 lsst::meas::extensions::shapeHSM::HsmPsfMomentsAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [160]
                        7.3 .......... 772.18 / 775.05 lsst::afw::detection::Psf::doComputeImage(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [156]
                       33.1 .......... 3'482.05 / 3'495.95 lsst::meas::algorithms::ImagePsf::doComputeShape(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [73]
            [67] 47.5 5'000.21 0.07 / 5'000.14 lsst::afw::detection::Psf::computeKernelImage(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color, lsst::afw::detection::Psf::ImageOwnerEnum) const
                       47.5 .......... 4'999.65 / 4'999.65 lsst::meas::algorithms::CoaddPsf::doComputeKernelImage(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [68]
                        0.0 .......... 0.38 / 21.81 lsst::afw::image::Image<double>::Image(lsst::afw::image::Image<double> const&, bool) [411]
                        0.0 .......... 0.10 / 3.91 lsst::afw::image::Image<double>::~Image() [759]
                        0.0 .......... 0.01 / 136.99 operator new(unsigned long) [229]
                        0.0 .......... 0.01 / 0.05 std::_Sp_counted_ptr_inplace<lsst::afw::image::Image<double>, std::allocator<lsst::afw::image::Image<double> >, (__gnu_cxx::_Lock_policy)2>::_M_destroy() [3651]

            - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

                        2.5 .......... 263.99 / 877.79 lsst::meas::extensions::photometryKron::KronFluxAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [143]
                        5.6 .......... 587.63 / 587.69 lsst::meas::extensions::photometryKron::calculatePsfKronRadius(std::shared_ptr<lsst::afw::detection::Psf const> const&, lsst::afw::geom::Point<double, 2> const&, double) [171]
                        7.2 .......... 757.15 / 778.02 lsst::meas::base::SdssShapeAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [154]
                        7.2 .......... 760.64 / 763.68 lsst::meas::base::LocalBackgroundAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [158]
                        7.4 .......... 774.28 / 799.98 lsst::meas::base::SdssCentroidAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [151]
            [72] 33.2 3'495.98 0.03 / 3'495.95 lsst::afw::detection::Psf::computeShape(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color) const```
                       33.2 .......... 3'495.95 / 3'495.95 lsst::meas::algorithms::ImagePsf::doComputeShape(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [73]
            {code}
            Profiling reveals that {{Psf::computeKernelImage}} is the tall pole in {{measureCoaddSources.py}}, taking 47.5% of the runtime (much of this is due to inefficiencies in AST; DM-13847). The calling graphs for {{Psf::computeKernelImage}} and {{Psf::computeShape}} (which calls {{Psf::computeKernelImage}}) show multiple callers with significant contributions, which suggests that the caching functionality in {{Psf}} is not as effective as we desire. Investigate these cache misses and devise a plan for improving the cache hits.

            {code}
                        0.0 .......... 0.01 / 0.36 lsst::afw::detection::Psf::getLocalKernel(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color) const [2074]
                        0.0 .......... 0.13 / 32.09 lsst::meas::modelfit::DoubleShapeletPsfApproxAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [347]
                        7.1 .......... 745.86 / 752.92 lsst::meas::extensions::shapeHSM::HsmPsfMomentsAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [160]
                        7.3 .......... 772.18 / 775.05 lsst::afw::detection::Psf::doComputeImage(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [156]
                       33.1 .......... 3'482.05 / 3'495.95 lsst::meas::algorithms::ImagePsf::doComputeShape(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [73]
            [67] 47.5 5'000.21 0.07 / 5'000.14 lsst::afw::detection::Psf::computeKernelImage(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color, lsst::afw::detection::Psf::ImageOwnerEnum) const
                       47.5 .......... 4'999.65 / 4'999.65 lsst::meas::algorithms::CoaddPsf::doComputeKernelImage(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [68]
                        0.0 .......... 0.38 / 21.81 lsst::afw::image::Image<double>::Image(lsst::afw::image::Image<double> const&, bool) [411]
                        0.0 .......... 0.10 / 3.91 lsst::afw::image::Image<double>::~Image() [759]
                        0.0 .......... 0.01 / 136.99 operator new(unsigned long) [229]
                        0.0 .......... 0.01 / 0.05 std::_Sp_counted_ptr_inplace<lsst::afw::image::Image<double>, std::allocator<lsst::afw::image::Image<double> >, (__gnu_cxx::_Lock_policy)2>::_M_destroy() [3651]

            - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

                        2.5 .......... 263.99 / 877.79 lsst::meas::extensions::photometryKron::KronFluxAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [143]
                        5.6 .......... 587.63 / 587.69 lsst::meas::extensions::photometryKron::calculatePsfKronRadius(std::shared_ptr<lsst::afw::detection::Psf const> const&, lsst::afw::geom::Point<double, 2> const&, double) [171]
                        7.2 .......... 757.15 / 778.02 lsst::meas::base::SdssShapeAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [154]
                        7.2 .......... 760.64 / 763.68 lsst::meas::base::LocalBackgroundAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [158]
                        7.4 .......... 774.28 / 799.98 lsst::meas::base::SdssCentroidAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [151]
            [72] 33.2 3'495.98 0.03 / 3'495.95 lsst::afw::detection::Psf::computeShape(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color) const
                       33.2 .......... 3'495.95 / 3'495.95 lsst::meas::algorithms::ImagePsf::doComputeShape(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [73]
            {code}
            Hide
            price Paul Price added a comment -

            There are places in the deblender that call psf.computeShape(), i.e., with no position, and so the average position is used. The PSF at the average position is often not cached, because it's not requested in succession. We should cache the PSF at the average position explicitly.

            Show
            price Paul Price added a comment - There are places in the deblender that call psf.computeShape() , i.e., with no position, and so the average position is used. The PSF at the average position is often not cached, because it's not requested in succession. We should cache the PSF at the average position explicitly.
            Hide
            price Paul Price added a comment -

            Increasing the Psf cache size from 1 to 100 decreases the runtime from 10,517 sec to 8,137 sec (22% gain), and raises ast::Mapping::_tran to the tallest pole, at 37.7%.

            Show
            price Paul Price added a comment - Increasing the Psf cache size from 1 to 100 decreases the runtime from 10,517 sec to 8,137 sec (22% gain), and raises ast::Mapping::_tran to the tallest pole, at 37.7%.
            Hide
            price Paul Price added a comment -

            Increasing the CoaddPsf cache size to 100,000 (and leaving the default Psf cache size at 100) gains us a bit more: now 7,781 sec (25% gain). ast::Mapping::_tran rises to 39.2%.

            For my particular test data set (tract=8764 patch=4,4 filter=HSC-I), there are 11,074 footprints and 25,233 sources. The following cache statistics are from my cache simulator. There is a little bit of uncertainty here because the numbers from my simulator don't match the numbers from the actual program, but are close (I believe the difference is in the floating-point precision: the positions I have saved only have maybe one decimal point, while the actual program will see more than that).

            Statistic computeKernelImage computeImage
            Total cache requests 385,306 68,952
            Unique requests 62,962 42,126
            Cache 1 223,153 21,874
            Cache 10 306,082 23,876
            Cache 100 308,118 24,003
            Cache 1000 308,146 24,005
            Cache 10000 308,146 24,005
            Cache 30000 312,588 26,826
            Cache 100000 322,344 26,826
            Show
            price Paul Price added a comment - Increasing the CoaddPsf cache size to 100,000 (and leaving the default Psf cache size at 100) gains us a bit more: now 7,781 sec (25% gain). ast::Mapping::_tran rises to 39.2%. For my particular test data set ( tract=8764 patch=4,4 filter=HSC-I ), there are 11,074 footprints and 25,233 sources. The following cache statistics are from my cache simulator. There is a little bit of uncertainty here because the numbers from my simulator don't match the numbers from the actual program, but are close (I believe the difference is in the floating-point precision: the positions I have saved only have maybe one decimal point, while the actual program will see more than that). Statistic computeKernelImage computeImage Total cache requests 385,306 68,952 Unique requests 62,962 42,126 Cache 1 223,153 21,874 Cache 10 306,082 23,876 Cache 100 308,118 24,003 Cache 1000 308,146 24,005 Cache 10000 308,146 24,005 Cache 30000 312,588 26,826 Cache 100000 322,344 26,826
            hchiang2 Hsin-Fang Chiang made changes -
            Link This issue blocks DM-13665 [ DM-13665 ]
            Hide
            price Paul Price added a comment -

            Thanks for volunteering to review this, Jim Bosch.

            price@pap-laptop:~/LSST/utils (tickets/DM-13854=) $ git sub
            commit 127b605493f62c995f0129f48801a141eb3cf5ac
            Author: Paul Price <price@astro.princeton.edu>
            Date:   Wed Mar 21 12:13:47 2018 -0400
             
                add Cache class
                
                Cache of most recently used objects, hashed by some key, with user-defined
                capacity limit.
             
             include/lsst/utils/Cache.h        | 187 ++++++++++++++++++++++++++++++++++++++
             include/lsst/utils/python/Cache.h |  39 ++++++++
             tests/SConscript                  |   4 +-
             tests/cache.cc                    |   8 ++
             tests/test_cache.py               | 131 ++++++++++++++++++++++++++
             5 files changed, 367 insertions(+), 2 deletions(-)
             
            commit e451ad11c1d72aefd79d9f3aeaeb252460938174 (HEAD -> tickets/DM-13854, origin/tickets/DM-13854)
            Author: Paul Price <price@astro.princeton.edu>
            Date:   Wed Mar 21 19:27:27 2018 -0400
             
                Cache: add compile-time debug instrumentation
                
                Allows us to characterise the cache performance.
             
             examples/simulateCache.py  | 52 +++++++++++++++++++++++++++++++++++++
             include/lsst/utils/Cache.h | 65 +++++++++++++++++++++++++++++++++++++++++++++-
             2 files changed, 116 insertions(+), 1 deletion(-)
             
             
            price@pap-laptop:~/LSST/afw (tickets/DM-13854=) $ git sub
            commit e7c6d221064eb5084d50aca9b9d6bfcb96460208
            Author: Paul Price <price@astro.princeton.edu>
            Date:   Thu Mar 22 11:51:00 2018 -0400
             
                make Point hashable
                
                Allows its use as a key in lsst::utils::Cache.
             
             include/lsst/afw/geom/Point.h                   |  4 +++
             python/lsst/afw/geom/coordinates/coordinates.cc |  4 +++
             src/geom/Point.cc                               | 33 ++++++++++++++++---------
             tests/test_coordinates.py                       |  9 +++++++
             4 files changed, 38 insertions(+), 12 deletions(-)
             
            commit d85aeb25cec760584f9b0f9cdd0b48be236625cc (HEAD -> tickets/DM-13854, origin/tickets/DM-13854)
            Author: Paul Price <price@astro.princeton.edu>
            Date:   Tue Mar 20 17:10:49 2018 -0400
             
                Psf: use new lsst::utils::Cache
                
                This allows caching many more realisations (they're relatively cheap:
                50x50 pixels per realisation @ 8 bytes per pixel); the number of
                realisations is configurable.
                
                This gains about 22% in the runtime of measureCoaddSources.py using the
                default 100 cache slots for all Psf instances. Allowing CoaddPsf to use
                100000 cache slots increases the gain to about 25%.
             
             include/lsst/afw/detection/Psf.h | 45 +++++++++++++++++++++++++++++++++------
             python/lsst/afw/detection/psf.cc |  2 ++
             src/detection/Psf.cc             | 46 ++++++++++++++--------------------------
             3 files changed, 56 insertions(+), 37 deletions(-)
             
             
            price@pap-laptop:~/LSST/meas_base (tickets/DM-13854=) $ git sub
            commit 19181402b59be4fc6cd36daf34d0e8cb451ffd1a (HEAD -> tickets/DM-13854, origin/tickets/DM-13854)
            Author: Paul Price <price@astro.princeton.edu>
            Date:   Thu Mar 22 10:47:11 2018 -0400
             
                forcedPhotCoadd: add command-line PSF cache configuration
                
                The size of the PSF cache can have a significant effect upon the runtime
                for complicated PSF models. For coadds, CoaddPsf can be complicated.
             
             python/lsst/meas/base/forcedPhotCoadd.py | 9 +++++++++
             python/lsst/meas/base/forcedPhotImage.py | 6 +++++-
             2 files changed, 14 insertions(+), 1 deletion(-)
             
             
            price@pap-laptop:~/LSST/pipe_tasks (tickets/DM-13854=) $ git sub
            commit c43de5c2fb866995736cfd8eb5f7a7a2cd1dc09e (HEAD -> tickets/DM-13854, origin/tickets/DM-13854)
            Author: Paul Price <price@astro.princeton.edu>
            Date:   Wed Mar 21 23:30:37 2018 -0400
             
                measureCoaddSources: add command-line PSF cache configuration
                
                CoaddPsf is the current tall pole in running measureCoaddSources.
                Caching it is simple, inexpensive and realises a hefty gain (25%)
                in the runtime.
             
             python/lsst/pipe/tasks/multiBand.py | 13 +++++++++++--
             1 file changed, 11 insertions(+), 2 deletions(-)
            

            Show
            price Paul Price added a comment - Thanks for volunteering to review this, Jim Bosch . price@pap-laptop:~/LSST/utils (tickets/DM-13854=) $ git sub commit 127b605493f62c995f0129f48801a141eb3cf5ac Author: Paul Price <price@astro.princeton.edu> Date: Wed Mar 21 12:13:47 2018 -0400   add Cache class Cache of most recently used objects, hashed by some key, with user-defined capacity limit.   include/lsst/utils/Cache.h | 187 ++++++++++++++++++++++++++++++++++++++ include/lsst/utils/python/Cache.h | 39 ++++++++ tests/SConscript | 4 +- tests/cache.cc | 8 ++ tests/test_cache.py | 131 ++++++++++++++++++++++++++ 5 files changed, 367 insertions(+), 2 deletions(-)   commit e451ad11c1d72aefd79d9f3aeaeb252460938174 (HEAD -> tickets/DM-13854, origin/tickets/DM-13854) Author: Paul Price <price@astro.princeton.edu> Date: Wed Mar 21 19:27:27 2018 -0400   Cache: add compile-time debug instrumentation Allows us to characterise the cache performance.   examples/simulateCache.py | 52 +++++++++++++++++++++++++++++++++++++ include/lsst/utils/Cache.h | 65 +++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 116 insertions(+), 1 deletion(-)     price@pap-laptop:~/LSST/afw (tickets/DM-13854=) $ git sub commit e7c6d221064eb5084d50aca9b9d6bfcb96460208 Author: Paul Price <price@astro.princeton.edu> Date: Thu Mar 22 11:51:00 2018 -0400   make Point hashable Allows its use as a key in lsst::utils::Cache.   include/lsst/afw/geom/Point.h | 4 +++ python/lsst/afw/geom/coordinates/coordinates.cc | 4 +++ src/geom/Point.cc | 33 ++++++++++++++++--------- tests/test_coordinates.py | 9 +++++++ 4 files changed, 38 insertions(+), 12 deletions(-)   commit d85aeb25cec760584f9b0f9cdd0b48be236625cc (HEAD -> tickets/DM-13854, origin/tickets/DM-13854) Author: Paul Price <price@astro.princeton.edu> Date: Tue Mar 20 17:10:49 2018 -0400   Psf: use new lsst::utils::Cache This allows caching many more realisations (they're relatively cheap: 50x50 pixels per realisation @ 8 bytes per pixel); the number of realisations is configurable. This gains about 22% in the runtime of measureCoaddSources.py using the default 100 cache slots for all Psf instances. Allowing CoaddPsf to use 100000 cache slots increases the gain to about 25%.   include/lsst/afw/detection/Psf.h | 45 +++++++++++++++++++++++++++++++++------ python/lsst/afw/detection/psf.cc | 2 ++ src/detection/Psf.cc | 46 ++++++++++++++-------------------------- 3 files changed, 56 insertions(+), 37 deletions(-)     price@pap-laptop:~/LSST/meas_base (tickets/DM-13854=) $ git sub commit 19181402b59be4fc6cd36daf34d0e8cb451ffd1a (HEAD -> tickets/DM-13854, origin/tickets/DM-13854) Author: Paul Price <price@astro.princeton.edu> Date: Thu Mar 22 10:47:11 2018 -0400   forcedPhotCoadd: add command-line PSF cache configuration The size of the PSF cache can have a significant effect upon the runtime for complicated PSF models. For coadds, CoaddPsf can be complicated.   python/lsst/meas/base/forcedPhotCoadd.py | 9 +++++++++ python/lsst/meas/base/forcedPhotImage.py | 6 +++++- 2 files changed, 14 insertions(+), 1 deletion(-)     price@pap-laptop:~/LSST/pipe_tasks (tickets/DM-13854=) $ git sub commit c43de5c2fb866995736cfd8eb5f7a7a2cd1dc09e (HEAD -> tickets/DM-13854, origin/tickets/DM-13854) Author: Paul Price <price@astro.princeton.edu> Date: Wed Mar 21 23:30:37 2018 -0400   measureCoaddSources: add command-line PSF cache configuration CoaddPsf is the current tall pole in running measureCoaddSources. Caching it is simple, inexpensive and realises a hefty gain (25%) in the runtime.   python/lsst/pipe/tasks/multiBand.py | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-)
            price Paul Price made changes -
            Reviewers Jim Bosch [ jbosch ]
            Status To Do [ 10001 ] In Review [ 10004 ]
            Hide
            jbosch Jim Bosch added a comment - - edited

            Comments on the PRs.  Three big ones (including one new one):

            • I'd like to keep Cache.h from being included in other headers to keep it from affecting compile times.  (I doubt it'd slow down things appreciably given what else Psf.h includes, but good practices have to start somewhere).
            • We might want to try fuzzing the comparisons the computeKernelImage cache.  We should not have as many distinct evaluations as we do now.
            • 100000 evaluations seems like an awfully big cache - that's over a gig of memory for a 41x41 PSF at double-precision, right?  I don't think we can afford two caches anywhere near that size.  Given that we're basically memory-bound, a 25% performance boost in exchange for going over the 4GB/core limit a lot more often doesn't seem likely it's clearly a good trade.  I'm hoping we can get most of that performance boost with maybe 10% of the cache size, especially if we fuzz the computeKernelImage comparisons.
            Show
            jbosch Jim Bosch added a comment - - edited Comments on the PRs.  Three big ones (including one new one): I'd like to keep Cache.h from being included in other headers to keep it from affecting compile times.  (I doubt it'd slow down things appreciably given what else Psf.h includes, but good practices have to start somewhere). We might want to try fuzzing the comparisons the computeKernelImage  cache.  We should not have as many distinct evaluations as we do now. 100000 evaluations seems like an awfully big cache - that's over a gig of memory for a 41x41 PSF at double-precision, right?  I don't think we can afford two caches anywhere near that size.  Given that we're basically memory-bound, a 25% performance boost in exchange for going over the 4GB/core limit a lot more often doesn't seem likely it's clearly a good trade.  I'm hoping we can get most of that performance boost with maybe 10% of the cache size, especially if we fuzz the computeKernelImage comparisons.
            jbosch Jim Bosch made changes -
            Status In Review [ 10004 ] Reviewed [ 10101 ]
            Hide
            price Paul Price added a comment -

            Thanks for the review, Jim Bosch. I had hoped to get this in tonight, but that's not going to happen, as I'd like to go another round with you if you don't mind. Some replies to your points:

            • I put in the forward declaration. This requires small changes in an additional 4 packages in order to pull in the definitions.
            • I would like to fuzz the comparisons (I'm thinking of just int(x/precision), where precision can be user-defined), but because I'm targeting this work for the HSC production run which is already underway, I would like to avoid making any scientific changes on this ticket, no matter how small we believe their effect to be. I will put in a ticket to fuzz the comparisons that can go in separately.
            • measureCoaddSources.py doesn't use a lot of memory: top on tiger3 typically shows it using 1-and-a-bit percent of the total memory, which is a bit over a gigabyte. So though I think there's plenty of room for the cache, the table above shows rather modest gains for sizes about 100, so I backed it off to that.

            I haven't cleaned up the commits yet, but I think the end product is in decent shape so please have a look over the weekend if you're able.

            Show
            price Paul Price added a comment - Thanks for the review, Jim Bosch . I had hoped to get this in tonight, but that's not going to happen, as I'd like to go another round with you if you don't mind. Some replies to your points: I put in the forward declaration. This requires small changes in an additional 4 packages in order to pull in the definitions. I would like to fuzz the comparisons (I'm thinking of just int(x/precision) , where precision can be user-defined), but because I'm targeting this work for the HSC production run which is already underway, I would like to avoid making any scientific changes on this ticket, no matter how small we believe their effect to be. I will put in a ticket to fuzz the comparisons that can go in separately. measureCoaddSources.py doesn't use a lot of memory: top on tiger3 typically shows it using 1-and-a-bit percent of the total memory, which is a bit over a gigabyte. So though I think there's plenty of room for the cache, the table above shows rather modest gains for sizes about 100, so I backed it off to that. I haven't cleaned up the commits yet, but I think the end product is in decent shape so please have a look over the weekend if you're able.
            Hide
            price Paul Price added a comment -

            Jenkins is green.

            Show
            price Paul Price added a comment - Jenkins is green.
            Hide
            jbosch Jim Bosch added a comment -

            I put in the forward declaration. This requires small changes in an additional 4 packages in order to pull in the definitions.

            I've left a few comments on the afw PR that will hopefully get you to the point where the changes in the new packages aren't necessary.  If not, I'm happy to help more; they definitely should not be necessary.

            I would like to fuzz the comparisons (I'm thinking of just int(x/precision), where precision can be user-defined), but because I'm targeting this work for the HSC production run which is already underway, I would like to avoid making any scientific changes on this ticket, no matter how small we believe their effect to be. I will put in a ticket to fuzz the comparisons that can go in separately.

            +1

            measureCoaddSources.py doesn't use a lot of memory: top on tiger3 typically shows it using 1-and-a-bit percent of the total memory, which is a bit over a gigabyte. So though I think there's plenty of room for the cache, the table above shows rather modest gains for sizes about 100, so I backed it off to that.

             That may be true in the usual case, but if we're having to configure maximum footprint sizes or peak counts for the deblender, it can't possibly be true in the worst case.  Glad that we can get most of the benefits from a much smaller cache.

             

            Show
            jbosch Jim Bosch added a comment - I put in the forward declaration. This requires small changes in an additional 4 packages in order to pull in the definitions. I've left a few comments on the afw PR that will hopefully get you to the point where the changes in the new packages aren't necessary.  If not, I'm happy to help more; they definitely should not be necessary. I would like to fuzz the comparisons (I'm thinking of just int(x/precision), where precision can be user-defined), but because I'm targeting this work for the HSC production run which is already underway, I would like to avoid making any scientific changes on this ticket, no matter how small we believe their effect to be. I will put in a ticket to fuzz the comparisons that can go in separately. +1 measureCoaddSources.py  doesn't use a lot of memory:  top  on tiger3 typically shows it using 1-and-a-bit percent of the total memory, which is a bit over a gigabyte. So though I think there's plenty of room for the cache, the table above shows rather modest gains for sizes about 100, so I backed it off to that.  That may be true in the usual case, but if we're having to configure maximum footprint sizes or peak counts for the deblender, it can't possibly be true in the worst case.  Glad that we can get most of the benefits from a much smaller cache.  
            Hide
            price Paul Price added a comment - - edited

            Thanks for walking me through this, Jim. I believe I've got it right now, and Jenkins is green. Do you want to have another quick look in afw?

            Edit: I linked to the wrong Jenkins before. The proper Jenkins has one red, but that looks to be a system problem (despite multiple restarts) that I feel justified to ignore.

            Show
            price Paul Price added a comment - - edited Thanks for walking me through this, Jim. I believe I've got it right now, and Jenkins is green. Do you want to have another quick look in afw? Edit: I linked to the wrong Jenkins before. The proper Jenkins has one red, but that looks to be a system problem (despite multiple restarts) that I feel justified to ignore.
            Hide
            jbosch Jim Bosch added a comment -

            Yup, looks much better. I left two tiny comments on the afw PR, but these are just style issues now, not anything structural.

            Show
            jbosch Jim Bosch added a comment - Yup, looks much better. I left two tiny comments on the afw PR, but these are just style issues now, not anything structural.
            Hide
            price Paul Price added a comment -

            Final measured gain is 42.7% (10,517 sec --> 6,025 sec), in combination with DM-13847.

            Show
            price Paul Price added a comment - Final measured gain is 42.7% (10,517 sec --> 6,025 sec), in combination with DM-13847 .
            price Paul Price made changes -
            Hide
            price Paul Price added a comment -

            Final Jenkins is green.

            Merged to master.

            Show
            price Paul Price added a comment - Final Jenkins is green. Merged to master.
            price Paul Price made changes -
            Resolution Done [ 10000 ]
            Status Reviewed [ 10101 ] Done [ 10002 ]
            price Paul Price made changes -
            Component/s meas_base [ 10750 ]
            Component/s pipe_tasks [ 10726 ]
            Component/s utils [ 10723 ]

              People

              • Assignee:
                price Paul Price
                Reporter:
                price Paul Price
                Reviewers:
                Jim Bosch
                Watchers:
                Jim Bosch, Paul Price
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Summary Panel