# Investigate PSF cache misses


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: meas_base, pipe_tasks, utils
• Labels: None
• Team: External

#### Description

Profiling reveals that Psf::computeKernelImage is the tall pole in measureCoaddSources.py, taking 47.5% of the runtime (much of this is due to inefficiencies in AST; DM-13847). The calling graphs for Psf::computeKernelImage and Psf::computeShape (which calls Psf::computeKernelImage) show multiple callers with significant contributions, which suggests that the caching functionality in Psf is not as effective as we desire. Investigate these cache misses and devise a plan for improving the cache hits.

{code}
  0.0 .......... 0.01 / 0.36 lsst::afw::detection::Psf::getLocalKernel(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color) const [2074]
  0.0 .......... 0.13 / 32.09 lsst::meas::modelfit::DoubleShapeletPsfApproxAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [347]
  7.1 .......... 745.86 / 752.92 lsst::meas::extensions::shapeHSM::HsmPsfMomentsAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [160]
  7.3 .......... 772.18 / 775.05 lsst::afw::detection::Psf::doComputeImage(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [156]
 33.1 .......... 3'482.05 / 3'495.95 lsst::meas::algorithms::ImagePsf::doComputeShape(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [73]
[67] 47.5 5'000.21 0.07 / 5'000.14 lsst::afw::detection::Psf::computeKernelImage(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color, lsst::afw::detection::Psf::ImageOwnerEnum) const
 47.5 .......... 4'999.65 / 4'999.65 lsst::meas::algorithms::CoaddPsf::doComputeKernelImage(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [68]
  0.0 .......... 0.38 / 21.81 lsst::afw::image::Image<double>::Image(lsst::afw::image::Image<double> const&, bool) [411]
  0.0 .......... 0.10 / 3.91 lsst::afw::image::Image<double>::~Image() [759]
  0.0 .......... 0.01 / 136.99 operator new(unsigned long) [229]
  0.0 .......... 0.01 / 0.05 std::_Sp_counted_ptr_inplace<lsst::afw::image::Image<double>, std::allocator<lsst::afw::image::Image<double> >, (__gnu_cxx::_Lock_policy)2>::_M_destroy() [3651]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  2.5 .......... 263.99 / 877.79 lsst::meas::extensions::photometryKron::KronFluxAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [143]
  5.6 .......... 587.63 / 587.69 lsst::meas::extensions::photometryKron::calculatePsfKronRadius(std::shared_ptr<lsst::afw::detection::Psf const> const&, lsst::afw::geom::Point<double, 2> const&, double) [171]
  7.2 .......... 757.15 / 778.02 lsst::meas::base::SdssShapeAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [154]
  7.2 .......... 760.64 / 763.68 lsst::meas::base::LocalBackgroundAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [158]
  7.4 .......... 774.28 / 799.98 lsst::meas::base::SdssCentroidAlgorithm::measure(lsst::afw::table::SourceRecord&, lsst::afw::image::Exposure<float, int, float> const&) const [151]
[72] 33.2 3'495.98 0.03 / 3'495.95 lsst::afw::detection::Psf::computeShape(lsst::afw::geom::Point<double, 2>, lsst::afw::image::Color) const
 33.2 .......... 3'495.95 / 3'495.95 lsst::meas::algorithms::ImagePsf::doComputeShape(lsst::afw::geom::Point<double, 2> const&, lsst::afw::image::Color const&) const [73]
{code}
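The caching under investigation is, in essence, a capacity-limited most-recently-used map keyed by the evaluation position. A minimal Python sketch (hypothetical, not the actual lsst::utils::Cache implementation) shows why a capacity of 1 misses whenever callers alternate between positions:

```python
from collections import OrderedDict

class LruCache:
    """Most-recently-used cache with a fixed capacity, keyed by a hashable key."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, compute):
        """Return the cached value for key, computing and storing it on a miss."""
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = compute(key)
        self._data[key] = value
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value

# With capacity 1, alternating between two positions never hits:
cache = LruCache(1)
for pos in [(0.0, 0.0), (1.0, 1.0)] * 3:
    cache.get(pos, lambda p: "kernel image at %r" % (p,))
print(cache.hits, cache.misses)  # 0 6
```

When many measurement algorithms each request the PSF at the same source position in turn, even a modest capacity increase converts most of those repeat requests into hits.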

#### Activity

Paul Price created issue -
Field changed: Description (the profiling output shown above was added).
Paul Price added a comment -

There are places in the deblender that call psf.computeShape(), i.e., with no position, and so the average position is used. The PSF at the average position is often not cached, because it's not requested in succession. We should cache the PSF at the average position explicitly.
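One way to implement that suggestion is to give the average-position result a dedicated slot that the LRU traffic can never evict. The sketch below is illustrative only; the class and method names are hypothetical, not the actual afw API:

```python
class Psf:
    """Toy PSF with an explicit, never-evicted cache for the average position."""

    def __init__(self, average_position):
        self._average_position = average_position
        self._average_shape = None  # dedicated slot, outside the LRU cache
        self.evaluations = 0

    def _do_compute_shape(self, position):
        self.evaluations += 1
        return ("shape", position)  # stand-in for the real computation

    def compute_shape(self, position=None):
        if position is None:
            # Calls with no position use the average position; caching that
            # result explicitly means intervening per-position requests
            # cannot evict it.
            if self._average_shape is None:
                self._average_shape = self._do_compute_shape(self._average_position)
            return self._average_shape
        return self._do_compute_shape(position)

psf = Psf(average_position=(100.0, 100.0))
for _ in range(1000):
    psf.compute_shape()  # deblender-style calls with no position
print(psf.evaluations)  # 1
```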

Paul Price added a comment -

Increasing the Psf cache size from 1 to 100 decreases the runtime from 10,517 sec to 8,137 sec (22% gain), and raises ast::Mapping::_tran to the tallest pole, at 37.7%.

Paul Price added a comment -

Increasing the CoaddPsf cache size to 100,000 (and leaving the default Psf cache size at 100) gains us a bit more: now 7,781 sec (25% gain). ast::Mapping::_tran rises to 39.2%.

For my particular test data set (tract=8764 patch=4,4 filter=HSC-I), there are 11,074 footprints and 25,233 sources. The following cache statistics are from my cache simulator. There is a little uncertainty here because the numbers from my simulator don't exactly match the numbers from the actual program, but they are close (I believe the difference is in the floating-point precision: the positions I saved only have about one decimal place, while the actual program sees more than that).

| Statistic | computeKernelImage | computeImage |
| --- | --- | --- |
| Total cache requests | 385,306 | 68,952 |
| Unique requests | 62,962 | 42,126 |
| Hits, cache size 1 | 223,153 | 21,874 |
| Hits, cache size 10 | 306,082 | 23,876 |
| Hits, cache size 100 | 308,118 | 24,003 |
| Hits, cache size 1000 | 308,146 | 24,005 |
| Hits, cache size 10000 | 308,146 | 24,005 |
| Hits, cache size 30000 | 312,588 | 26,826 |
| Hits, cache size 100000 | 322,344 | 26,826 |
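A cache simulator like the one behind this table can be as simple as replaying the recorded stream of request keys through an LRU map and counting hits. A hypothetical sketch (the toy request stream below is mine, not the recorded data):

```python
from collections import OrderedDict

def simulate(requests, capacity):
    """Replay a stream of cache keys through an LRU cache; return the hit count."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # refresh recency on a hit
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits

# Toy request stream: four sweeps over the same 5 positions. An LRU cache
# smaller than the working set gets zero hits on a cyclic sweep; once the
# whole working set fits, every repeat request hits.
requests = [(float(i), float(i)) for i in range(5)] * 4
for capacity in (1, 4, 5):
    print(capacity, simulate(requests, capacity))  # 1 0 / 4 0 / 5 15
```

The same effect shows up in the table: hit counts plateau once the cache covers the common working set, then jump again only at very large sizes that capture rare revisits.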
Link: This issue blocks DM-13665.
Paul Price added a comment -

Thanks for volunteering to review this, Jim Bosch.

{code}
price@pap-laptop:~/LSST/utils (tickets/DM-13854=) $ git sub
commit 127b605493f62c995f0129f48801a141eb3cf5ac
Author: Paul Price <price@astro.princeton.edu>
Date:   Wed Mar 21 12:13:47 2018 -0400

    add Cache class

    Cache of most recently used objects, hashed by some key, with user-defined
    capacity limit.

 include/lsst/utils/Cache.h        | 187 ++++++++++++++++++++++++++++++++++++++
 include/lsst/utils/python/Cache.h |  39 ++++++++
 tests/SConscript                  |   4 +-
 tests/cache.cc                    |   8 ++
 tests/test_cache.py               | 131 ++++++++++++++++++++++++++
 5 files changed, 367 insertions(+), 2 deletions(-)

commit e451ad11c1d72aefd79d9f3aeaeb252460938174 (HEAD -> tickets/DM-13854, origin/tickets/DM-13854)
Author: Paul Price <price@astro.princeton.edu>
Date:   Wed Mar 21 19:27:27 2018 -0400

    Cache: add compile-time debug instrumentation

    Allows us to characterise the cache performance.

 examples/simulateCache.py  | 52 +++++++++++++++++++++++++++++++++++++
 include/lsst/utils/Cache.h | 65 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 116 insertions(+), 1 deletion(-)

price@pap-laptop:~/LSST/afw (tickets/DM-13854=) $ git sub
commit e7c6d221064eb5084d50aca9b9d6bfcb96460208
Author: Paul Price <price@astro.princeton.edu>
Date:   Thu Mar 22 11:51:00 2018 -0400

    make Point hashable

    Allows its use as a key in lsst::utils::Cache.

 include/lsst/afw/geom/Point.h                   |  4 +++
 python/lsst/afw/geom/coordinates/coordinates.cc |  4 +++
 src/geom/Point.cc                               | 33 ++++++++++++++++---------
 tests/test_coordinates.py                       |  9 +++++++
 4 files changed, 38 insertions(+), 12 deletions(-)

commit d85aeb25cec760584f9b0f9cdd0b48be236625cc (HEAD -> tickets/DM-13854, origin/tickets/DM-13854)
Author: Paul Price <price@astro.princeton.edu>
Date:   Tue Mar 20 17:10:49 2018 -0400

    Psf: use new lsst::utils::Cache

    This allows caching many more realisations (they're relatively cheap:
    50x50 pixels per realisation @ 8 bytes per pixel); the number of
    realisations is configurable.

    This gains about 22% in the runtime of measureCoaddSources.py using the
    default 100 cache slots for all Psf instances. Allowing CoaddPsf to use
    100000 cache slots increases the gain to about 25%.

 include/lsst/afw/detection/Psf.h | 45 +++++++++++++++++++++++++++++++++------
 python/lsst/afw/detection/psf.cc |  2 ++
 src/detection/Psf.cc             | 46 ++++++++++++++--------------------------
 3 files changed, 56 insertions(+), 37 deletions(-)

price@pap-laptop:~/LSST/meas_base (tickets/DM-13854=) $ git sub
commit 19181402b59be4fc6cd36daf34d0e8cb451ffd1a (HEAD -> tickets/DM-13854, origin/tickets/DM-13854)
Author: Paul Price <price@astro.princeton.edu>
Date:   Thu Mar 22 10:47:11 2018 -0400

    forcedPhotCoadd: add command-line PSF cache configuration

    The size of the PSF cache can have a significant effect upon the runtime
    for complicated PSF models. For coadds, CoaddPsf can be complicated.

 python/lsst/meas/base/forcedPhotCoadd.py | 9 +++++++++
 python/lsst/meas/base/forcedPhotImage.py | 6 +++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

price@pap-laptop:~/LSST/pipe_tasks (tickets/DM-13854=) $ git sub
commit c43de5c2fb866995736cfd8eb5f7a7a2cd1dc09e (HEAD -> tickets/DM-13854, origin/tickets/DM-13854)
Author: Paul Price <price@astro.princeton.edu>
Date:   Wed Mar 21 23:30:37 2018 -0400

    measureCoaddSources: add command-line PSF cache configuration

    CoaddPsf is the current tall pole in running measureCoaddSources.
    Caching it is simple, inexpensive and realises a hefty gain (25%)
    in the runtime.

 python/lsst/pipe/tasks/multiBand.py | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
{code}

Reviewers: Jim Bosch. Status: To Do → In Review.
Jim Bosch added a comment (edited) -

Comments on the PRs.  Three big ones (including one new one):

• I'd like to keep Cache.h out of other headers so it doesn't affect compile times.  (I doubt it'd slow things down appreciably given what else Psf.h includes, but good practices have to start somewhere.)
• We might want to try fuzzing the comparisons in the computeKernelImage cache.  We should not have as many distinct evaluations as we do now.
• 100,000 evaluations seems like an awfully big cache - that's over a gig of memory for a 41x41 PSF at double precision, right?  I don't think we can afford two caches anywhere near that size.  Given that we're basically memory-bound, a 25% performance boost in exchange for going over the 4GB/core limit a lot more often doesn't seem like a clearly good trade.  I'm hoping we can get most of that performance boost with maybe 10% of the cache size, especially if we fuzz the computeKernelImage comparisons.
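The memory estimate in the last point checks out with simple arithmetic: a 41x41 double-precision kernel image is 13,448 bytes, so 100,000 cached realisations is roughly 1.25 GiB before any per-entry overhead.

```python
width = height = 41          # kernel image dimensions (pixels)
bytes_per_pixel = 8          # double precision
entries = 100_000            # proposed CoaddPsf cache size

total_bytes = width * height * bytes_per_pixel * entries
print(total_bytes / 2**30)   # ~1.25 GiB, excluding per-entry overhead
```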
Status: In Review → Reviewed.
Paul Price added a comment -

Thanks for the review, Jim Bosch. I had hoped to get this in tonight, but that's not going to happen, as I'd like to go another round with you if you don't mind. Some replies to your points:

• I put in the forward declaration. This requires small changes in an additional 4 packages in order to pull in the definitions.
• I would like to fuzz the comparisons (I'm thinking of just int(x/precision), where precision can be user-defined), but because I'm targeting this work for the HSC production run which is already underway, I would like to avoid making any scientific changes on this ticket, no matter how small we believe their effect to be. I will put in a ticket to fuzz the comparisons that can go in separately.
• measureCoaddSources.py doesn't use a lot of memory: top on tiger3 typically shows it using one-and-a-bit percent of the total memory, which is a bit over a gigabyte. So although I think there's plenty of room for the cache, the table above shows rather modest gains for sizes above 100, so I backed it off to that.
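The fuzzing idea above, int(x/precision), amounts to quantizing positions before they are used as cache keys, so that requests within the same precision-sized cell collapse onto one entry. A hypothetical sketch (the helper name is mine, not the afw API):

```python
def quantize(position, precision=1.0):
    """Map a floating-point position to a coarse cache key.

    Positions within the same precision-sized cell share a key, so
    near-duplicate requests become cache hits instead of distinct entries.
    (int() truncates toward zero, so cells straddle zero asymmetrically;
    a real implementation might prefer math.floor.)
    """
    return tuple(int(x / precision) for x in position)

# Nearby positions that differ only below the precision collapse to one key:
assert quantize((123.4, 567.8)) == quantize((123.9, 567.2)) == (123, 567)
# Positions further apart than the precision stay distinct:
assert quantize((123.4, 567.8)) != quantize((125.0, 567.8))
```

This is exactly the kind of change that alters which PSF realisation a measurement sees, which is why it belongs on a separate ticket rather than mid-production.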

I haven't cleaned up the commits yet, but I think the end product is in decent shape so please have a look over the weekend if you're able.

Paul Price added a comment -

Jenkins is green.

Jim Bosch added a comment -

> I put in the forward declaration. This requires small changes in an additional 4 packages in order to pull in the definitions.

I've left a few comments on the afw PR that will hopefully get you to the point where the changes in the new packages aren't necessary.  If not, I'm happy to help more; they definitely should not be necessary.

> I would like to fuzz the comparisons (I'm thinking of just int(x/precision), where precision can be user-defined), but because I'm targeting this work for the HSC production run which is already underway, I would like to avoid making any scientific changes on this ticket, no matter how small we believe their effect to be. I will put in a ticket to fuzz the comparisons that can go in separately.

+1

> measureCoaddSources.py doesn't use a lot of memory: top on tiger3 typically shows it using one-and-a-bit percent of the total memory, which is a bit over a gigabyte. So although I think there's plenty of room for the cache, the table above shows rather modest gains for sizes above 100, so I backed it off to that.

That may be true in the usual case, but if we're having to configure maximum footprint sizes or peak counts for the deblender, it can't possibly be true in the worst case.  Glad that we can get most of the benefits from a much smaller cache.

Paul Price added a comment (edited) -

Thanks for walking me through this, Jim. I believe I've got it right now, and Jenkins is green. Do you want to have another quick look in afw?

Edit: I linked to the wrong Jenkins before. The proper Jenkins run has one red result, but that looks to be a system problem (despite multiple restarts) that I feel justified in ignoring.

Jim Bosch added a comment -

Yup, looks much better. I left two tiny comments on the afw PR, but these are just style issues now, not anything structural.

Paul Price added a comment -

Final measured gain is 42.7% (10,517 sec --> 6,025 sec), in combination with DM-13847.

Paul Price added a comment -

Final Jenkins is green.

Merged to master.

Resolution: Done. Status: Reviewed → Done.
Component/s added: meas_base, pipe_tasks, utils.

#### People

Assignee:
Paul Price
Reporter:
Paul Price
Reviewers:
Jim Bosch
Watchers:
Jim Bosch, Paul Price