Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-10843

Implement new afw::math::Statistics

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Won't Fix
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: afw
    • Labels:
      None
    • Story Points:
      8
    • Sprint:
      DRP F17-1, DRP F17-2, DRP F17-3, DRP F17-4, DRP F17-5
    • Team:
      Data Release Production

      Description

      Following agreement (DM-10842), implement the new design in afw.

        Attachments

          Issue Links

            Activity

            Hide
            swinbank John Swinbank added a comment -

            This includes getting to the same level of functionality as the old afw::math::Statistics with the same level of test coverage.

            Show
            swinbank John Swinbank added a comment - This includes getting to the same level of functionality as the old afw::math::Statistics with the same level of test coverage.
            Hide
            pschella Pim Schellart [X] (Inactive) added a comment -

            Slight scope change. First only do one overload (taking an afw::image::Image). Postponing the other overloads to a separate ticket to avoid duplicate work when interfaces are requested to change.

            Show
            pschella Pim Schellart [X] (Inactive) added a comment - Slight scope change. First only do one overload (taking an afw::image::Image ). Postponing the other overloads to a separate ticket to avoid duplicate work when interfaces are requested to change.
            Hide
            pschella Pim Schellart [X] (Inactive) added a comment -

            In review with updated scope.

            Show
            pschella Pim Schellart [X] (Inactive) added a comment - In review with updated scope.
            Hide
            rowen Russell Owen added a comment -

            C++ API

            I think it is great that the new API takes ndarrays as its primary source of data. That should make it easy to provide overloads for Image, MaskedImage and std::vector<double>.

            I am uneasy about the new StatisticsResult object. On the plus side:

            • It arguably makes code more readable than calling getResult, getValue, or getError.
            • It separates the result from the object that computed it, so there is no question of results being affected by calling methods on the statistics object.
            • It resolves a minor issue with for items that have no associated error, such as minimum value: getResult must either return a fake value for error (presumably 0) or raise an exception. StatisticsResult can simply omit such fields.

            However, the new code must provide a fake value in StatisticsResult for every statistic it did not compute. This really worries me. You do provide the bit mask, so it is practical for a user to find out if a given value was computed. But I suspect users will not always bother, resulting in subtle bugs.

            I can think of at least three ways to prevent users from accessing values that were not computed:

            • Return a PropertyList instead of a StatisticsResult. This makes value access a bit clumsier, but it's safe, straightforward, and one can easily get a list of values that were computed in readable form.
            • Return the result in an opaque object that has the old methods getValue, etc.
            • Use the old technique of getValue, etc., on the object that computes the statistics.

            I suggest putting the new statistics code in namespace lsst::afw::math or a new sub-package of afw, such as lsst::afw::statistics. lsst::afw::math::statistics seems too deep a namespace for such important code. One possibility is to keep the public API in lsst::afw::math and put the implementation into lsst::afw::math::detail.

            Template Metaprogramming

            Due to the use of template metaprogramming, the new code is quite complex in places. I'm not sure this complexity is justified.

            My understanding is that template metaprogramming is used for two reasons:

            • To make it easier to add a new statistic.
            • For optimal performance when computing minimal statistics, such as the minimum of a set of values.

            The new code should make it very easy to add new single-pass (e.g. non-clipped) statistics. But if I understand correctly, the primary gain in simplicity and performance arises if the new statistic does not make use of the existing common intermediate products (such as number of items, sum of values, mean...). If the new statistic uses standard intermediate products then I suspect the old code would also allow easy adding of the statistic and fine performance when computing it.

            Adding a clipped statistic is, of course, more work than adding a statistic than can be computed in a single pass. However, the additional work appears to be quite reasonable. I'm not sure how hard this would be with the old code.

            Adding a new mulit-pass statistic that does not rely on clipped mean and variance appears to be very difficult in the new code; I think it would require some refactoring. I don't know about the old code (but would be surprised if it was easy).

            I suspect any gains in performance will be limited to the case of C++ code computing a few really trivial items, such as the minimum and maximum value. If I understand correctly, any statistic that is more complicated, even mean, is likely to have very similar performance to the old code, and Python is unlikely to perform better when computing any statistic.

            Note: running simple statistics in C++ at maximum speed requires some care; if I understand correctly, one should use calculateStatistics and extract, instead of standardStatistics. That is a bit more complicated, though not too bad, due to very useful syntactic sugar provided by extract.

            Thus I am not entirely convinced the template metaprogramming is justified. I admit the old code is also complicated in places, and some of that complexity may be intrinsic (rather than things that could be cleaned up by refactoring). But overall I think it is easier to understand the old code.

            If we do go this route I think it essential to carefully document the various helper classes and provide an overview of the design.

            It will also help to rename a few of the more confusing classes, such as ConcreteProduct (which sounds like the non-abstract version of Product but is not) and None which is a null product, and so should probably be called NullProduct or NoProduct.

            Speaking of which...is None necessary? If it is practical to make a ProductSet with no entries, I think that would be more elegant than defining an empty product set as one that contains None.

            pybind11 wrapper for "standardStatistics"

            I strongly suggest you return to using a bit mask to define the desired statistics, instead of having one boolean argument for each possible statistic. It may be slightly more verbose, especially to compute only one or two statistics, but I think it makes the API cleaner. Using keyword arguments has the risk that a user may provide positional values (at least in Python 2), resulting in unreadable code. Also, having so many arguments for a function is not usually recommended, and I don't think it is needed here. In addition to possibly improving the API, using a bit mask simplifies the wrapper code, removing a lot of boilerplate.

            Consider returning a pipe_base Struct instead of a dict of values. This simplifes access to the values e.g. mean = standardStatistics(image, MEAN).mean vs. mean = standardStatistics(image, MEAN)["mean"]. Please do not let dependencies be an argument against using Struct; afw should, perhaps, not depend on pipe_base, but Struct could easily be moved into daf_base, with an alias in pipe_base for backwards compatibility. We have found other uses for Struct and I intend to RFC moving it.

            Show
            rowen Russell Owen added a comment - C++ API I think it is great that the new API takes ndarrays as its primary source of data. That should make it easy to provide overloads for Image , MaskedImage and std::vector<double> . I am uneasy about the new StatisticsResult object. On the plus side: It arguably makes code more readable than calling getResult , getValue , or getError . It separates the result from the object that computed it, so there is no question of results being affected by calling methods on the statistics object. It resolves a minor issue with for items that have no associated error, such as minimum value: getResult must either return a fake value for error (presumably 0) or raise an exception. StatisticsResult can simply omit such fields. However, the new code must provide a fake value in StatisticsResult for every statistic it did not compute. This really worries me. You do provide the bit mask, so it is practical for a user to find out if a given value was computed. But I suspect users will not always bother, resulting in subtle bugs. I can think of at least three ways to prevent users from accessing values that were not computed: Return a PropertyList instead of a StatisticsResult . This makes value access a bit clumsier, but it's safe, straightforward, and one can easily get a list of values that were computed in readable form. Return the result in an opaque object that has the old methods getValue , etc. Use the old technique of getValue , etc., on the object that computes the statistics. I suggest putting the new statistics code in namespace lsst::afw::math or a new sub-package of afw , such as lsst::afw::statistics . lsst::afw::math::statistics seems too deep a namespace for such important code. One possibility is to keep the public API in lsst::afw::math and put the implementation into lsst::afw::math::detail . Template Metaprogramming Due to the use of template metaprogramming, the new code is quite complex in places. I'm not sure this complexity is justified. My understanding is that template metaprogramming is used for two reasons: To make it easier to add a new statistic. For optimal performance when computing minimal statistics, such as the minimum of a set of values. The new code should make it very easy to add new single-pass (e.g. non-clipped) statistics. But if I understand correctly, the primary gain in simplicity and performance arises if the new statistic does not make use of the existing common intermediate products (such as number of items, sum of values, mean...). If the new statistic uses standard intermediate products then I suspect the old code would also allow easy adding of the statistic and fine performance when computing it. Adding a clipped statistic is, of course, more work than adding a statistic than can be computed in a single pass. However, the additional work appears to be quite reasonable. I'm not sure how hard this would be with the old code. Adding a new mulit-pass statistic that does not rely on clipped mean and variance appears to be very difficult in the new code; I think it would require some refactoring. I don't know about the old code (but would be surprised if it was easy). I suspect any gains in performance will be limited to the case of C++ code computing a few really trivial items, such as the minimum and maximum value. If I understand correctly, any statistic that is more complicated, even mean, is likely to have very similar performance to the old code, and Python is unlikely to perform better when computing any statistic. Note: running simple statistics in C++ at maximum speed requires some care; if I understand correctly, one should use calculateStatistics and extract , instead of standardStatistics . That is a bit more complicated, though not too bad, due to very useful syntactic sugar provided by extract . Thus I am not entirely convinced the template metaprogramming is justified. I admit the old code is also complicated in places, and some of that complexity may be intrinsic (rather than things that could be cleaned up by refactoring). But overall I think it is easier to understand the old code. If we do go this route I think it essential to carefully document the various helper classes and provide an overview of the design. It will also help to rename a few of the more confusing classes, such as ConcreteProduct (which sounds like the non-abstract version of Product but is not) and None which is a null product, and so should probably be called NullProduct or NoProduct . Speaking of which...is None necessary? If it is practical to make a ProductSet with no entries, I think that would be more elegant than defining an empty product set as one that contains None . pybind11 wrapper for "standardStatistics" I strongly suggest you return to using a bit mask to define the desired statistics, instead of having one boolean argument for each possible statistic. It may be slightly more verbose, especially to compute only one or two statistics, but I think it makes the API cleaner. Using keyword arguments has the risk that a user may provide positional values (at least in Python 2), resulting in unreadable code. Also, having so many arguments for a function is not usually recommended, and I don't think it is needed here. In addition to possibly improving the API, using a bit mask simplifies the wrapper code, removing a lot of boilerplate. Consider returning a pipe_base Struct instead of a dict of values. This simplifes access to the values e.g. mean = standardStatistics(image, MEAN).mean vs. mean = standardStatistics(image, MEAN)["mean"] . Please do not let dependencies be an argument against using Struct ; afw should, perhaps, not depend on pipe_base , but Struct could easily be moved into daf_base , with an alias in pipe_base for backwards compatibility. We have found other uses for Struct and I intend to RFC moving it.
            Hide
            rowen Russell Owen added a comment -

            One more thing: please make quartiles a dependency (intermediate product of) clipped mean and clipped variance. That will allow asking for just clipped mean or variance, and allows the code that implements clipping to not have to test if the quartiles product is present.

            Show
            rowen Russell Owen added a comment - One more thing: please make quartiles a dependency (intermediate product of) clipped mean and clipped variance. That will allow asking for just clipped mean or variance, and allows the code that implements clipping to not have to test if the quartiles product is present.
            Hide
            jbosch Jim Bosch added a comment -

            With great sadness, I'm declaring this ticket Won't Fix. I don't see anybody picking up Pim's old branch, and I don't see us assigning anyone to do another total overhaul. We will fix problems as they come up, and for major changes we should strongly consider solving them individually outside of the context of afw::math::Statistics to ensure we have a project that can converge (and avoid making the maintenance mess of the existing Statistics class worse). In the end, I think maybe having one Statistics class that solves all problems while guaranteeing best-possible performance for all of them is just too ambitious.

            Show
            jbosch Jim Bosch added a comment - With great sadness, I'm declaring this ticket Won't Fix. I don't see anybody picking up Pim's old branch, and I don't see us assigning anyone to do another total overhaul. We will fix problems as they come up, and for major changes we should strongly consider solving them individually outside of the context of afw::math::Statistics to ensure we have a project that can converge (and avoid making the maintenance mess of the existing Statistics class worse). In the end, I think maybe having one Statistics class that solves all problems while guaranteeing best-possible performance for all of them is just too ambitious.

              People

              Assignee:
              Unassigned Unassigned
              Reporter:
              swinbank John Swinbank
              Reviewers:
              Russell Owen
              Watchers:
              Jim Bosch, Pim Schellart [X] (Inactive), Russell Owen
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins Builds

                  No builds found.