Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-21167

Migrate ap_association fully to new DiaCalculation plugin system for time-series features.

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ap_association
    • Labels:
      None

      Description

      Remove pervious code in ap_association that produces static and time-series features and replace them with the new DiaCalculation measurement pluggin system created in DM-18494

        Attachments

          Issue Links

            Activity

            Hide
            cmorrison Chris Morrison added a comment -

            Ran both the current stack default and the new plugin system over two ccds (24 and 24) in the full HiTS2015 dataset. The goal is to test the relative processing speed of the new plugin system versus the previous implementation.

            Attached are two dot plots showing the percent of time spent in each part of the code during ap_pipe. In the default setup, the function `update_dia_objects` takes up 9.49% of the total processing time while the new plugin system takes up 6.46%, a 30% improvement.

            The function `update_dia_objects` includes both loading of DiaSources and storage of DiaObjects both of which have not changed and represent the same amount of overhead. The methods `_set_flux_stats` in the previous implementation and `diaCalculation.run` represent almost the same process with `diaCalculation.run` having slightly more calculations (updating mean positions, HTM indices, etc.). Even with these differences, `diaCalculation` is 5.79% of the processing vs `_set_flux_stats` taking 7.83%, again about a 30% reduction.

            The reason for this is possibly two fold: First the new plugins use the filter name as an index in the Pandas dataframe instead of creating a filter mask for each set of DiaSources. Second, a bug was discovered in the Stetson Mean calculation that could cause extra iterations in this step.

            Last test will be to check that the Ppdb has proper values within it.

            Previous implementation:

            w/ DiaCalculation plugins:

            Show
            cmorrison Chris Morrison added a comment - Ran both the current stack default and the new plugin system over two ccds (24 and 24) in the full HiTS2015 dataset. The goal is to test the relative processing speed of the new plugin system versus the previous implementation. Attached are two dot plots showing the percent of time spent in each part of the code during ap_pipe. In the default setup, the function `update_dia_objects` takes up 9.49% of the total processing time while the new plugin system takes up 6.46%, a 30% improvement. The function `update_dia_objects` includes both loading of DiaSources and storage of DiaObjects both of which have not changed and represent the same amount of overhead. The methods `_set_flux_stats` in the previous implementation and `diaCalculation.run` represent almost the same process with `diaCalculation.run` having slightly more calculations (updating mean positions, HTM indices, etc.). Even with these differences, `diaCalculation` is 5.79% of the processing vs `_set_flux_stats` taking 7.83%, again about a 30% reduction. The reason for this is possibly two fold: First the new plugins use the filter name as an index in the Pandas dataframe instead of creating a filter mask for each set of DiaSources. Second, a bug was discovered in the Stetson Mean calculation that could cause extra iterations in this step. Last test will be to check that the Ppdb has proper values within it. Previous implementation: w/ DiaCalculation plugins:
            Hide
            cmorrison Chris Morrison added a comment -

            Confirmed that the output values into the Ppdb are reasonable. Some slight differences are seen, however the overall shape of the values is the same. Considering that none of the previous calculations were unittested, I'm confident that the values from the plugins are the more accurate.

            Show
            cmorrison Chris Morrison added a comment - Confirmed that the output values into the Ppdb are reasonable. Some slight differences are seen, however the overall shape of the values is the same. Considering that none of the previous calculations were unittested, I'm confident that the values from the plugins are the more accurate.
            Show
            cmorrison Chris Morrison added a comment - Jenkins: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/30617/pipeline
            Hide
            swinbank John Swinbank added a comment -

            This is very nice! Thanks Chris Morrison!

            I left a few comments on GitHub, but they are basically all trivial issues.

            I also have a few comments on the commit structure:

            • Please squash e6c23c4 into a6c37a.
            • I've no objection to renaming “median” to “Percentile50”, but the commit message should explain why you chose to make this change.
            • 07e6666 is described as “Change median to Percentile50”. But, in fact, half that change seems to happen in a6c37407. Could you move all of these changes to the “change median to Percentile50” commit, please?
            • Why is 161bee4 a separate commit? I don't understand why it's not just part of a6c37407.

            These aren't just picky changes — I find it really useful to look at the commit structure and read a nice history of the work I'm reviewing, but here I just came away a bit confused.

            Show
            swinbank John Swinbank added a comment - This is very nice! Thanks Chris Morrison ! I left a few comments on GitHub, but they are basically all trivial issues. I also have a few comments on the commit structure: Please squash e6c23c4 into a6c37a . I've no objection to renaming “median” to “Percentile50”, but the commit message should explain why you chose to make this change. 07e6666 is described as “Change median to Percentile50”. But, in fact, half that change seems to happen in a6c37407 . Could you move all of these changes to the “change median to Percentile50” commit, please? Why is 161bee4 a separate commit? I don't understand why it's not just part of a6c37407 . These aren't just picky changes — I find it really useful to look at the commit structure and read a nice history of the work I'm reviewing, but here I just came away a bit confused.
            Hide
            swinbank John Swinbank added a comment -

            Also, the performance win is awesome!

            Show
            swinbank John Swinbank added a comment - Also, the performance win is awesome!
            Hide
            cmorrison Chris Morrison added a comment - - edited
            • Done
            • The name change happened in the previous ticket of plugin creation. Here is just a bug fix for compatibility with the afw table output from dax_ppdb.
            • See last point. It is a bug fix, so I can sqaush it into another commit if you want.
            • The commit was switching over to diaCalculation and making sure it worked with out the complications of some of the sub plugins failing. I then added them all in after the confirming that things worked and fixing the "median" bug. If you want I can squash the first 3 commits on the ticket into each other.

            Jenkins: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/30629/pipeline

            Show
            cmorrison Chris Morrison added a comment - - edited Done The name change happened in the previous ticket of plugin creation. Here is just a bug fix for compatibility with the afw table output from dax_ppdb. See last point. It is a bug fix, so I can sqaush it into another commit if you want. The commit was switching over to diaCalculation and making sure it worked with out the complications of some of the sub plugins failing. I then added them all in after the confirming that things worked and fixing the "median" bug. If you want I can squash the first 3 commits on the ticket into each other. Jenkins:  https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/30629/pipeline
            Hide
            swinbank John Swinbank added a comment -

            I would squash commits 1, 2, 3 & 5 (ie, including the “respond to review comments” into a single commit, then merge. Thank you!

            Show
            swinbank John Swinbank added a comment - I would squash commits 1, 2, 3 & 5 (ie, including the “respond to review comments” into a single commit, then merge. Thank you!

              People

              • Assignee:
                cmorrison Chris Morrison
                Reporter:
                cmorrison Chris Morrison
                Reviewers:
                John Swinbank
                Watchers:
                Chris Morrison, Eric Bellm, John Swinbank, Krzysztof Findeisen
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Summary Panel