Data Management / DM-5819

Incorporate Price suggestions to make `validate_drp` faster

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Validation
    • Labels: None

      Description

      Increase the loading and processing speed of validate_drp, following suggestions by Paul Price.

      1. Don't read in footprints
      Pass flags=lsst.afw.table.SOURCE_IO_NO_FOOTPRINTS to butler.get

      2. Work on the speed of calculating the RMS and other expensive quantities. Current suggestions:
      a. calcRmsDistances
      b. multiMatch
      c. matchVisitComputeDistance
      d. Consider boolean indexing in afw's multiMatch.py, and changing

         objById = {record.get(self.objectKey): record for record in self.reference}
      to:
         objById = dict(zip(self.reference[self.objectKey], self.reference))
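The two ways of building `objById` suggested above should give the same mapping; the second reads the key column once instead of calling `record.get` per row. A minimal sketch, assuming stand-in `Record` and `Catalog` classes (both hypothetical, written here in plain Python rather than taken from afw.table):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for one row of a reference catalog."""
    object_id: int
    ra: float
    dec: float

    def get(self, key):
        # Per-record field lookup, analogous to record.get(key).
        return getattr(self, key)


class Catalog:
    """Hypothetical stand-in for a catalog that supports column access."""

    def __init__(self, records):
        self._records = list(records)

    def __iter__(self):
        return iter(self._records)

    def __getitem__(self, key):
        # Column access: every value of one field, in row order.
        return [getattr(record, key) for record in self._records]


reference = Catalog([
    Record(11, 10.0, -5.0),
    Record(12, 10.1, -5.1),
    Record(13, 10.2, -5.2),
])
object_key = "object_id"  # plays the role of self.objectKey

# Original form: one get() call per record.
by_id_loop = {record.get(object_key): record for record in reference}

# Suggested form: pair the key column with the records in a single pass.
by_id_zip = dict(zip(reference[object_key], reference))
```

Whether the `zip` form is actually faster on a real catalog depends on how column access is implemented, so it is worth timing before relying on it.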
      

      Note that while this ticket will involve work to reduce the memory footprint of the processing, it will not cover work to re-architect things to enable efficient processing beyond the memory on one node.


          Activity

          wmwood-vasey Michael Wood-Vasey added a comment -

          Reading via tableLib.py remains the dominant contributor to the run time of validate_drp. I'm going to move toward merging the work so far on this ticket, which has improved the post-read performance from O(N^2) to O(N log N) for some calculations.

          But the fundamental issue remains reading the data in the first place. Good performance here is very much tied in with general infrastructure access models, and I'll defer that to a later ticket, when performance in validate_drp becomes a significant issue.

          wmwood-vasey Michael Wood-Vasey added a comment -

          Code now working. Waiting on merge of DM-6328 before rebasing this one and submitting for review.

          wmwood-vasey Michael Wood-Vasey added a comment -

          Relatively quick review.

          • Minor speedups in the calculation of the RMS distances. Now O(N log N) instead of O(N^2).
          • Added a basic test case for the `AMx` calculation with `matchVisitComputeDistance`.
          • Did not strip down the catalog files, in anticipation of a future need for more information, e.g., as in DM-8951.

          But most of the time is actually spent loading the catalogs. Will defer such work to future improvements to afw.table or alternate data access modes.
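The move from O(N^2) to O(N log N) can be illustrated with a sort-then-merge match on visit IDs. This is only a sketch of the general technique, not the code merged on this ticket: the function names, the haversine separation, and the assumption that each object has at most one measurement per visit are all illustrative.

```python
import math


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in radians (haversine formula)."""
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return 2.0 * math.asin(math.sqrt(a))


def match_naive(visits1, pos1, visits2, pos2):
    """O(N^2): compare every visit of object 1 with every visit of object 2."""
    distances = []
    for i, v1 in enumerate(visits1):
        for j, v2 in enumerate(visits2):
            if v1 == v2:
                distances.append(angular_separation(*pos1[i], *pos2[j]))
    return distances


def match_sorted(visits1, pos1, visits2, pos2):
    """O(N log N): sort both visit lists, then walk them in step.

    Assumes visit IDs are unique within each list.
    """
    order1 = sorted(range(len(visits1)), key=visits1.__getitem__)
    order2 = sorted(range(len(visits2)), key=visits2.__getitem__)
    distances = []
    b = 0  # cursor into order2; it only ever moves forward
    for i in order1:
        # Skip visits of object 2 that sort before the current visit.
        while b < len(order2) and visits2[order2[b]] < visits1[i]:
            b += 1
        if b < len(order2) and visits2[order2[b]] == visits1[i]:
            j = order2[b]
            distances.append(angular_separation(*pos1[i], *pos2[j]))
    return distances
```

The two functions return the same set of distances, though possibly in a different order, because the sorted version emits them in visit order.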

          afausti Angelo Fausti added a comment -

          Michael Wood-Vasey, looks good; sorting the arrays for comparison certainly helps. Nice to see the results from cProfile, and that reading the data dominates the overall execution time.
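For reference, a profile like the one mentioned here can be collected with the standard-library `cProfile` and `pstats` modules. A minimal sketch, in which `load_catalogs` and `compute_metrics` are hypothetical placeholders (a short sleep stands in for the slow catalog read), not validate_drp functions:

```python
import cProfile
import io
import pstats
import time


def load_catalogs():
    """Hypothetical placeholder for the slow catalog read."""
    time.sleep(0.05)
    return list(range(1000))


def compute_metrics(data):
    """Hypothetical placeholder for the post-read calculations."""
    return sum(x * x for x in data)


def run():
    return compute_metrics(load_catalogs())


# Profile a single call to run() and keep its return value.
profiler = cProfile.Profile()
result = profiler.runcall(run)

# Write the five entries with the largest cumulative time into a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
```

Sorting by cumulative time puts the call that dominates the run near the top of `report`, which is how a read-bound workload shows up.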

          wmwood-vasey Michael Wood-Vasey added a comment -

          Merged to master.


            People

            • Assignee:
              wmwood-vasey Michael Wood-Vasey
              Reporter:
              wmwood-vasey Michael Wood-Vasey
              Reviewers:
              Angelo Fausti
              Watchers:
              Angelo Fausti, Michael Wood-Vasey, Paul Price
