  1. Data Management
  2. DM-15221

parallelize validate_drp ingest

    Details

    • Type: Improvement
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: validate_drp, Verification
    • Labels:
      None

      Description

      Most of the runtime of validate_drp's MatchedVisitsMetricsTask is spent on ingest and on building the MultiMatch catalog. This step should be trivially parallelizable (modulo MultiMatch's thread safety), and parallelizing it could drastically reduce validate_drp's runtime on large datasets.

      The first step is to try a ThreadPool (to see whether I/O parallelization gains us anything); the next is to try a ProcessPool (and check whether the returned objects are picklable).
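A minimal sketch of that two-step experiment, using Python's standard `concurrent.futures` executors. The names `load_one_catalog`, `is_picklable`, and `load_all` are hypothetical stand-ins invented for illustration; the real per-visit read in validate_drp returns afw table objects, not dictionaries, and may behave differently.

```python
import pickle
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def load_one_catalog(data_id):
    """Hypothetical stand-in for the per-visit catalog read."""
    time.sleep(0.01)  # simulate time spent waiting on disk
    return {"data_id": data_id, "n_sources": 100 + data_id}


def is_picklable(obj):
    """Return True if obj can be serialized with pickle.

    A ProcessPool must pickle every return value to send it back to the
    parent process, so this check is the gate for the second step.
    """
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


def load_all(data_ids, executor_cls, max_workers=4):
    """Load all catalogs with the given executor class, keeping input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(load_one_catalog, data_ids))


if __name__ == "__main__":
    data_ids = list(range(8))

    # Step 1: threads help only if the reads release the GIL (e.g. blocking I/O).
    start = time.perf_counter()
    threaded = load_all(data_ids, ThreadPoolExecutor)
    print(f"threads:   {time.perf_counter() - start:.3f} s")

    # Step 2: processes sidestep the GIL but require picklable results.
    if all(is_picklable(cat) for cat in threaded):
        start = time.perf_counter()
        load_all(data_ids, ProcessPoolExecutor)
        print(f"processes: {time.perf_counter() - start:.3f} s")
```

Because both executors share the same `map` interface, switching between them is a one-argument change, which keeps the comparison fair.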

            Activity

            Parejkoj John Parejko added a comment -

            Note: I played with this a bit more after rebasing (see the updated branch), and it's clear that the missing component is pickling of the afw table Schema. I filed a ticket to do that. It looks like we should be able to get a big speed bump just by parallelizing the file reads and leaving MultiMatch to be done in serial (with the hope of replacing MultiMatch with something much faster in the future). My evidence for this is some line_profile runs on the purely serial code, in which loadOneCatalog and mmatch.add account for roughly 60% and 25%, respectively, of the total runtime of _loadAndMatchCatalogs.
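The approach described above (parallel reads, serial matching) can be sketched as follows. This is an illustration under stated assumptions, not the validate_drp implementation: `load_one_catalog`, `SerialMatcher`, and `load_and_match` are hypothetical stand-ins, and `amdahl_speedup` applies Amdahl's law on the assumption that the roughly 60% read fraction parallelizes perfectly while everything else stays serial.

```python
from concurrent.futures import ThreadPoolExecutor


def load_one_catalog(data_id):
    """Hypothetical stand-in for the per-file read."""
    return [(data_id, i) for i in range(3)]


class SerialMatcher:
    """Toy accumulator standing in for the MultiMatch object.

    add() is only ever called from the main thread, so the matcher
    itself does not need to be thread-safe.
    """

    def __init__(self):
        self.rows = []

    def add(self, catalog):
        self.rows.extend(catalog)


def load_and_match(data_ids, max_workers=4):
    """Read catalogs concurrently, then feed them to the matcher one at a time."""
    matcher = SerialMatcher()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so add() sees the same
        # sequence of catalogs as the purely serial loop would.
        for catalog in pool.map(load_one_catalog, data_ids):
            matcher.add(catalog)
    return matcher


def amdahl_speedup(parallel_fraction, n_workers):
    """Upper bound on overall speedup when only part of the work is parallel."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)


if __name__ == "__main__":
    matcher = load_and_match(range(5))
    print(len(matcher.rows), "rows matched")
    for n in (2, 4, 8):
        print(n, "workers:", round(amdahl_speedup(0.6, n), 2))
```

Under those assumptions, four workers give at most about a 1.8x overall speedup, and no number of workers can exceed 2.5x, because the remaining 40% of the runtime is untouched.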

            wmwood-vasey Michael Wood-Vasey added a comment -

            Is the fundamental issue

            1. That the afw.table reading in the first place is slow?
            2. The calculation of the calibrated values (and required reading of the calexp metadata)?

            If #1 then it seems like speeding up the original reading is the key thing to do.
            If it's #2, then the intermediate persisting of files may be a useful step.
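One way to tell the two cases apart is to time each phase separately and compare their shares of the total. The sketch below is only illustrative: `read_catalog` and `calibrate` are hypothetical placeholders for the afw.table read and the calibration step, and `PhaseTimer` is a helper invented here.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named phase."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def fractions(self):
        """Return each phase's share of the total measured time."""
        total = sum(self.totals.values())
        if total == 0.0:
            return {name: 0.0 for name in self.totals}
        return {name: t / total for name, t in self.totals.items()}


def read_catalog(data_id):
    """Hypothetical placeholder for reading one catalog and its metadata."""
    return list(range(1000))


def calibrate(catalog):
    """Hypothetical placeholder for computing calibrated values."""
    return [x * 2.0 for x in catalog]


def profile_phases(data_ids):
    """Time the read and calibration phases over all data IDs."""
    timer = PhaseTimer()
    for data_id in data_ids:
        with timer.phase("read"):
            catalog = read_catalog(data_id)
        with timer.phase("calibrate"):
            calibrate(catalog)
    return timer.fractions()


if __name__ == "__main__":
    for name, share in profile_phases(range(20)).items():
        print(f"{name}: {share:.0%}")
```

If the "read" share dominates, case 1 applies and the read path is the place to optimize; if "calibrate" dominates, persisting intermediate results becomes more attractive.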

            Parejkoj John Parejko added a comment -

            Most of the time is spent reading the catalog and metadata: the calibration calculations are less than 5% of the runtime.


              People

              • Assignee:
                jcarlin Jeffrey Carlin
                Reporter:
                Parejkoj John Parejko
                Watchers:
                Jim Bosch, John Parejko, John Swinbank, Michael Wood-Vasey, Simon Krughoff
              • Votes:
                0
