Details
Type: Improvement
Status: Won't Fix
Resolution: Done
Fix Version/s: None
Component/s: validate_drp, Verification
Labels: None
Epic Link:
Team: DM Science
Description
The majority of the time when running validate_drp's MatchedVisitsMetricsTask is spent on ingest and building the MultiMatch catalog. This should be a trivially parallelizable step (modulo MultiMatch's thread safety), and could drastically reduce the runtime of validate_drp on large datasets.
The first step is to try a ThreadPool (to see whether I/O parallelization gains us anything); the next is to try a ProcessPool (and see whether the returned objects are picklable).
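As a rough illustration of the two experiments, here is a minimal Python sketch using only the standard library. The names load_one_catalog, is_picklable and load_catalogs_threaded are hypothetical stand-ins, not validate_drp or afw functions; the real per-visit read would go where the placeholder is.

```python
from concurrent.futures import ThreadPoolExecutor
import pickle


def load_one_catalog(data_id):
    # Hypothetical stand-in for the per-visit catalog read; the real
    # code would read a source catalog from disk here.
    return {"data_id": data_id, "rows": [data_id * 10 + i for i in range(3)]}


def is_picklable(obj):
    # A ProcessPool sends results back to the parent via pickle, so any
    # object that fails here cannot be returned from a worker process.
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


def load_catalogs_threaded(data_ids, max_workers=4):
    # Threads share memory, so no pickling is needed; this only helps
    # if the reads spend most of their time waiting on I/O.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as data_ids.
        return list(pool.map(load_one_catalog, data_ids))


catalogs = load_catalogs_threaded([1, 2, 3])
```

Running is_picklable on one real loaded catalog is a cheap way to find out whether the ProcessPool variant is viable before writing it.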
Note: I played with this a bit more after rebasing (see the updated branch), and it's clear that the missing component is pickling of the afw table Schema. I filed a ticket to do that. It looks like we should be able to get a big speedup just by parallelizing the file reads, leaving MultiMatch to run serially (with the hope of replacing MultiMatch with something much faster in the future). My evidence for this is some line_profile runs on the purely serial code, in which loadOneCatalog and mmatch.add account for roughly 60% and 25%, respectively, of the total runtime of _loadAndMatchCatalogs.
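To put a number on the expected gain, the quoted split can be fed into Amdahl's law. This is only a back-of-the-envelope sketch: it assumes the ~60% spent in loadOneCatalog parallelizes perfectly across workers and that everything else (mmatch.add and the remaining overhead) stays serial. The function name amdahl_speedup is illustrative.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    # Amdahl's law: the serial part is unchanged, the parallel part is
    # divided evenly among the workers.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Fraction of _loadAndMatchCatalogs runtime spent in file reads,
# taken from the profiling numbers quoted above.
LOAD_FRACTION = 0.60

for n in (2, 4, 8):
    print(n, "workers:", round(amdahl_speedup(LOAD_FRACTION, n), 2))

# Upper bound as the number of workers grows without limit.
ceiling = 1.0 / (1.0 - LOAD_FRACTION)
```

Under these assumptions the speedup of _loadAndMatchCatalogs is capped at about 2.5x no matter how many workers are used, which is why a faster replacement for the serial matching step would still matter.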