Chris Morrison reports that ap_verify can time out when run over large datasets on lsst-dev. This is partly because of the time needed to run the pipeline itself, but also the time needed to iterate through all metrics (since MetricsControllerTask was never parallelized). ap_verify does not offer any error recovery options beyond rerunning the entire pipeline.
Both ap_verify's current control system and MetricsControllerTask will become obsolete with Gen 3, where responsibility for workflow management (and any checkpointing) will lie with the pipeline activator. Rather than try to design proper restart behavior into ap_verify now, provide a --skip-completed command-line flag that does the following:
- runs ap_pipe with the --reuse-outputs-from all command-line argument, which skips completed pipeline steps (currently, up through association).
- makes MetricsControllerTask check for a job file associated with each data ID, and skips processing that data ID if the file already exists
This flag should be enough to let us retry large runs efficiently until Gen 2 is retired.