Data Management / DM-22120

ap_verify scales poorly to large runs

    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ap_verify, verify
    • Labels: None
    • Story Points: 2
    • Sprint: AP F19-6 (November)
    • Team: Alert Production

      Description

      Chris Morrison reports that ap_verify can time out when run over large datasets on lsst-dev. This is due partly to the time needed to run the pipeline itself, and partly to the time needed to iterate through all metrics (MetricsControllerTask was never parallelized). ap_verify does not offer any error-recovery options beyond rerunning the entire pipeline.

      Both ap_verify's current control system and MetricsControllerTask will become obsolete with Gen 3, where responsibility for workflow management (and any checkpointing) will lie with the pipeline activator. Rather than try to design proper restart behavior into ap_verify now, provide a --skip-completed command-line flag that does the following:

      • runs ap_pipe with the --reuse-outputs-from all command-line argument, which skips completed pipeline steps (currently, up through association);
      • makes MetricsControllerTask check for a job file associated with each data ID, and skip processing that data ID if the file already exists.

      This flag should be enough to let us retry large runs efficiently until Gen 2 is retired.

            Activity

            Krzysztof Findeisen added a comment:

            Chris Morrison, given the timing problems you mentioned on DM-14019, should the flag skip everything in ap_pipe, or just everything through image differencing?

            Jim Bosch added a comment:

            "We just ran into a problem with running ap_verify at scale (specifically, on lsst-dev). I've filed this ticket to work around it for now, but I'm working on the assumption that the more cluster-oriented Gen 3 activators will provide some kind of resume-on-failure functionality. Will that be the case eventually (I assume it definitely won't be at the time of Gen 2 deprecation)?"

            Exactly: there are plans, but we probably won't have anything in place by Gen 2 deprecation. We do already have similar (and actually much better) support for manual retries that automatically rerun only what didn't complete successfully before.

            Chris Morrison added a comment:

            Krzysztof Findeisen: for my specific case, skipping all of ap_pipe would be my preference. The ap_pipe part does run in reasonable time on 24 cores of a verification cluster node (overnight-ish), but iterating through the metric task can be pretty slow.

            Krzysztof Findeisen added a comment (edited):

            Does this work for you, Chris Morrison?

            PRs aren't showing up; they're verify#56 and ap_verify#74.

            Chris Morrison added a comment:

            So, just to be clear, this would mean:

            If I wanted to recover if the pipeline fails during metric persistence, I would use "ap_verify.py --skip-completed ..."

            If I wanted to use the same set of outputs from ap_pipe to compute new or additional metrics on the same dataset, I would use "ap_verify.py --reuse-outputs-from all ..."

            Is that correct?

            Krzysztof Findeisen added a comment:

            "If I wanted to use the same set of outputs from ap_pipe to compute new or additional metrics on the same dataset"

            Oops, I overlooked that requirement. Sorry, I'll split the two skips into separate arguments.

            Krzysztof Findeisen added a comment (edited):

            Okay, it's ready for review now. Revised UI:

            • --skip-pipeline skips the AP pipeline entirely (rather than using --reuse-outputs-from). This preserves all datasets, including metadata, but doesn't help you if the pipeline fails and you need to recover.
            • --skip-existing-metrics will avoid recalculating metrics for any data ID that's already been written to disk. This can be used for recovery, but won't recognize if a file got corrupted (e.g., pipeline crashed mid-write).
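
            For illustration, the two independent flags could be wired up as below. The flag names come from the comment above, but the argparse plumbing is a sketch, not the actual ap_verify argument handler.

```python
import argparse

# Hypothetical parser mirroring the revised UI described above.
parser = argparse.ArgumentParser(prog="ap_verify.py")
parser.add_argument("--skip-pipeline", action="store_true",
                    help="Reuse all existing pipeline outputs instead of "
                         "running the AP pipeline.")
parser.add_argument("--skip-existing-metrics", action="store_true",
                    help="Do not recompute metrics for data IDs whose job "
                         "files are already on disk.")

# Recovery after a crash during metric persistence:
args = parser.parse_args(["--skip-existing-metrics"])
```

            Because the flags are independent, a rerun that only needs new metrics can combine --skip-pipeline with a fresh metrics configuration, while crash recovery uses --skip-existing-metrics alone.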
            Chris Morrison added a comment:

            Looks good. Thanks for getting to this so fast.


              People

              • Assignee: Krzysztof Findeisen
              • Reporter: Krzysztof Findeisen
              • Reviewers: Chris Morrison
              • Watchers: Chris Morrison, Jim Bosch, Krzysztof Findeisen
