Request For Comments / RFC-407

Improve interface for clobbering vs. reusing intermediate data products

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      The top-level scripts in pipe_drivers each have several clobber* config options to control whether to skip processing steps whose output data products are already found in the input repository or its parents. These are powerful tools for experienced operators who want to be able to restart a large run without repeating any already-completed processing, but they are a constant source of confusion for even long-term developers for a number of reasons:

      1. These options are regular configuration options, which means they're mixed with the multitude of algorithmic configuration options and it's hard to identify the complete set for a particular driver.

      2. The fact that they're regular configuration options also means that --clobber-config must be used when they are changed, even though they do not affect the behavior of the algorithms.

      3. Because all input data repositories are searched when checking whether an intermediate data product exists, not just the output data repository, skipped steps are frequently surprising to (and unnoticed by) users, who reasonably expect that data products in an input repository (often created with a different configuration) should not prevent data products from being written to the output repository.

      4. The default configuration is to skip steps when their output data products are present, which makes (3) much more likely to occur.

      5. It is possible to specify a nonsensical combination of clobber options (e.g. for ordered steps A, B, and C that depend on each other, skip only B if its outputs exist), and this can lead to deferred and confusing errors (see DM-12499).

      The proposal below attempts to address problems (1), (2), (4), and (5) above:

      • Remove all clobber* config options.
      • Add a --reuse-outputs-from command-line option (not a config option) to all pipe_drivers scripts. It takes either the special value all or the name of a single top-level subtask (e.g. mergeCoaddMeasurements in multiBandDriver.py). That subtask and all preceding subtasks will then be skipped if their output data products are already present.
      • Make the default behavior to not skip steps.
      • Add INFO-level log messages that report when a step has been skipped due to an existing data product.
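
The proposed behavior can be sketched roughly as follows. This is a minimal illustration, not the pipe_drivers implementation: the step list, the function names (reusable_steps, build_parser, run), and the outputs_exist callback are all hypothetical, and only --reuse-outputs-from, all, and mergeCoaddMeasurements come from the proposal itself.

```python
import argparse
import logging

log = logging.getLogger("reuseOutputs")

# Hypothetical ordered subtasks for one driver. Only mergeCoaddMeasurements
# is named in this RFC; the other names are placeholders.
STEPS = ["stepA", "stepB", "mergeCoaddMeasurements", "stepD"]


def reusable_steps(reuse_from, steps):
    """Return the steps that may be skipped if their outputs already exist."""
    if reuse_from is None:
        return []                      # proposed default: never skip
    if reuse_from == "all":
        return list(steps)
    if reuse_from not in steps:
        raise ValueError("unknown subtask: %s" % reuse_from)
    # The named subtask and everything before it. Because the result is
    # always a prefix of the step order, "skip only B" cannot be expressed.
    return list(steps[: steps.index(reuse_from) + 1])


def build_parser(steps):
    """Build a parser with a command-line (not config) reuse option."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--reuse-outputs-from",
        default=None,
        choices=["all"] + list(steps),
        help="skip this subtask and all preceding ones when outputs exist",
    )
    return parser


def run(steps, reuse_from, outputs_exist):
    """Run steps in order; outputs_exist(step) -> bool is supplied by the caller.

    Returns the names of the steps that were actually executed.
    """
    reusable = set(reusable_steps(reuse_from, steps))
    executed = []
    for step in steps:
        if step in reusable and outputs_exist(step):
            log.info("Skipping %s: output data products already exist", step)
            continue
        executed.append(step)          # a real driver would run the subtask here
    return executed
```

With this shape, omitting the option runs every step, and a user restarting a large run opts in to reuse explicitly on the command line without touching the persisted configuration.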

      I'd love to address (3) as well, but I'd need input from Kian-Tat Lim and/or Nate Pease on how best to go about that. I think the functionality we need is a version of Butler.datasetExists that looks only in input-output repositories, not input-only repositories, and I don't know whether that is something that already exists, something easy to add, or something structurally difficult.
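
To make the distinction in (3) concrete, here is a toy model of the lookup being asked for. It does not use or describe the Butler API: the Repo class, its writable flag, and the writable_only parameter are assumptions invented purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy repository: a set of dataset names plus a chain of parents."""
    name: str
    writable: bool                     # True for input-output, False for input-only
    datasets: set = field(default_factory=set)
    parents: list = field(default_factory=list)


def dataset_exists(repo, dataset, writable_only=False):
    """Search repo and its parents for a dataset.

    With writable_only=True, datasets found in input-only repositories
    are ignored, so they cannot cause a processing step to be skipped.
    """
    if dataset in repo.datasets and (repo.writable or not writable_only):
        return True
    return any(dataset_exists(p, dataset, writable_only) for p in repo.parents)


# An input-only repository that already holds a product, and an empty output.
inputs = Repo("input", writable=False, datasets={"productX"})
output = Repo("output", writable=True, parents=[inputs])

found_anywhere = dataset_exists(output, "productX")
found_in_output = dataset_exists(output, "productX", writable_only=True)
```

In this model the unrestricted search finds the product in the input repository, while the restricted search does not, which is the behavior a user would expect when deciding whether a step still needs to write to the output repository.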

              People

              • Assignee:
                jbosch Jim Bosch
                Reporter:
                jbosch Jim Bosch
                Watchers:
                Jim Bosch, Kian-Tat Lim, Nate Pease