  1. Data Management
  2. DM-17599

Add debugging utilities for empty QuantumGraphs

    Details

    • Story Points:
      4
    • Team:
      Data Access and Database

      Description

      One of the most common failure modes while developing concrete PipelineTasks is having QuantumGraph generation yield an empty graph, due to some problem with dataset type definitions or the state of the input repo.  We should brainstorm what kinds of diagnostics would be useful for debugging that case.

      To start with, if an empty graph is generated and a PipelineTask has an input dataset type that is not produced by any other task in the Pipeline and has no datasets in any input collection, we should emit a warning that names the task and the dataset type.
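The check described above could be sketched roughly as follows. This is a minimal illustration using plain Python containers, not the actual Pipeline or Registry interfaces: `TaskDef`, `find_missing_overall_inputs`, the logger name, and the `dataset_counts` mapping (dataset type name to number of datasets found in the input collections) are all hypothetical names introduced here.

```python
import logging
from dataclasses import dataclass, field

_LOG = logging.getLogger("emptyGraphDiagnostics")  # hypothetical logger name


@dataclass
class TaskDef:
    """Hypothetical stand-in for a task and its dataset type names."""

    name: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)


def find_missing_overall_inputs(tasks, dataset_counts):
    """Return (task name, dataset type) pairs for inputs that no task in the
    pipeline produces and that have no datasets in the input collections."""
    # Dataset types produced inside the pipeline are never "overall inputs".
    produced = set().union(*(t.outputs for t in tasks))
    missing = []
    for task in tasks:
        for dataset_type in sorted(task.inputs):
            if dataset_type in produced:
                continue
            if dataset_counts.get(dataset_type, 0) == 0:
                _LOG.warning(
                    "Task %s: input dataset type %s has no datasets in any "
                    "input collection", task.name, dataset_type,
                )
                missing.append((task.name, dataset_type))
    return missing


pipeline = [
    TaskDef("isr", {"raw"}, {"postISRCCD"}),
    TaskDef("calibrate", {"postISRCCD", "ref_cat"}, {"calexp"}),
]
result = find_missing_overall_inputs(pipeline, {"raw": 10, "ref_cat": 0})
```

In this example only `ref_cat` is reported: `raw` has datasets, and `postISRCCD` is produced by another task, so its absence from the input collections is expected.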

      Any other ideas?

          Activity

          jbosch Jim Bosch created issue -
          salnikov Andy Salnikov added a comment -

          I thought about this a little before, and I think there are almost unlimited ways for things to go wrong when generating that one big query. The query that we generate can be thought of as:

          1. A dimensions query that generates the complete set of dimensions (the universe), not constrained by anything.
          2. We constrain this universe with pre-existing datasets of the input dataset types.
          3. We further constrain it with the user-provided expression (and in the future probably with an implicit SkyMap and Instrument).
          4. Optionally, we remove quanta whose output datasets all already exist (this is not part of the query but is done by graphBuilder).
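One way to picture a stage-by-stage diagnostic for steps 1 to 3 is to apply the constraints cumulatively and report the first one that empties the selection. The sketch below models data IDs as plain hashable values and each constraint as a Python set or predicate; `first_empty_stage` and its arguments are hypothetical and do not correspond to the real query code, which works in SQL.

```python
def first_empty_stage(universe, datasets_by_type, user_predicate):
    """Apply constraints in order and name the first stage that leaves
    nothing selected; return None if the final selection is non-empty."""
    remaining = set(universe)
    if not remaining:
        return "dimensions universe"
    # Step 2: keep only data IDs that have a dataset of every input type.
    for name, data_ids in datasets_by_type.items():
        remaining &= set(data_ids)
        if not remaining:
            return f"input dataset type {name!r}"
    # Step 3: apply the user expression, modelled here as a predicate.
    remaining = {data_id for data_id in remaining if user_predicate(data_id)}
    if not remaining:
        return "user expression"
    return None


visits = {1, 2, 3, 4}
stage = first_empty_stage(
    visits,
    {"raw": {1, 2}, "flat": {3, 4}},  # non-overlapping visit ranges
    lambda visit: visit > 0,
)
```

Here `stage` names `flat`, although `flat` has datasets of its own: the reported stage depends on the order in which constraints are applied, so it identifies where the selection became empty rather than which input is at fault.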

          #1 should not normally generate an empty result set unless the database itself is inconsistent, but it may still be prudent for debugging to check that this step is not empty. Also, if we do constrain SkyMap/Instrument at this step, then an incorrectly specified SkyMap or Instrument name can produce an empty result set. And if we do not constrain things at this point, it may also be reasonable to check that things that are supposed to be unique (SkyMap) actually appear only once.

          #2 I think is covered by your comment: if there are no datasets for some input dataset type, this will also result in an empty graph. Preflight does not know task names, so it cannot generate a message that includes them, but it knows the dataset type name, and GraphBuilder can use that to find the tasks that have that dataset type as input. Finding which dataset type has no datasets probably means that we need to take the dimensions query (possibly with the user-defined expression) and run it repeatedly with individual dataset types. There are of course more complicated cases where two input dataset types each have datasets but their combination is empty (e.g. non-overlapping visit ranges).
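The "run it repeatedly with individual dataset types" idea can be sketched as independent probes, which also separates the simple case (one type has nothing) from the harder one (each type has something, but the combination is empty). As before, sets of data IDs stand in for query results, and `diagnose_inputs` is a hypothetical helper rather than anything in Preflight or GraphBuilder.

```python
def diagnose_inputs(base_selection, datasets_by_type):
    """Probe each input dataset type on its own against the base selection,
    then check whether all of them together leave anything selected."""
    base = set(base_selection)
    # Types that select nothing even when applied alone.
    empty_alone = sorted(
        name for name, ids in datasets_by_type.items() if not (base & set(ids))
    )
    # All types applied together, as the full query would do.
    combined = set(base)
    for ids in datasets_by_type.values():
        combined &= set(ids)
    return {"empty_alone": empty_alone, "combined_empty": not combined}


base = {1, 2, 3, 4}
simple_case = diagnose_inputs(base, {"raw": {1, 2}, "bias": set()})
overlap_case = diagnose_inputs(base, {"raw": {1, 2}, "flat": {3, 4}})
```

In `simple_case` the probe points directly at `bias`. In `overlap_case` no single type is empty, so `empty_alone` is empty while `combined_empty` is true, which is the signal that the inputs do not overlap.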

          #3 The user expression can also be too restrictive or incorrect. In principle it can be tested with the dimensions query separately from the dataset joins. Still, as with datasets, it is possible that each constraint separately makes a non-empty selection but combining everything yields an empty one. (We also need better diagnostics for incorrect user expressions; that would mean replacing the current text-based expressions with a better representation from which we can generate sqlalchemy code.)

          #4 is probably the easiest to detect, because it should start with a non-empty selection and result in a graph that has some or all tasks excluded.
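Detection for step 4 might amount to counting, per task, how many quanta were dropped because all of their outputs already exist. The following is only a sketch: quanta are modelled as `(task name, output names)` pairs and `prune_existing` is a hypothetical function, not the graphBuilder code.

```python
from collections import Counter


def prune_existing(quanta, existing_outputs):
    """Drop quanta whose outputs all already exist; return the quanta that
    are kept and a per-task count of the quanta that were skipped."""
    kept = []
    skipped = Counter()
    for task_name, outputs in quanta:
        # A quantum is skipped only if every one of its outputs exists.
        if outputs and set(outputs) <= existing_outputs:
            skipped[task_name] += 1
        else:
            kept.append((task_name, outputs))
    return kept, dict(skipped)


quanta = [("isr", {"a"}), ("isr", {"b"}), ("calibrate", {"c"})]
kept, skipped = prune_existing(quanta, {"a", "b"})
```

If `kept` is empty while `skipped` is not, the graph is empty because everything was already done, which is a different message from "no inputs found".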

          mgower Michelle Gower added a comment -

          BPS would like a way to tell pipetask whether or not to exit with 0 if the final quantum graph is empty (or, if this is someday called as a function, whether or not to raise an exception). BPS won't care which is the default.
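A rough sketch of what such a switch could look like, covering both the command-line and the function-call variants. The flag name `--fail-on-empty-graph`, the `EmptyGraphError` exception, and both functions are hypothetical and are not existing pipetask options; the default chosen here (exit with 0) is arbitrary.

```python
import argparse


class EmptyGraphError(RuntimeError):
    """Hypothetical exception raised when an empty graph is not acceptable."""


def check_graph(n_quanta, fail_on_empty):
    """Function-call variant: raise only if the caller asked for it."""
    if n_quanta == 0 and fail_on_empty:
        raise EmptyGraphError("QuantumGraph is empty")


def main(argv, n_quanta):
    """Command-line variant: map the same condition onto an exit code."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument(
        "--fail-on-empty-graph",  # hypothetical flag name
        action="store_true",
        help="exit with a non-zero status if the graph has no quanta",
    )
    args = parser.parse_args(argv)
    try:
        check_graph(n_quanta, args.fail_on_empty_graph)
    except EmptyGraphError:
        return 1
    return 0


default_status = main([], n_quanta=0)
strict_status = main(["--fail-on-empty-graph"], n_quanta=0)
```

Keeping the exception in a separate function means that a caller using the Python interface and a caller using the command line get the same behaviour from one switch.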

          jbosch Jim Bosch added a comment -

          Adding a wild guess at story points for planning purposes, which I would be happy to have updated by Andy Salnikov or interpreted as a time-box for this rather open-ended feature request.

          jbosch Jim Bosch made changes -
          Field: Story Points | Original Value: (none) | New Value: 4
          jbosch Jim Bosch made changes -
          Field: Labels | Original Value: gen3-middleware | New Value: gen2-deprecation-debt gen3-middleware

            People

            • Assignee:
              Unassigned
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Andy Salnikov, Christopher Waters, Jim Bosch, Michelle Gower, Nate Lust, Yusra AlSayyad
