  1. Data Management
  2. DM-15214

QuantumGraph option to filter existing output Datasets

    Details

      Description

      The current version of the QG builder does not check whether any of the output Datasets it generates already exist in the Registry. We need an option to either filter those existing Datasets out of the output or raise an exception.
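A minimal sketch of the requested option, for illustration only. `Quantum`, `filter_existing`, `skip_existing`, and `OutputExistsError` are hypothetical names rather than the actual GraphBuilder API, and the Registry is modeled as a plain set of existing dataset names. Keeping quanta that are only partially complete is an assumption of this sketch.

```python
from dataclasses import dataclass, field


class OutputExistsError(RuntimeError):
    """Raised when outputs already exist and skipping is disabled (hypothetical)."""


@dataclass
class Quantum:
    # Simplified stand-in: a task name plus the names of its output datasets.
    task: str
    outputs: list = field(default_factory=list)


def filter_existing(quanta, existing, skip_existing=True):
    """Drop quanta whose outputs all exist, or raise if skipping is disabled."""
    kept = []
    for quantum in quanta:
        present = [name for name in quantum.outputs if name in existing]
        if not present:
            # Nothing has been produced yet: the quantum must run.
            kept.append(quantum)
        elif not skip_existing:
            raise OutputExistsError(
                f"{quantum.task}: outputs already exist: {present}"
            )
        elif len(present) < len(quantum.outputs):
            # Assumption: partially complete quanta are kept so that the
            # missing outputs can still be produced.
            kept.append(quantum)
    return kept
```

With `skip_existing=True` the call silently returns the reduced list; with `skip_existing=False` the first quantum that has any existing output raises.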

          Activity

          salnikov Andy Salnikov added a comment -

          Jim Bosch, I think this is ready for review now. Pre-flight now returns a bit more information about input/output datasets, which allows GraphBuilder to either filter out Quanta for which all outputs exist or raise an exception.

          Pre-flight now also implements dataset search across multiple input collections, taking the order of collections into account. The returned DatasetRef now has an id, which can be used to resolve collection ambiguity in the post-preflight stage.
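The ordered search can be pictured roughly as follows. This is a sketch, not the pre-flight implementation: `find_dataset` is a hypothetical helper and the Registry is modeled as a dict keyed by (collection, dataset type, data ID).

```python
def find_dataset(collections, dataset_type, data_id, registry):
    """Return (dataset_id, collection) for the first collection with a match.

    Collections are searched in the order given, so an earlier collection
    takes precedence when the same dataset exists in several of them.
    """
    for collection in collections:
        dataset_id = registry.get((collection, dataset_type, data_id))
        if dataset_id is not None:
            return dataset_id, collection
    return None
```

Returning the id together with the collection is what lets a later stage tell apart two datasets that share a dataset type and data ID but live in different collections.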

          Tim Jenness, you are welcome to continue reviewing my changes too.

          jbosch Jim Bosch added a comment (edited) -

          This is really impressive, Andy Salnikov. I've left a number of minor and for-the-future comments on the PRs. I especially appreciate all the effort you've gone to to document and comment the algorithms - it's still sufficiently complex that I worry about anyone being able to maintain it other than you, but all of the examples you've given really help with that.

          I'd like to expand on those "comments for the future" here:

          I think the object currently called PreFlightUnitsRow has the potential to become a really fundamental piece of the Registry API, well beyond just preflight. Iterators that yield PreFlightUnitsRow could be how we represent a repository Subset, which is something we've talked about conceptually for a long time without ever defining it at the code level. Some possibilities:

          • You could produce a Subset given a WHERE expression, some Collections, and some DatasetTypes (via selectDataUnits).
          • You could produce a Subset via custom, hand-written SQL.
          • You could consume a Subset to produce a QuantumGraph for new processing to be done.
          • You could consume a Subset to produce a QuantumGraph representing the provenance of processing already done.
          • You could consume a Subset to produce a set of DataIds (i.e. removing duplicates).
          • You could consume a Subset to produce a set of DatasetRefs.
          • You could consume a Subset to dump Registry content to (e.g.) CSV for export/transfer.
          • You could filter (consume and produce) a Subset in many different ways: only certain DatasetTypes, require certain DataUnit links to be non-null, many different kinds of spatial filters, etc.
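A rough sketch of how producers, filters, and consumers could compose if a Subset were a plain iterator. `Row` is a hypothetical stand-in for PreFlightUnitsRow, and its attributes and all function names here are assumptions made for illustration.

```python
from typing import NamedTuple


class Row(NamedTuple):
    # Hypothetical stand-in for PreFlightUnitsRow.
    data_id: dict       # link name -> value, e.g. {"visit": 1, "detector": 2}
    dataset_refs: dict  # dataset type name -> dataset id


def only_dataset_types(subset, names):
    """Filter: keep only the requested dataset types in each row."""
    wanted = set(names)
    for row in subset:
        refs = {k: v for k, v in row.dataset_refs.items() if k in wanted}
        yield Row(row.data_id, refs)


def require_links(subset, links):
    """Filter: keep rows in which every named link is present and non-null."""
    for row in subset:
        if all(row.data_id.get(link) is not None for link in links):
            yield row


def unique_data_ids(subset):
    """Consumer: collect data IDs in first-seen order, dropping duplicates."""
    seen = set()
    result = []
    for row in subset:
        key = frozenset(row.data_id.items())
        if key not in seen:
            seen.add(key)
            result.append(row.data_id)
    return result
```

Because each filter is a generator, filters can be chained in any order, for example `unique_data_ids(require_links(only_dataset_types(rows, ["raw"]), ["detector"]))`, and no intermediate list is built.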

          I think DM-15034 will need to be done before we start generalizing the concept to all of those use cases - that ticket should move some of the smarts in selectDataUnits into other classes, and it might change how we decide to map what PreFlightUnitsRow conceptually represents (DataId/DatasetRef/Region) into its actual Python attributes. I really need to find time to work on that ticket. Obviously, when we do generalize the code we have now into that Subset concept, we'd want to rename PreFlightUnitsRow to something more general-sounding at the same time.

          In the meantime, I think it'd be useful for you to just keep this idea in mind (and by all means chime in with your thoughts). In particular, I'm curious about:

          • Can we refactor any more of the preflight logic into subset producers/consumers/filters? The post-query spatial filtering to remove disjoint spatial DataUnits seems like at least one candidate for a filter.
          • Is a Subset just a regular iterator over PreFlightUnitsRow, or does it need to be a special iterator with some kind of header that describes what's common to all rows? Regular iterators would be really nice, because then we could use all kinds of itertools operations and generator-expression syntax on them.
          • Related: would it ever be useful/possible to support heterogeneous Subsets, in which the set of links in the DataId changes from row to row? Could we implement that by just treating missing links as null/None?
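To make the last two questions concrete, here is a small sketch in which rows are plain dicts of links, a missing link reads as None, and ordinary itertools tools work directly on the iterator. The helper names `link` and `group_by_link` are hypothetical.

```python
import itertools


def link(row, name):
    """Return the value of a link, or None when the row does not carry it."""
    return row.get(name)


def group_by_link(rows, name):
    """Group rows that carry the given link by its value; others are skipped."""
    present = (row for row in rows if link(row, name) is not None)
    # groupby only merges adjacent items, so sort by the key first.
    ordered = sorted(present, key=lambda row: link(row, name))
    return {
        value: list(group)
        for value, group in itertools.groupby(
            ordered, key=lambda row: link(row, name)
        )
    }
```

In this model a heterogeneous Subset needs no header: a row such as `{"tract": 5, "patch": 3}` simply reports None for `visit` and drops out of any grouping on that link.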
          salnikov Andy Salnikov added a comment -

          Jim, no objections to trying to generalize PreFlightUnitsRow if it can be made useful elsewhere. I'm not sure I have deep enough knowledge of all the concepts; I'm mostly discovering things by looking at them from the pre-flight side.

          Can we refactor any more of the preflight logic into subset producers/consumers/filters? The post-query spatial filtering to remove disjoint spatial DataUnits seems like at least one candidate for a filter.

          I'm sure things can be made more modular and reusable. Spatial filtering can probably be moved to a separate filter, but I suspect that it would be a 100% obligatory filter; I can't imagine anyone wanting to deal with regions that don't overlap (ideally all selection should be based on regions only; I consider SkyMap to be just an optimization that most people don't need to worry about).
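As an illustration of spatial filtering as a separate step, the sketch below drops rows whose two regions are disjoint. Axis-aligned boxes are used purely as a toy stand-in for real spherical regions, and `Box`, `overlaps`, and `drop_disjoint` are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box used as a toy region (real regions are spherical)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a, b):
    """Return True when the two boxes share at least one point."""
    return a.x0 <= b.x1 and b.x0 <= a.x1 and a.y0 <= b.y1 and b.y0 <= a.y1


def drop_disjoint(rows):
    """Yield only (data_id, region_a, region_b) rows whose regions overlap."""
    for data_id, region_a, region_b in rows:
        if overlaps(region_a, region_b):
            yield data_id, region_a, region_b
```

Written as a generator, the filter can always be applied right after the query, which matches the expectation that it would be obligatory in practice.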

          Is a Subset just a regular iterator over PreFlightUnitsRow, or does it need to be a special iterator with some kind of header that describes what's common to all rows? Regular iterators would be really nice, because then we could use all kinds of itertools operations and generator-expression syntax on them.

          It is a regular iterator; all structure is contained in the record itself (which may not be very efficient if we have large volumes of those records).

          Would it ever be useful/possible to support heterogeneous Subsets, in which the set of links in the DataId changes from row to row? Could we implement that by just treating missing links as null/None?

          This could be useful for some cases, I guess. For pre-flight I don't think it matters as we still need to know which missing things we need to make.

          BTW, I don't see your comments on the PR; did you click on Submit Review?

          jbosch Jim Bosch added a comment -

          I had indeed forgotten to hit Submit in GitHub; now done.

          salnikov Andy Salnikov added a comment -

          I think I have addressed all concerns. Jenkins has passed (and I had to rebase twice), so both packages are now merged.


            People

            • Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Reviewers:
              Jim Bosch
              Watchers:
              Andy Salnikov, Jim Bosch