Data Management / DM-21904

Improve QuantumGraph generation for Jointcal

    Details

    • Story Points:
      8
    • Sprint:
      DRP S20-5 (Apr)
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      Our current QuantumGraph generation algorithm doesn't quite work for jointcal - it currently assumes we want the intersection of all dimensions involved in a quantum, which means the intersection of the tract region and the (visit, detector) region for jointcal.  But jointcal actually wants any (visit, detector) for which the visit region overlaps the tract region, so we need an extra level of indirection somewhere.
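The difference between the two selection rules can be illustrated with a toy model. This is only a sketch under simplifying assumptions: regions are axis-aligned boxes, the visit region is taken to be the bounding box of its detector regions, and `Box`, `select_by_detector_overlap`, and `select_by_visit_overlap` are hypothetical names, not part of the Butler API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Box:
    """Toy axis-aligned region (a stand-in for a real sky region)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def bounding_box(boxes: Iterable[Box]) -> Box:
    """Approximate a visit region as the bounding box of its detectors."""
    boxes = list(boxes)
    return Box(min(b.x0 for b in boxes), min(b.y0 for b in boxes),
               max(b.x1 for b in boxes), max(b.y1 for b in boxes))


DetectorRegions = Dict[Tuple[int, int], Box]  # (visit, detector) -> region


def select_by_detector_overlap(tract: Box, regions: DetectorRegions) -> List[Tuple[int, int]]:
    """Current behavior: keep only detectors whose own region overlaps the tract."""
    return sorted(key for key, box in regions.items() if box.overlaps(tract))


def select_by_visit_overlap(tract: Box, regions: DetectorRegions) -> List[Tuple[int, int]]:
    """What jointcal wants: keep every detector of any visit that overlaps the tract."""
    by_visit: Dict[int, List[Box]] = {}
    for (visit, _), box in regions.items():
        by_visit.setdefault(visit, []).append(box)
    visits = {v for v, boxes in by_visit.items() if bounding_box(boxes).overlaps(tract)}
    return sorted(key for key in regions if key[0] in visits)


tract = Box(0, 0, 10, 10)
regions = {
    (1, 0): Box(8, 8, 12, 12),    # overlaps the tract
    (1, 1): Box(12, 8, 16, 12),   # same visit, but outside the tract
    (2, 0): Box(20, 20, 24, 24),  # a visit far from the tract
}
print(select_by_detector_overlap(tract, regions))
print(select_by_visit_overlap(tract, regions))
```

In this toy setup the first rule drops detector 1 of visit 1, while the second keeps it because the visit as a whole touches the tract.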

      The lower-level Registry.queryDimensions API already has some support for this: it can be told to use visit regions instead of (visit, detector) regions even if the latter are available. That would need to be coupled with additional logic to inform the QG-generation code that it should do this. This information would need to come from the Jointcal PipelineTask, but we need to reconcile that with the needs of any other tasks in the same Pipeline, and that's tricky.

      The problem would be easier if we declare jointcal's inputs to be prerequisites, meaning that jointcal could not be run in the same Pipeline as the tasks that produce those inputs.  We should look at least briefly for options without that restriction first.

            Activity

            jbosch Jim Bosch added a comment -

            I've described a workaround for this on DM-17169; that would let us run jointcal either in its own Pipeline, or as part of a Pipeline that also includes downstream steps, but not as part of a Pipeline that produces any of its inputs.

            jbosch Jim Bosch added a comment -

            Nate Lust and I discussed the more general problem on 2019/11/04, and came up with an algorithm that should work.  It's a bigger change than the more limited DM-17169 approach, so we're currently planning to do that first to unblock jointcal PipelineTask conversion, but the more general algorithm has advantages that go beyond enabling jointcal to run inside a complete DRP pipeline, and hence we'd definitely like to get it done.  I don't think doing it is a blocker on Gen2 deprecation.  A description of the algorithm follows.

            Interface Changes

            Instead of passing a single data ID query expression when generating a QuantumGraph, users would be allowed to pass different expressions for different dataset types. 

            • In most cases, only one expression would still be passed, and it would refer to a pure output dataset (one that is not used as an input by any Tasks in the Pipeline).  The most common exception would be when explicitly selecting the visits that go into a coadd of a particular patch, in a pipeline that includes coaddition: in this case, one would pass a tract+patch expression for the coadd dataset or a catalog derived from it, and a visit+detector expression for calexp or warp.
            • Usually, these expressions would involve only the dimensions of those dataset types, but we could support spatially- or temporally-joined dimensions in at least some cases (e.g. the expressions for dataset types that are pure outputs), and we might be able to support them always.  This would make it possible to (for example) run ISR on a tract, even though nothing in IsrTask references skymaps.
            • We envision some kind of syntactic sugar that makes it easy to pass a single expression without an explicit DatasetType in the common case, such as a per-Pipeline or per-PipelineTask implied DatasetType.  Details of this are TBD.
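One possible shape for the "single expression or per-dataset-type expressions" interface is sketched below. Everything here is an assumption for illustration: the function `normalize_expressions`, the `implied_dataset_type` parameter, the dataset type name `deepCoadd`, and the expression strings are hypothetical, since the details are explicitly TBD.

```python
from typing import Dict, Mapping, Optional, Union


def normalize_expressions(
    expressions: Union[str, Mapping[str, str]],
    implied_dataset_type: Optional[str] = None,
) -> Dict[str, str]:
    """Return a {dataset type name: expression} mapping.

    A bare string is attached to the implied dataset type (the hypothetical
    per-Pipeline default); a mapping is passed through unchanged.
    """
    if isinstance(expressions, str):
        if implied_dataset_type is None:
            raise ValueError("a bare expression needs an implied dataset type")
        return {implied_dataset_type: expressions}
    if not expressions:
        raise ValueError("at least one query expression must be provided")
    return dict(expressions)


# Common case: one expression, attached to the Pipeline's implied output.
single = normalize_expressions("tract = 0 AND patch = 3", implied_dataset_type="deepCoadd")

# Coaddition with explicit visit selection: one expression per dataset type.
multi = normalize_expressions({
    "deepCoadd": "tract = 0 AND patch = 3",
    "calexp": "visit IN (100, 101, 102)",
})
print(single)
print(multi)
```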

            PipelineTasks (or equivalently and more likely their Connections classes) would be expected to implement three new methods, but with default base class implementations that should be sufficient for all current PipelineTasks (jointcal and DECam ISR are the only currently-known exceptions); the definitions of these will be discussed later:

            • defineQuantaForward(registry: Registry, inputs: Dict[DatasetType, Union[List[DatasetRef], Query]]) -> List[Quantum]: Given references to one or more input DatasetTypes, construct all Quanta that can be run with only those inputs for the given DatasetTypes and whatever is in the given registry for all other input DatasetTypes, subject to the dimension constraints in the given query expressions.
            • defineQuantaBackward(registry: Registry, outputs: Dict[DatasetType, Union[List[DatasetRef], Query]]) -> List[Quantum]: Given references to one or more output DatasetTypes, construct Quanta that produce as many of them as possible, with inputs constrained by the dimensions in the given query expressions (but not by which datasets are already in the registry).
            • checkQuantum(quantum: Quantum) -> bool: Check whether the inputs of the given Quantum are sufficient for it to run.

            A PipelineTask may choose not to implement defineQuantaBackward to indicate that it can only be used as the tail of a Pipeline; this seems appropriate for MetricTasks, CPP "Combine" Tasks, and possibly other things we have not anticipated.  This essentially takes the place of the current "Prerequisite" input concept; this change removes the distinction between those and regular inputs.
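The three methods could be laid out roughly as follows. This is a sketch, not the actual pipe_base design: `DatasetType`, `DatasetRef`, `Quantum`, `Registry`, and `Query` are minimal stand-ins for the real classes, the default bodies are placeholders rather than the real default implementations, and signalling "tail only" by setting `defineQuantaBackward = None` is just one assumed convention.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


# Minimal stand-ins for the real Butler / pipe_base classes.
@dataclass(frozen=True)
class DatasetType:
    name: str


@dataclass(frozen=True)
class DatasetRef:
    datasetType: DatasetType
    dataId: Tuple[Tuple[str, int], ...]


@dataclass
class Quantum:
    taskName: str
    inputs: List[DatasetRef] = field(default_factory=list)
    outputs: List[DatasetRef] = field(default_factory=list)


class Registry:
    """Placeholder for the data repository registry."""


class Query:
    """Placeholder for a data ID query expression."""


Constraint = Union[List[DatasetRef], Query]


class QuantaDefinitionBase:
    """Hypothetical base class carrying the three proposed hooks."""

    def defineQuantaForward(self, registry: Registry,
                            inputs: Dict[DatasetType, Constraint]) -> List[Quantum]:
        raise NotImplementedError("placeholder for the default implementation")

    def defineQuantaBackward(self, registry: Registry,
                             outputs: Dict[DatasetType, Constraint]) -> List[Quantum]:
        raise NotImplementedError("placeholder for the default implementation")

    def checkQuantum(self, quantum: Quantum) -> bool:
        # Placeholder default: runnable if at least one input survived.
        return bool(quantum.inputs)


class MetricConnections(QuantaDefinitionBase):
    """A task usable only as the tail of a Pipeline (assumed convention)."""
    defineQuantaBackward = None


def is_tail_only(connections: QuantaDefinitionBase) -> bool:
    return getattr(connections, "defineQuantaBackward", None) is None


print(is_tail_only(QuantaDefinitionBase()), is_tail_only(MetricConnections()))
```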

            Algorithm

            1. Identify all data ID expressions associated with DatasetTypes that are not input-output dependent on any other DatasetType with a query expression. These are the starting DatasetTypes.  Variant: run SELECT COUNT queries on all given query expressions as a heuristic to guess which one(s) to start with.
            2. For each PipelineTask that produces at least one such DatasetType, call defineQuantaBackward and add the created Quanta to the graph.  If there are no such PipelineTasks (i.e. the expressions are only on pure input DatasetTypes), instead call defineQuantaForward on all PipelineTasks that take those DatasetTypes as inputs; add those Quanta to the graph.
            3. Starting from those DatasetTypes, walk the Pipeline graph breadth-first from output-to-input, calling defineQuantaBackward, and (again breadth-first) input-to-output, calling defineQuantaForward, until one of these methods has been called on all PipelineTasks.
            4. When pure input dataset types are reached, query the registry to test that the needed instances exist, subject to the constraints of any data ID query expressions.  Prune any missing datasets from their Quanta.
            5. Call checkQuantum on any Quanta that involve pure input datasets.  Prune Quanta for which this returns False, and then prune the output datasets of those Quanta from any Quanta in which they are inputs.  Call checkQuantum on these, and iterate.
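The pruning in steps 4 and 5 can be sketched as a fixed-point loop. This is a simplified illustration under stated assumptions: datasets are plain strings, `Quantum` is a toy class with a hypothetical `min_inputs` field, and every surviving quantum is rechecked on each pass rather than only those whose inputs changed.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Quantum:
    """Toy quantum; datasets are plain strings and min_inputs is hypothetical."""
    name: str
    inputs: Set[str]
    outputs: Set[str]
    min_inputs: int = 1


def prune(quanta: List[Quantum], existing: Set[str],
          check: Callable[[Quantum], bool]) -> List[Quantum]:
    """Drop missing pure inputs, then remove unrunnable quanta until stable.

    Note: this mutates the ``inputs`` sets of the quanta it is given.
    """
    produced = {d for q in quanta for d in q.outputs}
    # Step 4: a pure input must already exist in the registry.
    for q in quanta:
        q.inputs = {d for d in q.inputs if d in produced or d in existing}
    # Step 5: prune failing quanta, cascade to their consumers, and repeat.
    alive = list(quanta)
    changed = True
    while changed:
        changed = False
        survivors: List[Quantum] = []
        lost: Set[str] = set()
        for q in alive:
            if check(q):
                survivors.append(q)
            else:
                lost |= q.outputs
                changed = True
        for q in survivors:
            q.inputs -= lost
        alive = survivors
    return alive


def example_quanta() -> List[Quantum]:
    return [
        Quantum("calibrate-1", {"raw/1"}, {"calexp/1"}),
        Quantum("calibrate-2", {"raw/2"}, {"calexp/2"}),
        Quantum("coadd", {"calexp/1", "calexp/2"}, {"coadd/0"}),
        Quantum("measure", {"coadd/0"}, {"catalog/0"}),
    ]


def enough_inputs(q: Quantum) -> bool:
    return len(q.inputs) >= q.min_inputs


kept = prune(example_quanta(), existing={"raw/1"}, check=enough_inputs)
print([q.name for q in kept])
```

With only `raw/1` present, the second calibration quantum is pruned and the coadd quantum survives with a single input; with no raw data at all, the pruning cascades until the graph is empty.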

            Implications

            • More flexibility: by delegating much of QuantumGraph generation to individual PipelineTasks, we may be able to solve problems we have not anticipated, in addition to those present in jointcal and DECam ISR.
            • Probably faster queries: this replaces the monolithic and extremely complex multi-join query with many (~per data ID) simpler queries.  Critically, the smaller queries should only ever have to do one-way spatial region or temporal interval lookups, rather than two-way spatial/temporal overlap joins.  There may be more overhead due to the much larger number of queries, but it will be much harder to confuse the query optimizer and get horrible scaling, as we often seem to do now.
            • "Empty QuantumGraph" errors will be much easier to debug, as we'll be able to log at every step of the algorithm, making it clear where contradictory constraints appear.
            • We will have to rely on our own heuristics/timeouts/guards to avoid catastrophically bad decision-making on where to start, instead of (effectively) delegating this to the database query optimizer's heuristics.  For example, considering the DRP pipeline, if a user data ID expression constrains the output to a single patch, it would be disastrous to start at the beginning of the pipeline and work forward by first generating single-frame processing quanta for every single visit in LSST DR10.  It would be similarly bad to start at the end and work backward by generating coadd-processing quanta for every patch on the sky, if the user expression constrains the inputs to a handful of CCD images.  This has been recognized as a fundamental problem and avoiding it has been considered a de facto requirement since the days of the SuperTask working group; it is responsible more than anything else for the current "big join query" architecture we use for QuantumGraph generation now.  The fundamental assumption that enables this new design is that we can solve this problem via our own heuristics at least as well as the database can.  To that end:
              • "start with datasets constrained by the user and work from there" addresses the vast majority of cases, in which there is only one query expression;
              • "start with the last dataset(s) constrained by the user" addresses the most common use case with multiple query expressions (specifying visits to be coadded);
              • "at least one query expression must be provided" seems necessary to avoid enormous queries by accidentally not constraining anything (especially given that Gen2 does nothing if you don't pass any data IDs, even though it iterates over everything else if you give it partial data IDs).
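The "start with the last dataset(s) constrained by the user" heuristic might look like the sketch below. It rests on one interpretation of step 1 of the algorithm: a constrained dataset type is a starting point if no other constrained dataset type is derived from it. The pipeline representation, the function names, and the task and dataset type names other than `calexp` and `warp` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# task name -> (input dataset types, output dataset types)
Pipeline = Dict[str, Tuple[Set[str], Set[str]]]


def downstream_types(pipeline: Pipeline, start: str) -> Set[str]:
    """All dataset types that are directly or indirectly derived from ``start``."""
    seen: Set[str] = set()
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for inputs, outputs in pipeline.values():
            if current in inputs:
                for out in outputs:
                    if out not in seen:
                        seen.add(out)
                        frontier.append(out)
    return seen


def starting_dataset_types(pipeline: Pipeline, constrained: Set[str]) -> List[str]:
    """Pick constrained dataset types with no constrained descendants."""
    if not constrained:
        raise ValueError("at least one query expression must be provided")
    return sorted(
        t for t in constrained
        if not (downstream_types(pipeline, t) & (constrained - {t}))
    )


pipeline: Pipeline = {
    "isr": ({"raw"}, {"postISRCCD"}),
    "calibrate": ({"postISRCCD"}, {"calexp"}),
    "makeWarp": ({"calexp"}, {"warp"}),
    "assembleCoadd": ({"warp"}, {"coadd"}),
}
print(starting_dataset_types(pipeline, {"coadd"}))
print(starting_dataset_types(pipeline, {"calexp", "coadd"}))
```

In the coaddition case with both a `calexp` and a coadd expression, only the coadd is chosen as the starting point, so generation would work backward from the patch rather than forward from every constrained visit.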

            Andy Salnikov, I'd love to get your feedback on this (no rush at all).  I've gotten the impression you've had something like this in mind in the past, and I've shot it down out of fear over the last point.

            tjenness Tim Jenness added a comment -

            Are we abandoning this ticket?

            I see that Nate Lust has made some commits exploring thread-safety in butler and async. The SQuaRE team would absolutely love it if we gave them an async compatible butler (they use async sqlalchemy everywhere else I think).

            jbosch Jim Bosch added a comment - - edited

            We are definitely not abandoning this ticket, though both the algorithm description in the comments above and the branch need at least a big update.  I'd have to talk with Nate Lust about whether updating the branch is worth it - I've certainly come around on the idea of putting more async stuff into Butler, now that we've got a Registry server to write (as Nate knows, I used to be quite skeptical), but I'd also like it to be done with a lot of intentionality and up-front design if we go that route.


              People

              Assignee:
              nlust Nate Lust
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Andy Salnikov, Christopher Waters, Jim Bosch, John Parejko, Michelle Gower, Nate Lust, Tim Jenness
