Nate Lust and I discussed the more general problem on 2019/11/04, and came up with an algorithm that should work. It's a bigger change than the more limited DM-17169 approach, so we're currently planning to do that first to unblock the jointcal PipelineTask conversion, but the more general algorithm has advantages that go beyond enabling jointcal to run inside a complete DRP pipeline, and hence we'd definitely like to get it done. I don't think doing it is a blocker on Gen2 deprecation. A description of the algorithm follows.
Interface Changes
Instead of passing a single data ID query expression when generating a QuantumGraph, users would be allowed to pass different expressions for different dataset types.
In most cases, only one expression would still be passed, and it would refer to a pure output dataset (one that is not used as an input by any Tasks in the Pipeline). The most common exception would be when explicitly selecting the visits that go into a coadd of a particular patch, in a pipeline that includes coaddition: in this case, one would pass a tract+patch expression for the coadd dataset or a catalog derived from it, and a visit+detector expression for calexp or warp.
Usually, these expressions would involve only the dimensions of those dataset types, but we could support spatially- or temporally-joined dimensions in at least some cases (e.g. the expressions for dataset types that are pure outputs), and we might be able to support them always. This would make it possible to (for example) run ISR on a tract, even though nothing in IsrTask references skymaps.
We envision some kind of syntactic sugar that makes it easy to pass a single expression without an explicit DatasetType in the common case, such as a per-Pipeline or per-PipelineTask implied DatasetType. Details of this are TBD.
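Since the details are TBD, the following is only a minimal sketch of what that sugar could look like; normalize_expressions and the implied-DatasetType argument are illustrative names, not part of any existing API:

```python
def normalize_expressions(expressions, implied_dataset_type=None):
    """Normalize user input into a {dataset type name: expression} mapping.

    Hypothetical helper: a bare string is applied to an implied DatasetType
    (e.g. one designated by the Pipeline or PipelineTask), while an explicit
    mapping allows different expressions for different dataset types.
    """
    if isinstance(expressions, str):
        if implied_dataset_type is None:
            raise ValueError(
                "a bare expression string requires an implied DatasetType"
            )
        return {implied_dataset_type: expressions}
    return dict(expressions)
```

The coaddition example above would then be an explicit mapping, e.g. a tract+patch expression keyed by the coadd dataset type alongside a visit+detector expression keyed by calexp.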
PipelineTasks (or equivalently and more likely their Connections classes) would be expected to implement three new methods, but with default base class implementations that should be sufficient for all current PipelineTasks (jointcal and DECam ISR are the only currently-known exceptions); the definitions of these will be discussed later:
defineQuantaForward(registry: Registry, inputs: Dict[DatasetType, Union[List[DatasetRef], Query]]) -> List[Quantum]: Given references to one or more input DatasetTypes, construct all Quanta that can be run with only those inputs for the given DatasetTypes and whatever is in the given registry for all other input DatasetTypes, subject to the dimension constraints in the given query expressions.
defineQuantaBackward(registry: Registry, outputs: Dict[DatasetType, Union[List[DatasetRef], Query]]) -> List[Quantum]: Given references to one or more output DatasetTypes, construct Quanta that produce as many of them as possible, with inputs constrained by the dimensions in the given query expressions (but not what datasets are already in the registry!).
checkQuantum(quantum: Quantum) -> bool: Check whether the inputs of the given Quantum are sufficient for it to run.
A PipelineTask may choose not to implement defineQuantaBackward to indicate that it can only be used as the tail of a Pipeline; this seems appropriate for MetricTasks, CPP "Combine" Tasks, and possibly other things we have not anticipated. This essentially takes the place of the current "Prerequisite" input concept, removing the distinction between those and regular inputs.
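A skeleton of how the base-class defaults might be structured is sketched below; ConnectionsBase, the simplified Quantum, and the input_names attribute are illustrative stand-ins only (the real classes would live in lsst.pipe.base / lsst.daf.butler and are more involved):

```python
from typing import List, Optional


class Quantum:
    """Simplified stand-in: inputs/outputs map dataset type names to refs."""

    def __init__(self, inputs, outputs):
        self.inputs = inputs    # dict: dataset type name -> list of refs
        self.outputs = outputs  # dict: dataset type name -> list of refs


class ConnectionsBase:
    """Illustrative base class for the three proposed hooks."""

    input_names: List[str] = []

    def defineQuantaForward(self, registry, inputs) -> List[Quantum]:
        # Default would expand data IDs via the registry; elided here.
        raise NotImplementedError

    def defineQuantaBackward(self, registry, outputs) -> Optional[List[Quantum]]:
        # Returning None signals "tail-only" tasks (MetricTasks,
        # CPP "Combine" Tasks) that cannot be traversed backward from.
        return None

    def checkQuantum(self, quantum: Quantum) -> bool:
        # Default: runnable iff every declared input has at least one ref,
        # which should suffice for all current PipelineTasks except
        # jointcal-like tasks, which would override this.
        return all(quantum.inputs.get(name) for name in self.input_names)
```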
Algorithm
Identify all data ID expressions associated with DatasetTypes that are not input-output dependent on any other DatasetType with a query expression. These are the starting DatasetTypes. Variant: use SELECT COUNT queries derived from all given query expressions as a heuristic to guess which one(s) to start with.
For each PipelineTask that produces at least one such DatasetType, call defineQuantaBackward and add the created Quanta to the graph. If there are no such PipelineTasks (i.e. the expressions are only on pure input DatasetTypes), instead call defineQuantaForward on all PipelineTasks that take those DatasetTypes as inputs; add those Quanta to the graph.
Starting from those DatasetTypes, walk the Pipeline graph breadth-first from output-to-input, calling defineQuantaBackward, and (again breadth-first) input-to-output, calling defineQuantaForward, until one of these methods has been called on all PipelineTasks.
When pure input dataset types are reached, query the registry to test that the needed instances exist, subject to the constraints of any data ID query expressions. Prune any missing datasets from their Quanta.
Call checkQuantum on any Quanta that involve pure input datasets. Prune Quanta for which this returns False, and then prune the output datasets of those Quanta from any Quanta in which they are inputs. Call checkQuantum on the affected Quanta and iterate until no further pruning occurs.
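The final pruning step above is a fixed-point iteration, which can be sketched as follows; prune_graph and the plain-dict quantum representation are illustrative assumptions, not the actual data structures:

```python
def prune_graph(quanta, check):
    """Iteratively prune quanta until a fixed point is reached.

    quanta: list of dicts with "inputs" (dataset type name -> set of refs)
    and "outputs" (set of refs); check(q) -> bool plays the role of the
    task's checkQuantum.
    """
    alive = list(quanta)
    while True:
        dead = [q for q in alive if not check(q)]
        if not dead:
            return alive  # fixed point: every surviving quantum checks out
        # Outputs of pruned quanta will never be produced...
        removed = set().union(*(q["outputs"] for q in dead))
        dead_ids = {id(q) for q in dead}
        alive = [q for q in alive if id(q) not in dead_ids]
        # ...so remove them from the inputs of surviving quanta, which may
        # cause further quanta to fail check() on the next pass.
        for q in alive:
            for name in q["inputs"]:
                q["inputs"][name] = q["inputs"][name] - removed
```

For example, a quantum missing a raw input is pruned first, its unproducible calexp output is removed from downstream quanta, and any downstream quantum left with no usable inputs is pruned on the next pass.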
Implications
More flexibility: by delegating much of QuantumGraph to individual PipelineTasks, we may be able to solve problems we have not anticipated, in addition to those present in jointcal and DECam ISR.
Probably faster queries: this replaces the monolithic and extremely complex multi-join query with many (~per data ID) simpler queries. Critically, the smaller queries should only ever have to do one-way spatial region or temporal interval lookups, rather than two-way spatial/temporal overlap joins. There may be more overhead due to the much larger number of queries, but it will be much harder to confuse the query optimizer and get horrible scaling, as we often seem to do now.
"Empty QuantumGraph" errors will be much easier to debug, as we'll be able to log at every step of the algorithm, making it clear where contradictory constraints appear.
We will have to rely on our own heuristics/timeouts/guards to avoid catastrophically bad decision-making on where to start, instead of (effectively) delegating this to the database query optimizer's heuristics. For example, considering the DRP pipeline, if a user data ID expression constrains the output to a single patch, it would be disastrous to start at the beginning of the pipeline and work forward by first generating single-frame processing quanta for every single visit in LSST DR10. It would be similarly bad to start at the end and work backward by generating coadd-processing quanta for every patch on the sky, if the user expression constrains the inputs to a handful of CCD images. This has been recognized as a fundamental problem and avoiding it has been considered a de facto requirement since the days of the SuperTask working group; it is, more than anything else, responsible for the "big join query" architecture we currently use for QuantumGraph generation. The fundamental assumption that enables this new design is that we can solve this problem via our own heuristics at least as well as the database can. To that end:
"start with datasets constrained by the user and work from there" addresses the vast majority of cases, in which there is only one query expression;
"start with the last dataset(s) constrained by the user" addresses the most common use case with multiple query expressions (specifying visits to be coadded);
"at least one query expression must be provided" seems necessary to avoid enormous queries by accidentally not constraining anything (especially given that Gen2 does nothing if you don't pass any data IDs, even though it iterates over everything else if you give it partial data IDs).
Andy Salnikov , I'd love to get your feedback on this (no rush at all). I've gotten the impression you've had something like this in mind in the past, and I've shot it down out of fear over the last point.
I've described a workaround for this on DM-17169; that would let us run jointcal either in its own Pipeline, or as part of a Pipeline that also includes downstream steps, but not as part of a Pipeline that produces any of its inputs.