# IsrTask should use regular Input for raw data


## Details

• Type: Bug
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points: 1
• Team: Data Release Production
• Urgent?: No

## Description

Currently IsrTask connections are defined with all inputs being of PrerequisiteInput type. It should instead use the regular Input type for at least the "raw" dataset type, to constrain its inputs to only existing datasets; otherwise it will result in all possible combinations of visits/detectors being used.

Slack link: https://lsstc.slack.com/archives/C2JPT1KB7/p1582938474109700
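The difference can be illustrated with a toy model of quantum-graph generation. This is plain Python, not the actual pipe_base/daf_butler API, and the table contents are invented for illustration:

```python
from itertools import product

# Hypothetical dimension tables: every known exposure and detector.
exposures = [419802, 419803]
detectors = [5, 10]

# Raw datasets that actually exist in the repository; exposure 419803
# has no raw for detector 5 (e.g. it failed to read out).
existing_raws = {(419802, 5), (419802, 10), (419803, 10)}

# PrerequisiteInput-style behaviour: quanta are generated for the full
# Cartesian product of the dimension tables, whether or not a raw exists.
prerequisite_quanta = set(product(exposures, detectors))

# Regular-Input-style behaviour: the existence of the "raw" dataset
# constrains the product down to data IDs that can actually be processed.
input_quanta = prerequisite_quanta & existing_raws

print(sorted(prerequisite_quanta))  # 4 quanta, one of which cannot run
print(sorted(input_quanta))         # 3 quanta, all runnable
```

With the prerequisite behaviour, the quantum for (419803, 5) is generated even though no raw exists for it, which is exactly the failure mode this ticket describes.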

## Activity

Andy Salnikov created issue
Assignee: Chris Walter [ cwalter ] → Christopher Waters [ cwaters ]
Christopher Waters added a comment -

From the slack link, I was able to run the command:

  setup lsst_distrib
  setup -vkr /project/krzys001/ap_verify_ci_hits2015/
  setup -vkr /scratch/krzys001/daf_butler/
  setup -vkr /scratch/krzys001/obs_base/
  setup -vkr /scratch/krzys001/obs_decam/
  cd /scratch/krzys001/gen3demo/
  python -m ipdb $CTRL_MPEXEC_DIR/bin/pipetask run \
      --instrument lsst.obs.decam.DarkEnergyCamera \
      --butler-config gen3/butler.yaml \
      --pipeline ${PIPE_TASKS_DIR}/pipelines/ProcessCcd.yaml \
      -C calibrate:config/calibrate.py

This results in an empty quantum graph, which I believe is what I would expect from a call to pipetask with no data constraints. Adding the constraint

 -d 'detector=10 exposure=419802'

results in a quantum graph with three tasks, corresponding to my expectation for ProcessCcd.

I don't see an issue with what is happening. With no constraints, I would expect the graph to iterate over the "exposure" and "detector" table entries, and generate the set of jobs that can be done. Blocking on a missing raw file seems like the correct response to me; the implied request was to process everything, and that cannot be successfully done.

Or is the issue a difference in expectations about what the command should produce?

Andy Salnikov added a comment -

Hmm, repeating exactly the same commands I get:

 Failed to build graph: envelope(): incompatible function arguments. The following argument types are supported:
     1. (self: lsst.sphgeom.pixelization.Pixelization, region: lsst.sphgeom.region.Region, maxRanges: int=0) -> lsst::sphgeom::RangeSet
 Invoked with: HtmPixelization(7), None
 Traceback (most recent call last):
   File "/software/lsstsw/stack_20200220/stack/miniconda3-4.7.12-984c9f7/Linux64/ctrl_mpexec/19.0.0-9-g3f34418+2/bin/pipetask", line 26, in <module>
     sys.exit(CmdLineFwk().parseAndRun())
   File "/software/lsstsw/stack_20200220/stack/miniconda3-4.7.12-984c9f7/Linux64/ctrl_mpexec/19.0.0-9-g3f34418+2/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 156, in parseAndRun
     qgraph = self.makeGraph(pipeline, args)
   File "/software/lsstsw/stack_20200220/stack/miniconda3-4.7.12-984c9f7/Linux64/ctrl_mpexec/19.0.0-9-g3f34418+2/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 340, in makeGraph
     qgraph = graphBuilder.makeGraph(pipeline, inputCollections, outputCollection, args.data_query)
   File "/software/lsstsw/stack_20200220/stack/miniconda3-4.7.12-984c9f7/Linux64/pipe_base/19.0.0-9-g0ae078d+4/python/lsst/pipe/base/graphBuilder.py", line 788, in makeGraph
     skipExisting=self.skipExisting)
   File "/software/lsstsw/stack_20200220/stack/miniconda3-4.7.12-984c9f7/Linux64/pipe_base/19.0.0-9-g0ae078d+4/python/lsst/pipe/base/graphBuilder.py", line 677, in fillQuanta
     expand=True,
   File "/scratch/krzys001/daf_butler/python/lsst/daf/butler/registry/_registry.py", line 1307, in queryDatasets
     query = builder.finish()
   File "/scratch/krzys001/daf_butler/python/lsst/daf/butler/registry/queries/_builder.py", line 358, in finish
     parameters = self._addWhereClause()
   File "/scratch/krzys001/daf_butler/python/lsst/daf/butler/registry/queries/_builder.py", line 289, in _addWhereClause
     for begin, end in dimension.pixelization.envelope(self.summary.dataId.region):
 TypeError: envelope(): incompatible function arguments. The following argument types are supported:
     1. (self: lsst.sphgeom.pixelization.Pixelization, region: lsst.sphgeom.region.Region, maxRanges: int=0) -> lsst::sphgeom::RangeSet
 Invoked with: HtmPixelization(7), None

Are you using the same stack version?

When there is no data constraint I would expect it to produce a graph including all available data, not an empty graph.

Hide
Krzysztof Findeisen added a comment -

Please use a correct command to pipetask:

  setup lsst_distrib
  setup -vkr /project/krzys001/ap_verify_ci_hits2015/
  setup -vkr /scratch/krzys001/daf_butler/
  setup -vkr /scratch/krzys001/obs_base/
  setup -vkr /scratch/krzys001/obs_decam/
  cd /scratch/krzys001/gen3demo/
  python -m ipdb $CTRL_MPEXEC_DIR/bin/pipetask run \
      --input DECam --output processed \
      --instrument lsst.obs.decam.DarkEnergyCamera \
      --butler-config gen3/butler.yaml \
      --pipeline ${PIPE_TASKS_DIR}/pipelines/ProcessCcd.yaml \
      --configfile calibrate:config/calibrate.py \
      --register-dataset-types --clobber-output

Also please note that /scratch/krzys001/gen3demo/gen3 is actively being developed, so I can't guarantee that this issue will always be reproducible with that repository.

 Watchers: Andy Salnikov, Christopher Waters, John Parejko, Krzysztof Findeisen → Andy Salnikov, Christopher Waters, John Parejko, Krzysztof Findeisen, Tim Jenness
John Parejko added a comment -

"Blocking on a missing raw file seems like the correct response to me" - what does "missing" mean in this context? I would have thought it meant that the raw file wasn't ingested, and thus has no references in the registry (as is the case here). If something is in the registry but not on disk, that is a separate problem. In this case, there are no such raws for that visit+detector combination (they've been stripped from the multi-extension FITS file), and they are not in the registry.

Your comment implies that we cannot run the pipeline without providing a list of detectors if e.g. one detector failed to produce data.

 Summary: IsrTaks shoud use regular Input for raw data → IsrTask shoud use regular Input for raw data
Tim Jenness added a comment -

It definitely seems odd to me that we can't say "use whatever detectors we have available" and instead have to choose either all detectors even if they don't exist, or an explicit list of detectors.

Christopher Waters added a comment -

I'm using "missing" here to denote a raw file that does not exist for some element of the Cartesian product of the "exposure" and "detector" tables. This is probably overly restrictive (given that cameras can fail to read out detector data), but hasn't been an issue for me before now. I suspect this behavior for an unconstrained data list was chosen for speed/database reasons.

I hadn't worried about this case before now, because my assumption was that in gen3, everyone would be working from the same complete repos.  I'd support an option to request the butler do what it can instead of working from the dimension tables, but I have no idea what the consequences of that would be.

Jim Bosch added a comment -

Christopher Waters and I discussed this offline, and while I think there are arguments for both sides, my preference is probably to do as this ticket requests, and make raw a regular (non-prerequisite) input of IsrTask.  The big drawback of that approach is that it makes it easy to accidentally (i.e. by leaving off the -d argument entirely) ask for all data in a huge input collection to be processed; the big advantage is that it allows one to define a small collection of inputs to be processed and then use it with no data ID expression at all.  I'd certainly love to have others think on how to avoid the former while permitting the latter.  One possibility is to make the -d option required and have an explicit special expression that means "everything"; while the current approach in which no expression implies "everything" is mathematically natural (given that nontrivial expressions represent constraints, and a lack of constraints implies everything), perhaps practicality should trump naturalness here.
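The "make -d required with an explicit 'everything' token" idea could look something like the following sketch. This is a hypothetical command-line shape, not how pipetask actually works, and the token name "everything" is invented here:

```python
import argparse

# Hypothetical CLI: the data-ID expression is mandatory, and processing
# everything requires typing an explicit token rather than omitting -d.
parser = argparse.ArgumentParser()
parser.add_argument(
    "-d", "--data-query", required=True,
    help="dimension expression, or the literal word 'everything'",
)

args = parser.parse_args(["-d", "everything"])

# An empty query string is what a graph builder would treat as "no
# constraint"; here the user must opt into that explicitly.
query = "" if args.data_query == "everything" else args.data_query
print(repr(query))  # ''
```

The design trade-off is exactly as described above: the mathematically natural convention (absence of constraints means everything) is replaced by a small amount of required ceremony that prevents accidentally launching a full-collection processing run.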

I also think making raw a regular input makes IsrTask feel like it behaves somewhat more like other PipelineTasks (also good), though it's always going to be somewhat special in that raw is never going to be produced by another PipelineTask.

Jim Bosch added a comment -

For a little more background, the Cartesian-product logic that's causing problems here exists because:

- for other dimensions, starting from a Cartesian product is what we want, as it's precisely what lets us generate reasonable output data IDs before the datasets for them ever could exist (e.g. make coadds for all combinations of tract+patch+filter that could be produced from otherwise-constrained inputs);

- I would prefer not to add special-casing to the exposure and/or detector dimensions to make them behave differently, especially because we already have a dataset (raw) whose existence constrains the Cartesian product down to what actually exists in the repository and collection.
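The first point can be sketched in plain Python (the tract/patch/filter values are invented; this is not the actual graph-builder code): output data IDs such as coadds are exactly the Cartesian product of the constrained dimension values, even though none of those output datasets exist yet.

```python
from itertools import product

# Dimension values surviving some user-supplied constraint (made up).
tracts = [9813]
patches = [(0, 0), (0, 1), (1, 0)]
filters = ["g", "r"]

# Coadd output data IDs: the full product is exactly what we want to
# generate, since the coadds do not exist until the pipeline makes them.
coadd_data_ids = [
    {"tract": t, "patch": p, "band": f}
    for t, p, f in product(tracts, patches, filters)
]
print(len(coadd_data_ids))  # 1 * 3 * 2 = 6
```

For raw, by contrast, the dataset already exists (or doesn't), so its existence can and should prune the product, which is what the second point relies on.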

Krzysztof Findeisen added a comment -

> The big drawback of that approach is that it makes it easy to accidentally... ask for all data in a huge input collection to be processed; the big advantage is that it allows one to define a small collection of inputs to be processed and then use it with no data ID expression at all. I'd certainly love to have others think on how to avoid the former while permitting the latter.

I suggest putting this question to an RFC. In my (also limited) experience we usually do want to process datasets in bulk, and one of the big selling points of Gen 3 was the claim that we would not need to configure pipelines with long lists of data IDs, like we sometimes need to in Gen 2.

Christopher Waters added a comment -

I have a feeling that I'm the only major opponent to the PrerequisiteInput -> Input migration, and so will implement this ticket without an RFC.  I have also filed DM-23765 to point out that this can create unwanted massive processing jobs.

 Link: This issue is triggering DM-23765
 Status: To Do → In Progress
 Reviewers: Andy Salnikov
 Status: In Progress → In Review
Andy Salnikov added a comment -

 Status: In Review → Reviewed
 Resolution: Done
 Status: Reviewed → Done
 Story Points: 1
 Description: "IstTask" corrected to "IsrTask" (text otherwise unchanged, including Slack link: https://lsstc.slack.com/archives/C2JPT1KB7/p1582938474109700)
 Epic Link: DM-22586
 Team: Data Release Production

## People

• Assignee: Christopher Waters
• Reporter: Andy Salnikov
• Reviewers: Andy Salnikov
• Watchers: Andy Salnikov, Christopher Waters, Jim Bosch, John Parejko, Krzysztof Findeisen, Tim Jenness