Data Management / DM-23702

IsrTask should use regular Input for raw data

    Details

      Description

      Currently the IsrTask connections are defined with all inputs being of PrerequisiteInput type. It should instead use the regular Input type for at least the "raw" dataset to constrain its inputs to only existing data; otherwise, all possible combinations of visits/detectors will be used.

      Slack link: https://lsstc.slack.com/archives/C2JPT1KB7/p1582938474109700
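      The distinction at issue can be sketched with stand-in classes. This is a toy model of the two connection types, not the real lsst.pipe.base.connectionTypes API; the `constrains_graph` attribute is invented here purely to illustrate the behavioral difference:

```python
# Toy model of the two connection types; these stand-in classes are NOT
# the real lsst.pipe.base.connectionTypes API, and constrains_graph is a
# made-up attribute illustrating the behavioral difference.
from dataclasses import dataclass


@dataclass(frozen=True)
class Input:
    """A regular input: its existing datasets constrain the quantum graph."""
    name: str
    dimensions: tuple
    constrains_graph: bool = True


@dataclass(frozen=True)
class PrerequisiteInput(Input):
    """A prerequisite input: looked up per quantum, does not constrain the graph."""
    constrains_graph: bool = False


# Before this ticket: raw declared as a prerequisite, so the graph builder
# falls back to the full exposure x detector Cartesian product.
raw_before = PrerequisiteInput("raw", ("instrument", "exposure", "detector"))

# After this ticket: raw is a regular Input, so only exposure/detector
# combinations with an existing raw dataset generate quanta.
raw_after = Input("raw", ("instrument", "exposure", "detector"))
```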

            Activity

            salnikov Andy Salnikov created issue -
            czw Christopher Waters made changes -
            Field Original Value New Value
            Assignee Chris Walter [ cwalter ] Christopher Waters [ cwaters ]
            czw Christopher Waters added a comment -

            From the slack link, I was able to run the command:

             setup lsst_distrib
             setup -vkr /project/krzys001/ap_verify_ci_hits2015/
             setup -vkr /scratch/krzys001/daf_butler/
             setup -vkr /scratch/krzys001/obs_base/
             setup -vkr /scratch/krzys001/obs_decam/
             cd /scratch/krzys001/gen3demo/
             python -m ipdb $CTRL_MPEXEC_DIR/bin/pipetask run --instrument lsst.obs.decam.DarkEnergyCamera --butler-config gen3/butler.yaml --pipeline ${PIPE_TASKS_DIR}/pipelines/ProcessCcd.yaml -C calibrate:config/calibrate.py

            This results in an empty quantum graph, which I believe I would expect from a call to pipetask with no data constraints.  Adding the constraint

            -d 'detector=10 exposure=419802'

            results in a quantum graph with three tasks, corresponding to my expectation for ProcessCcd.

            I don't see an issue with what is happening.  With no constraints, I would expect the graph to iterate over the "exposure" and "detector" table entries, and generate the set of jobs that can be done.  Blocking on a missing raw file seems like the correct response to me;  the implied request was to process everything, and that cannot be successfully done.

            Is there an issue of a difference in expectations in the command results?

             

            salnikov Andy Salnikov added a comment -

            Hmm, repeating exactly the same commands I get:

            Failed to build graph: envelope(): incompatible function arguments. The following argument types are supported:
                1. (self: lsst.sphgeom.pixelization.Pixelization, region: lsst.sphgeom.region.Region, maxRanges: int=0) -> lsst::sphgeom::RangeSet
             
            Invoked with: HtmPixelization(7), None
            Traceback (most recent call last):
              File "/software/lsstsw/stack_20200220/stack/miniconda3-4.7.12-984c9f7/Linux64/ctrl_mpexec/19.0.0-9-g3f34418+2/bin/pipetask", line 26, in <module>
                sys.exit(CmdLineFwk().parseAndRun())
              File "/software/lsstsw/stack_20200220/stack/miniconda3-4.7.12-984c9f7/Linux64/ctrl_mpexec/19.0.0-9-g3f34418+2/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 156, in parseAndRun
                qgraph = self.makeGraph(pipeline, args)
              File "/software/lsstsw/stack_20200220/stack/miniconda3-4.7.12-984c9f7/Linux64/ctrl_mpexec/19.0.0-9-g3f34418+2/python/lsst/ctrl/mpexec/cmdLineFwk.py", line 340, in makeGraph
                qgraph = graphBuilder.makeGraph(pipeline, inputCollections, outputCollection, args.data_query)
              File "/software/lsstsw/stack_20200220/stack/miniconda3-4.7.12-984c9f7/Linux64/pipe_base/19.0.0-9-g0ae078d+4/python/lsst/pipe/base/graphBuilder.py", line 788, in makeGraph
                skipExisting=self.skipExisting)
              File "/software/lsstsw/stack_20200220/stack/miniconda3-4.7.12-984c9f7/Linux64/pipe_base/19.0.0-9-g0ae078d+4/python/lsst/pipe/base/graphBuilder.py", line 677, in fillQuanta
                expand=True,
              File "/scratch/krzys001/daf_butler/python/lsst/daf/butler/registry/_registry.py", line 1307, in queryDatasets
                query = builder.finish()
              File "/scratch/krzys001/daf_butler/python/lsst/daf/butler/registry/queries/_builder.py", line 358, in finish
                parameters = self._addWhereClause()
              File "/scratch/krzys001/daf_butler/python/lsst/daf/butler/registry/queries/_builder.py", line 289, in _addWhereClause
                for begin, end in dimension.pixelization.envelope(self.summary.dataId.region):
            TypeError: envelope(): incompatible function arguments. The following argument types are supported:
                1. (self: lsst.sphgeom.pixelization.Pixelization, region: lsst.sphgeom.region.Region, maxRanges: int=0) -> lsst::sphgeom::RangeSet
             
            Invoked with: HtmPixelization(7), None
            

            Do you use the same stack version?

            When there is no data constraint I would expect it to produce a graph including all available data, not an empty graph.

            krzys Krzysztof Findeisen added a comment -

            Please use the correct pipetask command:

            setup lsst_distrib
            setup -vkr /project/krzys001/ap_verify_ci_hits2015/
            setup -vkr /scratch/krzys001/daf_butler/
            setup -vkr /scratch/krzys001/obs_base/
            setup -vkr /scratch/krzys001/obs_decam/
            cd /scratch/krzys001/gen3demo/
            python -m ipdb $CTRL_MPEXEC_DIR/bin/pipetask run --input DECam --output processed --instrument lsst.obs.decam.DarkEnergyCamera --butler-config gen3/butler.yaml --pipeline ${PIPE_TASKS_DIR}/pipelines/ProcessCcd.yaml --configfile calibrate:config/calibrate.py --register-dataset-types --clobber-output
            

            Also please note that /scratch/krzys001/gen3demo/gen3 is actively being developed, so I can't guarantee that this issue will always be reproducible with that repository.

            Parejkoj John Parejko made changes -
            Watchers Andy Salnikov, Christopher Waters, John Parejko, Krzysztof Findeisen [ Andy Salnikov, Christopher Waters, John Parejko, Krzysztof Findeisen ] Andy Salnikov, Christopher Waters, John Parejko, Krzysztof Findeisen, Tim Jenness [ Andy Salnikov, Christopher Waters, John Parejko, Krzysztof Findeisen, Tim Jenness ]
            Parejkoj John Parejko added a comment -

            "Blocking on a missing raw file seems like the correct response to me" - what does "missing" mean in this context? I would have thought it meant that the raw file wasn't ingested, and thus has no references in the registry (as is the case here). If something is in the registry but not on disk, that is a separate problem. In this case, there are no such raws for that visit+detector combination (they've been stripped from the multi-extension FITS file), and they are not in the registry.

            Your comment implies that we cannot run the pipeline without providing a list of detectors if e.g. one detector failed to produce data.

            tjenness Tim Jenness made changes -
            Summary IsrTaks shoud use regular Input for raw data IsrTask shoud use regular Input for raw data
            tjenness Tim Jenness added a comment -

            It definitely seems odd to me that we can't say "use whatever detectors we have available" and instead have to choose either all detectors even if they don't exist, or an explicit list of detectors.

            czw Christopher Waters added a comment -

            I'm using "missing" here to denote a raw file as missing if an element of the Cartesian product of the "exposure" and "detector"  tables does not exist.  This is probably overly restrictive (given that cameras can fail to read out detector data), but hasn't been an issue for me before now.  I suspect this behavior for an unconstrained data list was chosen for speed/database reasons.

            I hadn't worried about this case before now, because my assumption was that in gen3, everyone would be working from the same complete repos.  I'd support an option to request the butler do what it can instead of working from the dimension tables, but I have no idea what the consequences of that would be.

            jbosch Jim Bosch added a comment -

            Christopher Waters and I discussed this offline, and while I think there are arguments for both sides, my preference is probably to do as this ticket requests, and make raw a regular (non-prerequisite) input of IsrTask.  The big drawback of that approach is that it makes it easy to accidentally (i.e. by leaving off the -d argument entirely) ask for all data in a huge input collection to be processed; the big advantage is that it allows one to define a small collection of inputs to be processed and then use it with no data ID expression at all.  I'd certainly love to have others think on how to avoid the former while permitting the latter.  One possibility is to make the -d option required and have an explicit special expression that means "everything"; while the current approach in which no expression implies "everything" is mathematically natural (given that nontrivial expressions represent constraints, and a lack of constraints implies everything), perhaps practicality should trump naturalness here.

            I also think making raw a regular input makes IsrTask feel like it behaves somewhat more like other PipelineTasks (also good), though it's always going to be somewhat special in that raw is never going to be produced by another PipelineTask.

             

            jbosch Jim Bosch added a comment -

            For a little more background, the Cartesian-product logic that's causing problems here exists because:

             - for other dimensions, starting from a Cartesian product is what we want, as it's precisely what lets us generate reasonable output data IDs before the datasets for them ever could exist (e.g. make coadds for all combinations of tract+patch+filter that could be produced from otherwise-constrained inputs);

             - I would prefer not to add special-casing to the exposure and/or detector dimensions to make them behave differently, especially because we already have a dataset (raw) whose existence constrains the Cartesian product down to what actually exists in the repository and collection.
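            The behavior described above can be illustrated with a toy model (a sketch of the idea only, not the actual daf_butler registry/query logic; the exposure and detector values are examples taken from this thread):

```python
# Toy illustration of the graph-building behavior discussed above
# (a sketch, not the actual daf_butler registry/query logic).
from itertools import product

exposures = [419802, 419803]
detectors = [10, 11, 12]

# Raw datasets actually ingested; (419803, 12) is hypothetically missing,
# e.g. because that detector was stripped from the multi-extension FITS file.
existing_raws = {
    (419802, 10), (419802, 11), (419802, 12),
    (419803, 10), (419803, 11),
}

# With raw as a PrerequisiteInput, the builder starts from the full
# Cartesian product of the exposure and detector tables...
cartesian = set(product(exposures, detectors))

# ...so the one missing raw blocks graph generation for its data ID.
blocked = cartesian - existing_raws

# With raw as a regular Input, only data IDs whose raw dataset exists
# in the input collection produce quanta.
quanta = cartesian & existing_raws
```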

             

            krzys Krzysztof Findeisen added a comment -

            The big drawback of that approach is that it makes it easy to accidentally... ask for all data in a huge input collection to be processed; the big advantage is that it allows one to define a small collection of inputs to be processed and then use it with no data ID expression at all. I'd certainly love to have others think on how to avoid the former while permitting the latter.

            I suggest putting this question to an RFC. In my (also limited) experience we usually do want to process datasets in bulk, and one of the big selling points of Gen 3 was the claim that we would not need to configure pipelines with long lists of data IDs, like we sometimes need to in Gen 2.

            czw Christopher Waters added a comment -

            I have a feeling that I'm the only major opponent to the `PrerequisiteInput` -> `Input` migration, and so will implement this ticket without an RFC.  I have also filed DM-23765 to point out that this can create unwanted massive processing jobs.  

            czw Christopher Waters made changes -
            Link This issue is triggering DM-23765 [ DM-23765 ]
            czw Christopher Waters made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            czw Christopher Waters made changes -
            Reviewers Andy Salnikov [ salnikov ]
            Status In Progress [ 3 ] In Review [ 10004 ]
            salnikov Andy Salnikov added a comment -

            Looks good, no comments.

            salnikov Andy Salnikov made changes -
            Status In Review [ 10004 ] Reviewed [ 10101 ]
            czw Christopher Waters made changes -
            Resolution Done [ 10000 ]
            Status Reviewed [ 10101 ] Done [ 10002 ]
            czw Christopher Waters made changes -
            Story Points 1
            swinbank John Swinbank made changes -
            Description Currently IstTask connection are defined with all inputs being of PrerequisiteInput type. It should instead be using regular Input type for at least a "raw" type top constrain its inputs to only existing inputs, otherwise it will result in all possible combinations of visits/detectors being used.

            Slack link: https://lsstc.slack.com/archives/C2JPT1KB7/p1582938474109700
            Currently IsrTask connection are defined with all inputs being of PrerequisiteInput type. It should instead be using regular Input type for at least a "raw" type top constrain its inputs to only existing inputs, otherwise it will result in all possible combinations of visits/detectors being used.

            Slack link: https://lsstc.slack.com/archives/C2JPT1KB7/p1582938474109700
            swinbank John Swinbank made changes -
            Epic Link DM-22586 [ 427653 ]
            swinbank John Swinbank made changes -
            Team Data Release Production [ 10301 ]

              People

              • Assignee:
                czw Christopher Waters
                Reporter:
                salnikov Andy Salnikov
                Reviewers:
                Andy Salnikov
                Watchers:
                Andy Salnikov, Christopher Waters, Jim Bosch, John Parejko, Krzysztof Findeisen, Tim Jenness
              • Votes:
                0
                Watchers:
                6
