Data Management / DM-23702

IsrTask should use regular Input for raw data

Description

Currently the IsrTask connections are defined with all inputs being of PrerequisiteInput type. It should instead use the regular Input type for at least the "raw" dataset type, to constrain its inputs to only the raw datasets that actually exist; otherwise quantum generation will use all possible combinations of visits/detectors.

      Slack link: https://lsstc.slack.com/archives/C2JPT1KB7/p1582938474109700
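A minimal sketch of the change being requested. These are stand-in classes, not the real `lsst.pipe.base.connectionTypes` API; names and fields are illustrative only:

```python
from dataclasses import dataclass

# Stand-in sketches for the two connection types discussed in this ticket.
# The real classes live in lsst.pipe.base; these are illustrative only.
@dataclass(frozen=True)
class Input:
    """Regular input: its existing datasets constrain quantum generation."""
    name: str
    dimensions: tuple

@dataclass(frozen=True)
class PrerequisiteInput(Input):
    """Prerequisite input: looked up per quantum, does not constrain the graph."""

# Before this ticket (sketch): "raw" was a PrerequisiteInput, so nothing
# constrained the exposure/detector combinations used for quanta.
raw_before = PrerequisiteInput(name="raw",
                               dimensions=("instrument", "exposure", "detector"))

# After this ticket (sketch): "raw" becomes a regular Input, so only
# combinations with an existing raw dataset are processed.
raw_after = Input(name="raw",
                  dimensions=("instrument", "exposure", "detector"))
```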

Activity

Andy Salnikov added a comment -

            Looks good, no comments.

Christopher Waters added a comment -

            I have a feeling that I'm the only major opponent to the `PrerequisiteInput` -> `Input` migration, and so will implement this ticket without an RFC.  I have also filed DM-23765 to point out that this can create unwanted massive processing jobs.  

Krzysztof Findeisen added a comment -

Quoting Jim Bosch: "The big drawback of that approach is that it makes it easy to accidentally... ask for all data in a huge input collection to be processed; the big advantage is that it allows one to define a small collection of inputs to be processed and then use it with no data ID expression at all. I'd certainly love to have others think on how to avoid the former while permitting the latter."

            I suggest putting this question to an RFC. In my (also limited) experience we usually do want to process datasets in bulk, and one of the big selling points of Gen 3 was the claim that we would not need to configure pipelines with long lists of data IDs, like we sometimes need to in Gen 2.

Jim Bosch added a comment -

            For a little more background, the Cartesian-product logic that's causing problems here exists because:

             - for other dimensions, starting from a Cartesian product is what we want, as it's precisely what lets us generate reasonable output data IDs before the datasets for them ever could exist (e.g. make coadds for all combinations of tract+patch+filter that could be produced from otherwise-constrained inputs);

             - I would prefer not to add special-casing to the exposure and/or detector dimensions to make them behave differently, especially because we already have a dataset (raw) whose existence constrains the Cartesian product down to what actually exists in the repository and collection.
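A toy illustration of the difference described above. The repository contents and data IDs here are made up; the point is only how the existence of a raw dataset prunes the Cartesian product:

```python
import itertools

# Hypothetical toy repository: 3 exposures x 2 detectors exist as dimension
# records, but raw datasets exist for only 2 of the 6 combinations.
exposures = [100, 101, 102]
detectors = [0, 1]
existing_raws = {(100, 0), (101, 1)}  # (exposure, detector) pairs with a raw

# PrerequisiteInput-style graph generation: nothing constrains the product,
# so every exposure/detector combination becomes a quantum.
unconstrained = list(itertools.product(exposures, detectors))

# Input-style generation: the existence of the raw dataset constrains the
# product down to what is actually in the repository and collection.
constrained = [q for q in itertools.product(exposures, detectors)
               if q in existing_raws]

print(len(unconstrained), len(constrained))  # 6 vs 2
```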

             

Jim Bosch added a comment -

            Christopher Waters and I discussed this offline, and while I think there are arguments for both sides, my preference is probably to do as this ticket requests, and make raw a regular (non-prerequisite) input of IsrTask.  The big drawback of that approach is that it makes it easy to accidentally (i.e. by leaving off the -d argument entirely) ask for all data in a huge input collection to be processed; the big advantage is that it allows one to define a small collection of inputs to be processed and then use it with no data ID expression at all.  I'd certainly love to have others think on how to avoid the former while permitting the latter.  One possibility is to make the -d option required and have an explicit special expression that means "everything"; while the current approach in which no expression implies "everything" is mathematically natural (given that nontrivial expressions represent constraints, and a lack of constraints implies everything), perhaps practicality should trump naturalness here.

            I also think making raw a regular input makes IsrTask feel like it behaves somewhat more like other PipelineTasks (also good), though it's always going to be somewhat special in that raw is never going to be produced by another PipelineTask.
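The "required -d with an explicit everything sentinel" idea above could look something like the following. This is a hypothetical command-line sketch, not the real pipeline driver; the flag name and the `everything` sentinel are illustrative:

```python
import argparse

# Sketch: make the data-ID expression flag required, with an explicit
# sentinel meaning "process everything", instead of an omitted flag
# silently implying everything.
parser = argparse.ArgumentParser()
parser.add_argument(
    "-d", "--data-query", required=True,
    help="Data ID expression, or the literal 'everything' to process all inputs",
)

# Normal use: an explicit constraint.
args = parser.parse_args(["-d", "exposure IN (100, 101)"])
process_all = args.data_query == "everything"

# Processing everything now requires a deliberate choice.
args_all = parser.parse_args(["-d", "everything"])
```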

             


People

• Assignee: Christopher Waters
• Reporter: Andy Salnikov
• Reviewers: Andy Salnikov
• Watchers: Andy Salnikov, Christopher Waters, Jim Bosch, John Parejko, Krzysztof Findeisen, Tim Jenness