Data Management / DM-23139

Check whether the gen3 butler pre-filters template inputs, and defer their loading if not

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ip_diffim
    • Labels:
      None

      Description

      The currently available ci_hsc gen3 repo contains only a very limited number of template images; however, a production gen3 collection may contain many templates covering large sky areas.

       

      As a follow-up to the initial gen3 support in ImageDifferenceTask (DM-22541), check whether the number of template inputs can become huge in a production environment, or whether gen3 automatically filters them by matching patch spatial information against the science image coordinates.

       

      If all available template images may be loaded as task inputs, add support for deferred loading of the template coadd images to avoid unnecessary I/O and save resources: load an input only at a later stage, and only if it belongs to a relevant skymap patch. If the coadd images provided as inputs are already pre-filtered at the middleware level, then we probably do not need to implement any such cautious approach in ImageDifferenceTask.
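      A minimal sketch of the deferred-loading idea, in plain Python rather than the gen3 middleware API. The class `DeferredTemplate`, the function `load_relevant_templates`, and the (tract, patch) key layout are hypothetical names introduced only for illustration; the point is that nothing is read until a handle is known to belong to a relevant patch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set, Tuple

PatchKey = Tuple[int, int]  # (tract, patch); hypothetical key layout


@dataclass
class DeferredTemplate:
    """Hypothetical stand-in for a deferred-load handle.

    It carries the identifying keys and a loader, and performs no I/O
    until get() is called.
    """
    tract: int
    patch: int
    loader: Callable[[], str]

    def get(self) -> str:
        # The (simulated) read happens only here.
        return self.loader()


def load_relevant_templates(
    handles: Iterable[DeferredTemplate],
    relevant_patches: Set[PatchKey],
) -> Dict[PatchKey, str]:
    """Load only the templates whose (tract, patch) is relevant."""
    return {
        (h.tract, h.patch): h.get()
        for h in handles
        if (h.tract, h.patch) in relevant_patches
    }


# Record which templates are actually read.
reads: List[str] = []


def make_loader(name: str) -> Callable[[], str]:
    def _load() -> str:
        reads.append(name)
        return f"pixels of {name}"
    return _load


handles = [DeferredTemplate(0, p, make_loader(f"coadd-0-{p}")) for p in range(4)]
loaded = load_relevant_templates(handles, {(0, 1), (0, 2)})
print(sorted(loaded), reads)
```

      Four handles are passed in, but only the two matching the relevant patches are read. If the middleware already pre-filters the inputs, this extra layer adds nothing.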

       

            Activity

            swinbank John Swinbank added a comment -

            “On the other”.... what?

            gkovacs Gabor Kovacs added a comment -

            We discussed this issue at the gen3 middleware meeting on 2020-01-23. The current understanding is that the middleware automatically recognizes spatial relationships among inputs, so not all patches are loaded and provided as input.

            The parts of the question that remain open:

            • Is it actually possible to have multiple coadds available in a collection for the same sky area in the same tiling (same skymap)? If yes, how can two coadds be distinguished? The current dimensions "tract", "patch", "skymap", "abstract_filter" do not constrain this.
            • If yes, can we realistically assume that so many coadds will be available in the production environment that it is impractical or unaffordable to load all matching patches for one calexp?
            gkovacs Gabor Kovacs added a comment - - edited

            Summary from Slack:

            Q: Can multiple coadds be stored in one collection?
            The current dimensions "tract", "patch", "skymap", "abstract_filter" do not constrain this.

            A (Jim Bosch): I believe in this case it would find all matching datasets in the first collection (each of which must have a different data ID), then find all matching datasets in the next collection whose data IDs don't conflict with those already found, and then repeat that process until all collections have been searched.

            I would not recommend actually using it this way; this is an excellent example of why the multiple-input-collections option is primarily intended for cases where the collections hold completely disjoint dataset types.

            • Within one collection, there cannot be two instances of the same dataset type for which all dimensions are equal. The templating makes it easy to define different dataset types, for example marking the creation date or another feature of a coadd. With different dataset type names, they can coexist in one collection.
            • The purpose of specifying multiple collections to a pipeline is to keep the different dataset types separated (e.g. science calexps, coadds, reference catalogs).
            • If the same dataset type exists in several of the given collections (which should be avoided), the first instance found is read as input; for multiple inputs, one instance is read for each data ID described by the dimension keys. For coadds this is still problematic if two instances do not cover exactly the same (tract, patch) pairs, so it must be avoided. (Otherwise, as I understand it, the search order of the collections is deterministic.)
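            The lookup order described in the answer above can be sketched as follows. This is a plain-Python illustration, not the butler registry API: collections are modelled as dictionaries keyed by a data ID tuple, and the function name `find_datasets`, the skymap name, and the dataset labels are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical data ID layout: (skymap, tract, patch, abstract_filter)
DataId = Tuple[str, int, int, str]
Collection = Dict[DataId, str]


def find_datasets(collections: List[Collection]) -> Dict[DataId, str]:
    """Search collections in order; the first match for a data ID wins."""
    found: Dict[DataId, str] = {}
    for collection in collections:
        for data_id, dataset in collection.items():
            # Keep a dataset only if its data ID does not conflict
            # with one already found in an earlier collection.
            found.setdefault(data_id, dataset)
    return found


# Two coadds of the same dataset type that cover different patch sets.
first = {
    ("example_skymap", 0, 1, "r"): "coadd A, patch 1",
    ("example_skymap", 0, 2, "r"): "coadd A, patch 2",
}
second = {
    ("example_skymap", 0, 2, "r"): "coadd B, patch 2",
    ("example_skymap", 0, 3, "r"): "coadd B, patch 3",
}

result = find_datasets([first, second])
for data_id, dataset in sorted(result.items()):
    print(data_id, "->", dataset)
```

            In this toy case the result takes patches 1 and 2 from coadd A and patch 3 from coadd B, i.e. a template assembled from two different coadds. This is the situation the last bullet warns about when the two instances do not cover exactly the same (tract, patch) pairs.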
            gkovacs Gabor Kovacs added a comment -

            The bottom line is that we don't need to implement anything at the moment. If the need to choose among many coadds emerges, we will probably need a new dimension to distinguish the coadd instances.

            gkovacs Gabor Kovacs added a comment -

            I think this can now be closed. Please confirm.

            swinbank John Swinbank added a comment -


              People

              • Assignee:
                gkovacs Gabor Kovacs
                Reporter:
                gkovacs Gabor Kovacs
                Reviewers:
                John Swinbank
                Watchers:
                Gabor Kovacs, John Swinbank