RFC-825

Storage concerns for deblender outputs


    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      When DM-32436 is implemented, ScarletDeblendTask will once again return two different sets of output catalogs: one is a set of catalogs, one per band, using the scarlet models (convolved to the observed seeing), and the other is a set of catalogs, one per band, that uses the scarlet models to re-weight the observed images so that flux is conserved (similar to the SDSS and old HSC deblender outputs). The two sets of catalogs are identical except for their heavy footprints (even though the spansets for each footprint are identical).

      The reviewer for DM-32436 brought up the excellent point that this is redundant information: since the catalog with the scarlet models can easily be used to generate the conserved-flux footprints, it might not warrant doubling the output from the deblender. So I'm wondering if we should continue to output just the scarlet models and have downstream measurement tasks contain a config parameter to re-weight the deblender outputs after the catalog has been loaded. The main counter-argument that I can imagine is that if we have to do this multiple times it could become computationally expensive, so I'm curious if anyone knows how frequently we load catalogs from disk in the DRP pipeline, for example.

      Can anyone think of any other arguments for or against producing a single deblender model catalog in each band and doing the re-weighting when the data is loaded (if so desired)?
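      For concreteness, the flux-conserving re-weighting described above can be sketched in a few lines: each child's (seeing-convolved) model acts as a template, and the observed flux in each pixel is divided among the children in proportion to their template values. This is a toy 1D illustration under stated assumptions, not stack code; `reweight_flux` and the arrays are purely illustrative.

```python
import numpy as np

def reweight_flux(image, templates, eps=1e-12):
    """Redistribute observed flux among deblended children.

    Each child's model (already convolved to the observed seeing) is used
    as a template: the flux in each pixel of the observed image is divided
    among the children in proportion to their template values, so the
    children sum exactly to the observed image wherever any template is
    nonzero (SDSS / old-HSC-style flux conservation).
    """
    templates = np.asarray(templates, dtype=float)
    total = templates.sum(axis=0)
    # Avoid dividing by zero in pixels that no model claims.
    weights = np.where(total > eps, templates / np.maximum(total, eps), 0.0)
    return weights * image

# Toy 1D "image" containing two overlapping sources.
image = np.array([0.0, 2.0, 4.0, 2.0, 0.0])
t1 = np.array([0.0, 1.0, 1.0, 0.0, 0.0])
t2 = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
children = reweight_flux(image, [t1, t2])
```

      By construction the children sum back to the observed image in every pixel claimed by at least one template, which is the sense in which flux is "conserved".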

            Activity

            fred3m Fred Moolekamp added a comment -

            Dan Taranu brought up a good point in a discussion the other day, which is that right now we are storing only the convolved scarlet models. The problem with this is that if anyone wants to make measurements on the deconvolved models they are SOL, since those are discarded. One motivation for storing the scarlet deconvolved models is for future processing. For example, in DR2 we will likely reach faster convergence using the DR1 models for sources (when possible) since they will already be deconvolved and the galaxy models shouldn't change much except in the low SNR regions. As far as I can tell, the only task that loads the deblender models in DRP is lsst.pipe.tasks.multiBand.MeasureMergedCoaddSourcesTask, so it looks to me like convolutions and re-distributing flux based on scarlet models would only be done once, if at all.

            So I want to propose the following:

            • The deblender will output a single catalog (deblendedModel) that has an entry for each component. Each component record will have a 1D SED array column (it already exists) and a 2D morphology HeavyFootprint that is the deconvolved scarlet model for the component. There will also be an entry in the catalog for each source, where a source is a parent of one or more components. The sources will have a parent that is the parent of the entire blend, and each component will have a parent that is the source containing that component.
            • A SourceCatalog.renderCatalog(Exposure exposure, bool conserveFlux) method will be created to convert the footprints in the deblendedModel catalog into a catalog for a given band, where exposure is the single band exposure to render the models into (needed for its PSF) and conserveFlux is a boolean value that determines whether or not flux should be re-distributed from the exposure using the models as templates. An isRendered property will be added to SourceCatalog to let the user know if the catalog has already been rendered and prevent multiple convolutions, and an isConserved property will let the user know if the catalog contains raw models or flux-conserved models.
            • The deblendedModel catalog will be the input for lsst.pipe.tasks.multiBand.MeasureMergedCoaddSourcesTask, which will have config options to specify whether or not to convolveScarletModels or conserveFlux and will call deblendedModel.renderCatalog. Once the rendered catalog has been loaded into memory it can be passed to the measurement algorithms as necessary.

            Thoughts?
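            To make the proposed behavior concrete, here is a minimal, self-contained sketch of the interface described in the bullets above. None of these names (`renderCatalog`, `isRendered`, `isConserved`) exist in afw yet; to keep the example runnable, the "exposure" is a bare 1D array, the "PSF" a 1D kernel, and `ModelCatalogSketch` a hypothetical stand-in for SourceCatalog.

```python
import numpy as np

class ModelCatalogSketch:
    """Toy stand-in for the proposed SourceCatalog.renderCatalog API."""

    def __init__(self, models):
        # Deconvolved per-component scarlet models (here: 1D arrays).
        self.models = [np.asarray(m, dtype=float) for m in models]
        self.isRendered = False
        self.isConserved = False

    def renderCatalog(self, exposure, psf, conserveFlux):
        if self.isRendered:
            # Guard against accidentally convolving twice.
            raise RuntimeError("catalog has already been rendered")
        # Convolve each deconvolved model with the exposure's PSF.
        rendered = [np.convolve(m, psf, mode="same") for m in self.models]
        if conserveFlux:
            # Re-distribute the observed flux using the convolved models
            # as templates (SDSS-style flux conservation).
            total = np.sum(rendered, axis=0)
            safe = np.where(total > 0, total, 1.0)
            rendered = [exposure * (r / safe) for r in rendered]
        self.models = rendered
        self.isRendered = True
        self.isConserved = conserveFlux
        return self

# Two point-like components, rendered with flux conservation.
psf = np.array([0.25, 0.5, 0.25])
catalog = ModelCatalogSketch([[0.0, 4.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]])
exposure = np.array([1.0, 2.0, 2.0, 1.0])
catalog.renderCatalog(exposure, psf, conserveFlux=True)
```

            The key design point the sketch captures is that rendering is a one-way, in-memory operation: the `isRendered` flag prevents a second convolution, and `isConserved` records which flavor of footprint the catalog now holds.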

            ktl Kian-Tat Lim added a comment -

            Using previous-DR outputs to "seed" the current DR is potentially fraught with provenance and reproducibility problems, e.g. if a DAC offers only the latest DR. It also has the potential for locking in incorrect local minima that might otherwise be avoided with more data. So we should be very careful before using this as a performance enhancement.

            fred3m Fred Moolekamp added a comment -

            I hadn't thought about the provenance issue but I did think about the local minima. I'm not too concerned about that because we don't have to use the exact model from the previous DR (and wouldn't want to for a number of reasons). But the key issue with initialization is that we are initializing a scarlet model in a partially deconvolved space with a set of observed images, so our initial images are always puffy. So my thought is that we can use the previous models with a sparsity constraint to correctly model the centers of sources as an effective deconvolution, with perhaps some noise on both the morphology and spectral arrays to allow them more flexibility to move out of local minima. But I agree that this has to be done with care and compared to results using models initialized from scratch to make sure that we aren't biasing our results.

            jbosch Jim Bosch added a comment - - edited

            I'm in favor of the new proposal, and the deconvolved models really feel like the only deblender outputs we should consider storing for delivery to science users (I think making deblender outputs a purely virtual dataset, i.e. something we always recompute from the image and catalog upon request, is also still in play).

            Like Kian-Tat Lim, I am very skeptical of using those across DRs; note that we won't even have the same set of proto-objects detected, so matching would be problematic, even if provenance and local minima were not (and I think they are a problem from the perspective of exact reproducibility, even if they aren't necessarily a problem from the perspective of scientific biases). But storing the source of truth is important in its own right, and may be relevant for downstream pipeline steps, like projecting deblending to short-period coadds for forced photometry. And I bet those deblended models will be (or can be made to be) much smaller than any of the PSF-convolved derived HeavyFootprints.

            All that doesn't mean saving derived HeavyFootprints is out of the question - I think it's pretty likely that we'll want them as intermediates to avoid recomputing repeatedly in multiple downstream tasks. But I think we should try to think of that as an optimization in CPU/storage tradeoff space, so let's start without saving them and see how bad the CPU costs are.

            fred3m Fred Moolekamp added a comment -

            We can talk more about processing across DRs another time. I'll just note that one good way to do source detection in future DRs might be to subtract all of the scarlet models from the previous DR and look for missing sources in the residuals in a type of difference imaging. Of course there are complications to this but it still might be worth looking into.
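            Schematically, that residual-based detection idea amounts to subtracting the prior-DR models and thresholding what is left. The sketch below is a deliberately crude 1D toy (flat threshold, no PSF matching, no noise correlations, no astrometric registration); `residual_detections` is purely illustrative.

```python
import numpy as np

def residual_detections(image, prior_models, threshold):
    """Subtract previous-DR models and flag pixels with significant
    leftover flux: a toy version of the difference-imaging idea."""
    residual = image - np.sum(prior_models, axis=0)
    return np.flatnonzero(residual > threshold)

# Two pixels of known flux, plus a new source at pixel 4 that the
# prior DR's models do not account for.
image = np.array([0.0, 3.0, 0.0, 0.0, 2.0, 0.0])
prior = [np.array([0.0, 3.0, 0.0, 0.0, 0.0, 0.0])]
new_pixels = residual_detections(image, prior, threshold=1.0)
```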

            And yes Jim Bosch, the scarlet outputs are guaranteed to be smaller by the PSF radius in all directions, which is significant for faint and point-like sources (which is the majority of objects).

            If there are no objections by Monday I'll adopt this RFC and begin to update DM-32436 accordingly to implement the single catalog deblender output with a renderCatalog method as outlined above in https://jira.lsstcorp.org/browse/RFC-825?focusedCommentId=500780&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-500780.


              People

              Assignee:
              fred3m Fred Moolekamp
              Reporter:
              fred3m Fred Moolekamp
              Watchers:
              Dan Taranu, Fred Moolekamp, Jim Bosch, Kian-Tat Lim, Robert Lupton

