RFC-825

Storage concerns for deblender outputs


    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      When DM-32436 is implemented, ScarletDeblendTask will once again return two different sets of output catalogs: one set contains a catalog for each band using the scarlet models (convolved to the observed seeing), and the other contains a catalog for each band that uses the scarlet models to re-weight the observed images so that flux is conserved (similar to the SDSS and old HSC deblender outputs). The two sets of catalogs are identical except for their heavy footprints (even though the spansets for each footprint are identical).

      The reviewer for DM-32436 brought up the excellent point that this is redundant information: since the catalog with the scarlet models can easily be used to generate the conserved-flux footprints, it might not warrant doubling the output from the deblender. So I'm wondering if we should continue to output just the scarlet models and have downstream measurement tasks contain a config parameter to re-weight the deblender outputs, which they would do after the catalog has been loaded. The main counter-argument that I can imagine is that if we have to do this multiple times it could become computationally expensive, so I'm curious if anyone knows how frequently we load catalogs from disk in the DRP pipeline, for example.

      Can anyone think of any other arguments for or against producing a single deblender model catalog in each band and doing the re-weighting when the data is loaded (if so desired)?
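      For concreteness, the flux re-weighting step under discussion can be sketched as follows. This is a minimal illustration of the SDSS-style re-apportioning idea, not stack code; `reweight_fluxes` and all names in it are hypothetical:

```python
# Hypothetical sketch of the flux-conserving re-weighting step: the scarlet
# models are used as templates to apportion the observed flux among
# overlapping children, so the children sum back to the observed image.

def reweight_fluxes(models, observed, eps=1e-12):
    """Re-distribute observed pixel flux among children.

    models:   list of per-child model pixel values over a shared span
    observed: observed pixel values over the same span
    Returns one flux-conserved pixel list per child.
    """
    totals = [sum(pix) for pix in zip(*models)]  # per-pixel model sum
    reweighted = []
    for model in models:
        child = [
            obs * (m / tot) if tot > eps else 0.0
            for m, tot, obs in zip(model, totals, observed)
        ]
        reweighted.append(child)
    return reweighted

# Two overlapping children: each child's share of a pixel is proportional
# to its model value there, so the children sum to the observed flux.
children = reweight_fluxes([[1.0, 1.0], [1.0, 3.0]], observed=[4.0, 8.0])
```

      Because this step only needs the persisted models plus the observed exposure, a downstream task could in principle apply it after loading the model catalog, which is the config-on-load option being asked about above.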

            Activity

            ktl Kian-Tat Lim added a comment -

            I don't have opinions on the details of this, but in general I'm against storing multiple copies of heavy footprints (especially in any final data products but even in intermediates).

            fred3m Fred Moolekamp added a comment -

            That brings up a good point. In my opinion this may only be a temporary feature. I imagine that by the time we are making data releases for Rubin we'll have decided whether to use scarlet models or flux re-weighted models in our production pipeline, since a single pipeline has to choose one output over the other. For now the scarlet models don't appear to be good enough and the flux re-weighted models are winning, but that might change. If we end up using the scarlet models in production then I don't think that we'll need to save the flux re-weighted models, because they can easily be recreated by downstream users who might prefer them for their science case. If we use the flux re-weighted models then we'll probably want to keep a copy of the scarlet models because some users might prefer them.

            rhl Robert Lupton added a comment -

            I'm confused. If you're only fitting models (the first branch) then why don't you just put out the model parameters as measurements (e.g. scarletFlux_g), and always return the measurements on the flux-conserving children? If someone wants to know other parameters of your model, or the PSF-convolved realisation of the models, then can't that be done by providing utility functions?

            So one catalogue, with scarlet model parameters as a type of measurement?

            fred3m Fred Moolekamp added a comment -

            The scarlet models are pixelated models similar to the "templates" in the SDSS-like implementation in the stack, not parametric models. So the catalog with the scarlet models would be the equivalent of storing the templates prior to the re-apportion flux step in meas_deblender, only in this case those initial models are iteratively fit so that in general they should result in better re-weighted footprints.

            dtaranu Dan Taranu added a comment -

            Since I brought it up in DM-32436, my opinion mostly as a potential user of deblender outputs:

            • deepCoadd_calexp should have the fiducial deblended children used for measurements. If you say flux-reweighted (conserving) models are preferable, so be it. For subsequent parametric model fitting, that is indeed what we want (regardless of whether other algorithms perform better on the non-reweighted models) - fitting smooth models to other smooth models has proven problematic.
            • deepCoadd_deblendedModel should have the original Scarlet models if the calexp has the flux-reweighted versions.
            • deepCoadd_deblendedFlux doesn't seem necessary in this case. I think at some point you mentioned that one or the other would have models for isolated sources only, but if you're doing reweighting by default, those are also redundant, no?

            If you were envisioning a future where users actually regularly switch between using the flux-reweighted models or not, then perhaps it would make sense for the deepCoadd_deblended* corresponding to deepCoadd_calexp to just be a reference/symlink to that catalog, but I don't think that's possible in Gen3 (correct me if I'm wrong). Otherwise, if you think the default is only going to change rarely, then I think it's preferable to omit the redundant catalog.

            jbosch Jim Bosch added a comment -

            I think my only strong opinion here is that we do not use the same dataset type to store different things; I'd prefer for downstream tasks to explicitly opt-in to which of these they want to use (by changing a connection configuration option) rather than implicitly getting whichever one the deblender produced.

            That suggests (but I think does not require) that we want the deblender task to be configurable to produce either or both, and that makes the answer to this RFC something we can change in configuration, which I think is a good thing. Like Kian-Tat Lim, I think we should work hard to avoid persisting both in major productions (and certainly official DRs), by avoiding having different downstream tasks rely on having any particular one available, but I think it's a very useful thing to be able to do during development.

            fred3m Fred Moolekamp added a comment -

            Dan Taranu brought up a good point in a discussion the other day, which is that right now we are storing only the convolved scarlet models. The problem with this is that if anyone wants to make measurements on the deconvolved models they are SOL, since those are discarded. One motivation for storing the scarlet deconvolved models is for future processing. For example, in DR2 we will likely reach faster convergence using the DR1 models for sources (when possible), since they will already be deconvolved and the galaxy models shouldn't change much except in the low SNR regions. As far as I can tell, the only task that loads the deblender models in DRP is lsst.pipe.tasks.multiBand.MeasureMergedCoaddSourcesTask, so it looks to me like convolutions and re-distributing flux based on scarlet models would only be done once, if at all.

            So I want to propose the following:

            • The deblender will output a single catalog (deblendedModel) that has an entry for each component. Each component record will have a 1D SED array column (it already exists) and a 2D morphology HeavyFootprint that is the deconvolved scarlet model for the component. There will also be an entry in the catalog for each source, where a source is a parent of one or more components. The sources will have a parent that is the parent of the entire blend, and each component will have a parent that is the source containing that component.
            • A SourceCatalog.renderCatalog(Exposure exposure, bool conserveFlux) method will be created to convert the footprints in the deblendedModel catalog into a catalog for a given band, where exposure is the single-band exposure to render the models into (needed for its PSF) and conserveFlux is a boolean value that determines whether or not flux should be re-distributed from the exposure using the models as templates. An isRendered property will be added to SourceCatalog to let the user know if the catalog has already been rendered and prevent multiple convolutions, and an isConserved property will let the user know if the catalog contains raw models or flux-conserved models.
            • The deblendedModel catalog will be the input for lsst.pipe.tasks.multiBand.MeasureMergedCoaddSourcesTask, which will have config options to specify whether or not to convolveScarletModels or conserveFlux and will call deblendedModel.renderCatalog. Once the rendered catalog has been loaded into memory it can be passed to the measurement algorithms as necessary.
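            As a sanity check on the proposed flow, here is a minimal stand-in sketch. Only the names renderCatalog, isRendered, and isConserved come from the proposal above; DeblendedModelCatalog and FakeExposure are toy classes for illustration, not the actual afw/pipe_tasks API:

```python
# Illustrative stand-in for the proposed render-on-load flow. The classes
# here are toys; a real Exposure supplies the PSF, and a real catalog
# holds HeavyFootprints rather than bare pixel lists.

class FakeExposure:
    """Stand-in for a single-band exposure."""

    def convolve(self, model):
        return model  # identity "PSF convolution" for illustration


class DeblendedModelCatalog:
    """Holds the deconvolved per-component models the deblender persists."""

    def __init__(self, components):
        self.components = components
        self.isRendered = False   # models not yet convolved with a PSF
        self.isConserved = False  # flux not re-distributed from an exposure

    def renderCatalog(self, exposure, conserveFlux):
        """Convolve the models for one band; optionally conserve flux."""
        if self.isRendered:
            return self  # guard against repeated convolutions
        rendered = DeblendedModelCatalog(
            [exposure.convolve(m) for m in self.components]
        )
        rendered.isRendered = True
        rendered.isConserved = conserveFlux
        if conserveFlux:
            # A real implementation would re-apportion the exposure's
            # flux among the children here, using the models as templates.
            pass
        return rendered
```

            A measurement task would then call renderCatalog once after loading the persisted catalog; the isRendered guard keeps a repeated call from triggering repeated convolutions.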

            Thoughts?

            ktl Kian-Tat Lim added a comment -

            Using previous-DR outputs to "seed" the current DR is potentially fraught with provenance and reproducibility problems, e.g. if a DAC offers only the latest DR. It also has the potential for locking in incorrect local minima that might otherwise be avoided with more data. So we should be very careful before using this as a performance enhancement.

            fred3m Fred Moolekamp added a comment -

            I hadn't thought about the provenance issue, but I did think about the local minima. I'm not too concerned about that because we don't have to use the exact model from the previous DR (and wouldn't want to, for a number of reasons). But the key issue with initialization is that we are initializing a scarlet model in a partially deconvolved space with a set of observed images, so our initial images are always puffy. So my thought is that we can use the previous models with a sparsity constraint to correctly model the centers of sources as an effective deconvolution, with perhaps some noise on both the morphology and spectral arrays to give them more flexibility to move out of local minima. But I agree that this has to be done with care and compared to results using models initialized from scratch to make sure that we aren't biasing our results.

            jbosch Jim Bosch added a comment (edited) -

            I'm in favor of the new proposal, and that really feels like the only deblender output we should consider storing for delivery to science users (I think making deblender outputs a purely virtual dataset, i.e. something we always recompute from the image and catalog upon request, is also still in play).

            Like Kian-Tat Lim, I am very skeptical of using those across DRs; note that we won't even have the same set of proto-objects detected, so matching would be problematic even if provenance and local minima were not (and I think they are a problem from the perspective of exact reproducibility, even if they aren't necessarily a problem from the perspective of scientific biases). But storing the source of truth is important in its own right, and may be relevant for downstream pipeline steps, like projecting deblending to short-period coadds for forced photometry. And I bet those deblended models will be (or can be made to be) much smaller than any of the PSF-convolved derived HeavyFootprints.

            All that doesn't mean saving derived HeavyFootprints is out of the question - I think it's pretty likely that we'll want them as intermediates to avoid recomputing repeatedly in multiple downstream tasks. But I think we should try to think of that as an optimization in CPU/storage tradeoff space, so let's start without saving them and see how bad the CPU costs are.

            fred3m Fred Moolekamp added a comment -

            We can talk more about processing across DRs another time. I'll just note that one good way to do source detection in future DRs might be to subtract all of the scarlet models from the previous DR and look for missing sources in the residuals in a type of difference imaging. Of course there are complications to this but it still might be worth looking into.

            And yes, Jim Bosch, the scarlet outputs are guaranteed to be smaller by the PSF radius in all directions, which is significant for faint and point-like sources (the majority of objects).

            If there are no objections by Monday I'll adopt this RFC and begin to update DM-32436 accordingly to implement the single catalog deblender output with a renderCatalog method as outlined above in https://jira.lsstcorp.org/browse/RFC-825?focusedCommentId=500780&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-500780.


              People

              Assignee:
              fred3m Fred Moolekamp
              Reporter:
              fred3m Fred Moolekamp
              Watchers:
              Dan Taranu, Fred Moolekamp, Jim Bosch, Kian-Tat Lim, Robert Lupton

