Data Management / DM-33710

Output a single storage container from ScarletDeblendTask


    Details

    • Story Points:
      27
    • Sprint:
      DRP S22A, DRP S22B
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      The peer review in DM-32436 triggered RFC-825, where it was decided that it would be best to export a catalog containing only the scarlet models (SED and morphology) with the dimensions (tract, patch, skyMap), similar to the mergeDet catalog. This ticket is to implement that change, which will result in the creation of new catalogs and a utility function to convert those catalogs into a catalog in each band, convolved with the appropriate PSF and (if desired) using those models to re-distribute the flux from the observed images (calExps).
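      A minimal sketch of what that conversion could look like, using plain numpy/scipy stand-ins rather than the actual stack API; the function name, the (sed, morphology) model representation, and the flux re-distribution scheme shown here are illustrative assumptions:

      import numpy as np
      from scipy.signal import fftconvolve

      def models_to_band_catalog(models, band, psf, observed=None):
          """Turn band-agnostic scarlet models into per-source images for one band.

          models   : list of (sed, morphology) pairs; `sed` is a 1D array of
                     per-band amplitudes, `morphology` is a 2D array shared by all bands.
          band     : index of the band being reconstructed.
          psf      : 2D PSF kernel image for that band.
          observed : optional 2D observed image (calExp cutout); if given, the convolved
                     models are used as weights to re-distribute the observed flux.
          """
          # Convolve each source's shared morphology with the band PSF and scale
          # by that source's SED amplitude in this band.
          convolved = [sed[band] * fftconvolve(morph, psf, mode="same")
                       for sed, morph in models]
          if observed is None:
              return convolved
          # Flux re-distribution: each source receives the observed flux in
          # proportion to its fractional contribution to the total model.
          total = np.sum(convolved, axis=0)
          total = np.where(total == 0, 1.0, total)
          return [observed * model / total for model in convolved]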

            Activity

            fred3m Fred Moolekamp added a comment -

            I guess that depends on which version of the models we use, right? It's true that if we use the flux re-distributed models (like we currently use) then we would need to calculate shapes and centroids in each band separately. However, if we use the scarlet models then the morphology is (by definition) the same in all bands, so we can calculate a "multi-band" centroid and shape directly on the scarlet models. At some point it might be worth testing whether the centroid and shapes calculated on those models can be used for forced photometry, even if the forced photometry is done on flux-conserved models.
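            As a concrete illustration, a centroid and second moments could be measured once directly on the shared morphology; this is just a numpy sketch of the idea, not an existing measurement plugin:

            import numpy as np

            def model_moments(morphology):
                """Centroid and second moments of a (non-negative) scarlet morphology."""
                y, x = np.indices(morphology.shape)
                flux = morphology.sum()
                # First moments: the model centroid, valid in every band by construction.
                cx = (x * morphology).sum() / flux
                cy = (y * morphology).sum() / flux
                # Second moments about the centroid: a simple "multi-band" shape.
                ixx = ((x - cx) ** 2 * morphology).sum() / flux
                iyy = ((y - cy) ** 2 * morphology).sum() / flux
                ixy = ((x - cx) * (y - cy) * morphology).sum() / flux
                return (cx, cy), (ixx, iyy, ixy)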

            In terms of this ticket, I think that I've fixed everything and addressed all of your comments. Successful Jenkins run after including peer review fixes: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/36755/pipeline. I'll rebase and run Jenkins again before merging once the PRs are approved.

            fred3m Fred Moolekamp added a comment - edited

            Ok, I think the last thing outstanding was the disk space for the new footprints. Looking at the deblender output from w_2022_20, it looks like the total size of all the deblendedFlux catalogs in the rc2_subset (one for each band) was 1.5G. For this PR the catalog was 13M, with another 99M for the scarlet models, for a total of 112M. So the deblender outputs alone save an order of magnitude in disk space, even with an inefficient file format like JSON.

            Since the meas and forced_src catalogs just have a lot of data columns and there is still an output in each band, there isn't as much of a savings but it's still substantial (1.9G before vs 787M in this PR for meas, 2.0G vs 343M for forced_src). I don't understand why there is a larger savings in forced_src, since presumably the heavy footprints take up the same amount of disk space.

            In terms of the time to re-generate the footprints, it's a little complicated, but the short answer is that creating the PSF image is the most expensive part. When using %time in a Jupyter notebook it takes ~15-30ms (with a mode of ~20ms) to compute the PSF image for each blend. But after the PSF is loaded for a given coordinate, it appears that the PSF is cached, so creating the kernel image again takes only O(10^-6 s). Creating the footprints scales roughly linearly with the number of sources, so that (after getting the PSF once to cache it and remove it from the timing) creating footprints (convolving each source, re-distributing the flux, and creating the HeavyFootprints) for a blend with 15 sources takes ~20ms (~1.4ms/source), while creating footprints for a blend with 2 sources takes ~3.4ms (~1.1ms/source). So speeding up the calculation of the PSF image would go a long way, but we're still looking at under 2 minutes to generate footprints for a catalog with ~25k sources (see below).
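            To make the amortization explicit, here is a minimal numpy/scipy sketch of the loop structure those timings describe; the container shapes, the psf_image_at callable, and the function name are illustrative assumptions, not the actual task code:

            import numpy as np
            from scipy.signal import fftconvolve

            def rebuild_footprints(blends, psf_image_at):
                """Re-create per-source images, evaluating the PSF once per blend.

                blends       : iterable of dicts with a "center" (x, y) and a list of
                               2D "models" (one per source in the blend).
                psf_image_at : callable returning the PSF kernel image at (x, y);
                               this is the ~20ms step in the timings above.
                """
                footprints = []
                for blend in blends:
                    # One expensive PSF evaluation per blend, re-used for every source.
                    psf = psf_image_at(*blend["center"])
                    for model in blend["models"]:
                        # Cheap per-source step (~1-2ms): convolve the model with the
                        # cached PSF; flux re-distribution and HeavyFootprint creation
                        # would follow here in the real task.
                        footprints.append(fftconvolve(model, psf, mode="same"))
                return footprints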

            The last consideration is that, now that the footprints and catalogs are stored separately, any time a catalog is loaded by a downstream task that doesn't require footprints (which is basically everything downstream of forced photometry in drp_pipe), we save the time and data transfer of loading the footprints. That should also be substantial and help recover the cost of the footprint creation in measurement and forced photometry.

            TL;DR

                                Current pipeline   This PR
            deblender output    1.5G               112M
            meas output         1.9G               787M
            forced output       2.0G               345M

            Timings for rc2_subset (~25k sources)

            action                                                  approx runtime
            psf generation/blend                                    ~20ms
            footprint creation/source (after psf generation)        ~1-2ms
            footprint creation/source (including psf generation)    ~5ms
            load catalog before                                     ~1.5s
            load and transform catalog with this PR                 ~2min*
            * This is also the total time added to the pipeline in order to incorporate this change, since this is only done in measurement and forced photometry, and it is already being done once after deblending in the current pipeline.
            jbosch Jim Bosch added a comment -

            At some point it might be worth testing to see if the centroid and shapes calculated on those models can be used for forced photometry, even if the forced photometry is done on flux conserved models.

            This is a very good idea for a lot of things - certainly for centroids and second-moment shapes it's at least as principled as anything else I can think of doing, and we should make plans to go do it.  It's probably not the way to go for galaxy model-fitting, but the goal is to make that its own multi-band step rather than a meas+ref+forced sequence anyway.  We're not particularly close to that, and I'm sure there are some single-band unforced measurements that are of interest to certain science cases, so I'm skeptical we'll ever get rid of the unforced measurements entirely - but driving the forced photometry from scarlet-model measurements instead of merged unforced measurements is definitely worth pursuing.  The final blocker will probably be taking the plunge of dropping CModel in favor of a multi-band ngmix and/or multiprofit modeling PipelineTask, but we can get started before that.

             

            Great news on all the timing and storage costs!  Thanks for the detailed breakdown.  I'll go back and look at the final PRs now, and expect to hit Reviewed here shortly.

            jbosch Jim Bosch added a comment -

            Looks good, and as far as I can tell ready to merge after some squashing.

            fred3m Fred Moolekamp added a comment -

            Successful Jenkins build after rebase: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/36781/pipeline/

              People

              Assignee:
              fred3m Fred Moolekamp
              Reporter:
              fred3m Fred Moolekamp
              Reviewers:
              Jim Bosch
              Watchers:
              Fred Moolekamp, Jim Bosch

                Dates

                Created:
                Updated:
                Resolved:
