DM-30647 (Data Management)

Compare the data products of the gen2 vs. gen3 w_2021_22 RC2 runs up to Single Frame Processing

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Story Points: 4
    • Sprint: DRP S21b
    • Team: Data Release Production
    • Urgent?: No

      Description

      Following on from DM-29819 (and the fixes implemented for issues uncovered there), perform the same comparison on the RC2 dataset run on both middleware platforms for the w_2021_22 processing runs.

            Activity

            lauren Lauren MacArthur added a comment - - edited

            I have compared the gen2 vs. gen3 outputs for "[almost] all the things"** as far as SFM data products go for every single exposure in the RC2. As in DM-29819 and DM-28858, this includes:

            • image arrays (image, variance, mask planes)
            • photoCalib objects
            • PSFs
            • WCSs
            • every column in every row in the source tables

            [**at least one exception is the {{srcMatch}} catalogs, which have not been explicitly checked, but the matching itself is effectively checked in the {{src}} catalog check via the {{calib_*}} flags. Another is that I only specifically compared the afwTable {{src}} catalogs, but not the postprocess parquet source tables. I can create another ticket specifically for that if desired.]
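
            For readers who want a feel for what an "every column in every row" check involves, here is a minimal sketch. It is not the attached script: the tables are plain dicts of lists standing in for source catalogs, the helper names are hypothetical, and the values (including the size of the deblend_peakId offset) are invented for illustration.

```python
import math


def values_equal(a, b):
    """Exact equality, except that two NaN floats count as equal."""
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return a == b


def compare_tables(table_a, table_b):
    """Return a list of (column, row_index, value_a, value_b) mismatches.

    Each table is a dict mapping column name -> list of values; this is a
    hypothetical in-memory stand-in for a source catalog.
    """
    diffs = []
    for column in sorted(set(table_a) | set(table_b)):
        if column not in table_a or column not in table_b:
            # Column present in only one table: record which side has it.
            diffs.append((column, None, column in table_a, column in table_b))
            continue
        col_a, col_b = table_a[column], table_b[column]
        if len(col_a) != len(col_b):
            # Row counts differ: record the two lengths.
            diffs.append((column, None, len(col_a), len(col_b)))
            continue
        for i, (a, b) in enumerate(zip(col_a, col_b)):
            if not values_equal(a, b):
                diffs.append((column, i, a, b))
    return diffs


# Invented toy values; only the column names echo the ticket.
gen2 = {"id": [1, 2, 3], "flux": [10.5, float("nan"), 7.25],
        "deblend_peakId": [100, 101, 102]}
gen3 = {"id": [1, 2, 3], "flux": [10.5, float("nan"), 7.25],
        "deblend_peakId": [101, 102, 103]}

diffs = compare_tables(gen2, gen3)
for column, row, a, b in diffs:
    print(f"{column}[{row}]: gen2={a} gen3={b}")
```

            Treating NaN as equal to NaN matters here because an unmeasured quantity is commonly stored as NaN, and a naive `==` would flag every such row as a difference.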

            In case anyone wants to know exactly what these comparisons comprised, I have attached the (somewhat hacky and not meant for consumption) script I ran (and will also run on the DC2 w_2021_24 runs of DM-30730 & DM-30674).

            The only outstanding differences are as follows (full logs from my script are in /datasets/hsc/repo/rerun/private/lauren/w22_gen2_vs_gen3/logs):

            The gen3 run had 15 instances of failed SFM:

            compareSfd_gen2_vs_gen3_RC2_GAMA_G.sh:
                 No gen3 calexp found for HSC-G 26036 30
                 No gen3 calexp found for HSC-G 26048 69
            compareSfd_gen2_vs_gen3_RC2_GAMA_I.sh:
                 No gen3 calexp found for HSC-I 1290 17
            compareSfd_gen2_vs_gen3_RC2_GAMA_Y.sh:
                 No gen3 calexp found for HSC-Y 27032 41
            compareSfd_gen2_vs_gen3_RC2_VVDS_R.sh:
                 No gen3 calexp found for HSC-R 34640 69
            compareSfd_gen2_vs_gen3_RC2_VVDS_Z.sh:
                 No gen3 calexp found for HSC-Z 36498 47
            compareSfd_gen2_vs_gen3_RC2_COSMOS_G.sh:
                 No gen3 calexp found for HSC-G 11692 48
                 No gen3 calexp found for HSC-G 11706 97
            compareSfd_gen2_vs_gen3_RC2_COSMOS_Z.sh:
                 No gen3 calexp found for HSC-Z 17944 27
            compareSfd_gen2_vs_gen3_RC2_COSMOS_Y.log:
                 No gen3 calexp found for HSC-Y 354 95
                 No gen3 calexp found for HSC-Y 356 92
                 No gen3 calexp found for HSC-Y 1868 77
                 No gen3 calexp found for HSC-Y 22662 88
            compareSfd_gen2_vs_gen3_RC2_COSMOS_NB0921.log:
                 No gen3 calexp found for NB0921 23038 55
                 No gen3 calexp found for NB0921 23602 25
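
            As a rough illustration of how "No gen3 ... found" lines like those above can be tallied and compared against a separate list of failed quanta, here is a minimal sketch. The regular expression assumes the line layout is filter, visit, detector; the `failed_quanta` set is a made-up stand-in, not the actual list on DM-30365.

```python
import re

# Matches lines such as "No gen3 calexp found for HSC-G 26036 30",
# assumed to be laid out as: <filter> <visit> <detector>.
MISSING_RE = re.compile(r"No gen3 (calexp|src catalog) found for (\S+) (\d+) (\d+)")


def parse_missing(log_text):
    """Return the set of (visit, detector) pairs reported missing in log_text."""
    missing = set()
    for line in log_text.splitlines():
        match = MISSING_RE.search(line)
        if match:
            missing.add((int(match.group(3)), int(match.group(4))))
    return missing


log_text = """
compareSfd_gen2_vs_gen3_RC2_GAMA_G.sh:
     No gen3 calexp found for HSC-G 26036 30
     No gen3 calexp found for HSC-G 26048 69
compareSfd_gen2_vs_gen3_RC2_VVDS_R.sh:
     No gen3 src catalog found for HSC-R 34714 22
"""

# Hypothetical list of failed (visit, detector) quanta, for illustration only.
failed_quanta = {(26036, 30), (26032, 52)}

missing = parse_missing(log_text)
explained = missing & failed_quanta            # in both lists
only_in_comparison = missing - failed_quanta   # missing, but not in the failure list
only_in_failed_list = failed_quanta - missing  # listed as failed, but not missing

print(sorted(explained), sorted(only_in_comparison), sorted(only_in_failed_list))
```

            The two set differences are the interesting part: entries missing from the comparison but absent from the failure list, and entries in the failure list that did not show up as missing.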
            

            I cross-matched these against Monika Adamow's list of 17 failed quanta on DM-30365. Most of them are explained by the HSM issue of DM-30426 (the branches with the fix were set up for the gen2 run, so they don't fail there), except for the following:

            The following case did have a calexp, but no src catalog:

            compareSfd_gen2_vs_gen3_RC2_VVDS_R.sh:
                 No gen3 src catalog found for HSC-R 34714 22
            

            The log tells me:

            $ more /scratch/madamow/gen3_bps/submit/HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z/jobs/calibrate/97965_calibrate_34714_22.2366851.err
             
                raise RuntimeError(f"Registry inconsistency while checking for existing outputs:"
            RuntimeError: Registry inconsistency while checking for existing outputs: collection=HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z
              existingRefs=[DatasetRef(DatasetType('srcMatchFull', {band, instrument, detector, physical_filter, visit_system, visit}, Catalog), {instrument: 'HSC', detector: 22, visit: 34714, ...}, id=16996657, run='HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z'),
              DatasetRef(DatasetType('calexp', {band, instrument, detector, physical_filter, visit_system, visit}, ExposureF), {instrument: 'HSC', detector: 22, visit: 34714, ...}, id=16996663, run='HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z')]
              missingRefs=[DatasetRef(DatasetType('calexpBackground', {band, instrument, detector, physical_filter, visit_system, visit}, Background), {instrument: 'HSC', detector: 22, visit: 34714, ...}),
              DatasetRef(DatasetType('srcMatch', {band, instrument, detector, physical_filter, visit_system, visit}, Catalog), {instrument: 'HSC', detector: 22, visit: 34714, ...}),
              DatasetRef(DatasetType('src', {band, instrument, detector, physical_filter, visit_system, visit}, SourceCatalog), {instrument: 'HSC', detector: 22, visit: 34714, ...}),
              DatasetRef(DatasetType('calibrate_metadata', {band, instrument, detector, physical_filter, visit_system, visit}, PropertySet), {instrument: 'HSC', detector: 22, visit: 34714, ...})]
            

            but there's also this log there saying:

            $ more /scratch/madamow/gen3_bps/submit/HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z/jobs/calibrate/97965_calibrate_34714_22.2239756.err
             
            lsst::afw::fits::FitsError: 'cfitsio error: couldn't create the named file (105) : Opening file '/repo/main/HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z/srcMatch/20150715/r/HSC-R/34714/srcMatch_HSC_r_HSC-R_34714_0_27_HSC_runs_RC2_w_2021_22_DM-30365_20210527T164119Z.fits' with mode 'w'
            cfitsio error stack:
              Warning: the following keyword does not conform to the HIERARCH convention
              HIERARCH AFW_TABLE_VERSION = 3
              Warning: the following keyword does not conform to the HIERARCH convention
              HIERARCH AFW_TABLE_VERSION = 3
              Warning: the following keyword does not conform to the HIERARCH convention
              HIERARCH AFW_TABLE_VERSION = 3
              Warning: the following keyword does not conform to the HIERARCH convention
              HIERARCH AFW_TABLE_VERSION = 3
              failed to create new file (already exists?):
              /repo/main/HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z/srcMatch/20150715/r/
              HSC-R/34714/srcMatch_HSC_r_HSC-R_34714_0_27_HSC_runs_RC2_w_2021_22_DM-30365_2021
              0527T164119Z.fits
            ...
                raise RuntimeError(f"Failed to serialize dataset {ref} of type {type(inMemoryDataset)} "
            RuntimeError: Failed to serialize dataset srcMatch@{instrument: 'HSC', detector: 22, visit: 34714, ...}, sc=Catalog] (id=16996706) of type <class 'lsst.afw.table.BaseCatalog'> to location file:///repo/main/HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z/srcMatch/20150715/r/HSC-R/34714/srcMatch_HSC_r_HSC-R_34714_0_27_HSC_runs_RC2_w_2021_22_DM-30365_20210527T164119Z.fits
            

            Finally, the following didn't turn up as "missing" for me, but was in Monika's list of failures:

            $ /scratch/madamow/gen3_bps/submit/HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z/jobs/calibrate/128553_calibrate_26032_52.2239777.err
            lsst::afw::fits::FitsError: 'cfitsio error: couldn't create the named file (105) : Opening file '/repo/main/HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z/srcMatchFull/20150325/g/HSC-G/26032/srcMatchFull_HSC_g_HSC-G_26032_1_14_HSC_runs_RC2_w_2021_22_DM-30365_20210527T164119Z.fits' with mode 'w'
            cfitsio error stack:
              Warning: the following keyword does not conform to the HIERARCH convention
              HIERARCH AFW_TABLE_VERSION = 3
              Warning: the following keyword does not conform to the HIERARCH convention
              HIERARCH AFW_TABLE_VERSION = 3
              Warning: the following keyword does not conform to the HIERARCH convention
              HIERARCH AFW_TABLE_VERSION = 3
              Warning: the following keyword does not conform to the HIERARCH convention
              HIERARCH AFW_TABLE_VERSION = 3
              failed to create new file (already exists?):
              /repo/main/HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z/srcMatchFull/2015032
              5/g/HSC-G/26032/srcMatchFull_HSC_g_HSC-G_26032_1_14_HSC_runs_RC2_w_2021_22_DM-30
              365_20210527T164119Z.fits
                raise RuntimeError(f"Failed to serialize dataset {ref} of type {type(inMemoryDataset)} "
            RuntimeError: Failed to serialize dataset srcMatchFull@{instrument: 'HSC', detector: 52, visit: 26032, ...}, sc=Catalog] (id=16996708) of type <class 'lsst.afw.table.BaseCatalog'> to location file:///repo/main/HSC/runs/RC2/w_2021_22/DM-30365/20210527T164119Z/srcMatchFull/20150325/g/HSC-G/26032/srcMatchFull_HSC_g_HSC-G_26032_1_14_HSC_runs_RC2_w_2021_22_DM-30365_20210527T164119Z.fits
            

            I suspect those latter two boil down to some "having to reprocess" certain quanta issues?

            Beyond that, the offset in the deblend_peakId for each and every src catalog, first noted in DM-28858, still persists. (I think Jim Bosch thought this was fixed, so it may be significant that it's not. However, I don't think this value is used in any downstream processing, so it is not likely to cause any issues, except perhaps in the rare and unlikely case of someone doing their own analyses that make use of this column.) So, from my perspective, gen2/gen3 parity is essentially achieved for all visit/ccd exposures comprising the RC2 dataset.

            lauren Lauren MacArthur added a comment -

            Could you give this a look when you get a chance?  Given my conclusion, I'm going to move on to the post-SFM gen2/gen3 comparisons for HSC regardless (as well as the SFM comparisons for DC2 runs), so no rush, but the minor "issues" noted above may be something you feel needs further investigation, so do give them a quick look when you have a spare moment.

            lauren Lauren MacArthur added a comment -

            Oh, and I also ran the compareVisitAnalysis.py script on a couple of visits (having all external calibrations turned off, so truly representative of SFM processing alone). I'd post plots, but they are all the most boring of flatlines and had a 100% match rate, so really not worth looking at!

            jbosch Jim Bosch added a comment -

            I only specifically compared the afwTable src catalogs, but not the postprocess parquet source tables.

            Let's make sure we come back to this when we're ready to sign off on the algorithmic pieces being equivalent, but in the meantime I'm happy for you to focus on the algorithmic tasks and ignore these if your tooling is in better shape on the afw.table/FITS outputs.

            I suspect those latter two boil down to some "having to reprocess" certain quanta issues?

            I was aware of these already, and I'm not sure if they're temporary hardware-ish glitches, something in the middleware, or some combination of the two. If you see more of them, let us know over in #dm-middleware-dev, but otherwise don't worry about them. I hadn't seen the "registry inconsistency error" before, but I'm pretty sure it's just what happens downstream due to the cfitsio errors.

            On to the next comparison!

            lauren Lauren MacArthur added a comment -

            Great, thanks Jim.  I've created DM-30817 for the postprocess data product comparisons, and I will definitely keep an eye out for the unusual gen3 errors noted above.


              People

              Assignee:
              lauren Lauren MacArthur
              Reporter:
              lauren Lauren MacArthur
              Reviewers:
              Jim Bosch
              Watchers:
              Jim Bosch, Lauren MacArthur, Yusra AlSayyad
