Data Management / DM-27027

Enable persistence of "source" parquet tables in obs_subaru


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: obs_subaru
    • Labels: None
    • Story Points: 1
    • Sprint: DRP F20-3 (Aug)
    • Team: Data Release Production
    • Urgent?: No

      Description

      In DM-22266, the pipe_analysis scripts are being converted to have the option of reading in the parquet files now being persisted via pipe_tasks's postprocessing.py script. For the coadds, we read in the deepCoadd_obj datasets (added on DM-13770), which consist of a single parquet file per patch that merges the deepCoadd_meas, deepCoadd_forced_src, and deepCoadd_ref tables for all filters. As such, these all follow the same column naming conventions as their afwTable equivalents (i.e. they do not follow the DPDD-ified naming conventions; those are persisted as the objectTable dataset). For the visit-level QA analysis in pipe_analysis, we would like to follow a similar pattern, but using the parquet equivalent of the src catalogs. The dataset for these is source (added as part of DM-24062). However, the default in singleFrameDriver.py is to NOT persist these nor the DPDD-ified sourceTable versions. The latter got a config override in obs_subaru, so they do get persisted for HSC processing. While we may eventually move to using only the DPDD-ified tables, in the interim, having the parquet tables with the original column names is desired (especially for maintaining the ability to read in the afwTables for older repos that didn't get the postprocessing.py step and so have no parquet output).
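
      For concreteness, a minimal sketch of reading one of these merged per-patch tables back through the Gen2 butler (the rerun path and dataId values here are hypothetical; the (dataset, filter, column) column levels are as written by postprocessing.py):

             import lsst.daf.persistence as dafPersist

             # Hypothetical rerun path and dataId; deepCoadd_obj is one file per
             # patch merged over all filters, so no filter is needed in the dataId.
             butler = dafPersist.Butler("/datasets/hsc/repo/rerun/someRerun")
             dataId = {"tract": 9813, "patch": "4,4"}
             obj = butler.get("deepCoadd_obj", dataId=dataId)
             # The columns form a (dataset, filter, column) hierarchy, so the
             # deepCoadd_meas-equivalent columns for one filter come out directly.
             df = obj.toDataFrame()
             meas = df["meas"]["HSC-I"]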

      As such, the doSaveWideSourceTable config in singleFrameDriver.py, which currently defaults to False, will also be overridden in obs_subaru to True so that these visit-level src-like parquet files will be available for future RC2 processing runs.
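
      The override itself should just be a one-line config file; a plausible sketch (the exact file location within obs_subaru is assumed here):

             # config/singleFrameDriver.py in obs_subaru (assumed path)
             # Persist the visit-level "source" parquet tables for HSC processing.
             config.doSaveWideSourceTable = True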

      It will be for a future decision (and RFC!) whether any of the defaults should be changed in singleFrameDriver.py itself and/or whether we move to using only the DPDD-ified versions.


            Activity

            Yusra AlSayyad added a comment -

            This is just a config change, right?
            Lauren MacArthur added a comment -

            Correct.  In fact, it is just an override being added to obs_subaru, so only HSC processing will be affected.
            Lauren MacArthur added a comment -

            Having run the following:

            singleFrameDriver.py /datasets/hsc/repo --calib /datasets/hsc/repo/CALIB/ --rerun private/lauren/DM-27027 --batch-type slurm --mpiexec='-bind-to socket' --job sfd_DM-27027 --id visit=1228 filter=HSC-I --cores 24 --time 900
            

            I can confirm that the source files are now being persisted, e.g.

            In [1]: import lsst.daf.persistence as dafPersist
            In [2]: rootDir = "/datasets/hsc/repo/rerun/private/lauren/DM-27027"
            In [3]: butler = dafPersist.Butler(rootDir) 
            In [4]: dataId = {"visit": 1228, "filter": "HSC-I", "ccd": 49}
            In [5]: source = butler.get("source", dataId=dataId)
            In [6]: source = source.toDataFrame()
            In [7]: source.columns                                                                         
            Out[7]: 
            Index(['coord_ra', 'coord_dec', 'parent', 'calib_detected',
                   'calib_psf_candidate', 'calib_psf_used', 'calib_psf_reserved',
                   'deblend_nChild', 'deblend_deblendedAsPsf', 'deblend_psfCenter_x',
                   ...
                   'ext_photometryKron_KronFlux_apCorr',
                   'ext_photometryKron_KronFlux_apCorrErr',
                   'ext_photometryKron_KronFlux_flag_apCorr',
                   'base_ClassificationExtendedness_value',
                   'base_ClassificationExtendedness_flag', 'base_FootprintArea_value',
                   'calib_astrometry_used', 'calib_photometry_used',
                   'calib_photometry_reserved', 'ccdVisitId'],
                  dtype='object', length=396)
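
             As a usage note, the full table has 396 columns; a sketch (assuming the optional columns argument of the Gen2 ParquetTable.toDataFrame API) for reading just a subset:

             # Sketch: load only the needed columns rather than all 396
             # (the columns argument is assumed from the ParquetTable API).
             coords = butler.get("source", dataId=dataId).toDataFrame(
                 columns=["coord_ra", "coord_dec"])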
            

            Lauren MacArthur added a comment - edited

            Would you mind giving this a look? I'm pinging you since I also added the override to the processCcdWithFakesDriver.py config file in obs_subaru and wanted to make sure you're on board with that.

            Jenkins is green.
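
            Presumably that override mirrors the singleFrameDriver one; a hypothetical sketch (the config attribute name is assumed to be the same):

             # config/processCcdWithFakesDriver.py in obs_subaru (assumed)
             config.doSaveWideSourceTable = True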

            Sophie Reed added a comment -

            Sounds good to me.
            Lauren MacArthur added a comment -

            Thanks Sophie! I'm sufficiently paranoid and ran another Jenkins just in case. It passed, so merged to master.

              People

              Assignee:
              Lauren MacArthur
              Reporter:
              Lauren MacArthur
              Reviewers:
              Sophie Reed
              Watchers:
              Eric Morganson [X] (Inactive), Lauren MacArthur, Sophie Reed, Tim Morton [X] (Inactive), Yusra AlSayyad
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.