Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-22266

Adapt pipe_analysis to run on post-processing parquet tables in Gen2

    XMLWordPrintable

Details

    • Story
    • Status: Done
    • Resolution: Done
    • None
    • pipe_analysis
    • None
    • 24
    • DRP F19-6 (Nov), DRP S20-2 (Jan), DRP S20-3 (Feb), DRP S20-5 (Apr), DRP S20-6 (May), DRP F20-1 (June), DRP F20-2 (July), DRP F20-3 (Aug), DRP F20-5 (Oct)
    • Data Release Production
    • No

    Description

      As a step towards converting the pipe_analysis scripts to Gen3, adapt the scripts in pipe_analysis to work with the parquet tables produced in the pipe_tasks postprocessing.py step. The reading of afwTables will be deprecated eventually, but we still leave the ability to do so for compatibility with older datasets for which the postprocessing.py step was not run (so no parquet catalog tables exist). As such, the first step here is to convert all functionality to be run with pandas DataFrames (so, if reading in afwTable catalogs, convert them immediately to pandas DataFrames). Then add a function to read in the *_obj parquet tables and add a config parameter to control which tables are read in (default it to read the parquet catalogs).

      Attachments

        Issue Links

          Activity

            There was quite a bit of scope-creep of a sort in this ticket, but this is now all working.  The creeping mainly involved a lot of restructuring of functions to minimize code duplication, but also a new philosophy for the "match" catalogs.  Rather than just "denormalizing" the persisted match catalog (which includes only reference/source pairs that were actually used in the astrometric calibration during single frame measurement, SFM) I've implemented a new function, loadReferencesAndMatchToCatalog() which loads in the full reference records for the same search radius used in SFM, but performs a generic match to the source catalog, regardless of whether the objects were used in any calibration step. Culling of the source catalog based on flags can (and is by default) be performed prior to matching with the exception that any source used as a SFM calibration source will be retained for further (sub)selection and analysis.

            All five scripts have been run with both doReadParquetTables=True and False (the latter intended to be used only for backwards compatibility for repos that do not have the persisted parquet tables) and comparing a large fraction of them show they are indeed identical. One very minor exception is revealed in the comparison plots at the visit level when applying an external photometric calibration. The application of this calibration goes through different paths for the parquet vs no parquet table loading, and there are ~machine epsilon level differences in the photometry (i.e. at the 10^-11 mmag level). The plots also all compare virtually identically with the latest RC2 run (w_2020_42) with the expected exception of the match plots (see comments above). All of the plots can be perused at:
            https://lsst.ncsa.illinois.edu/~lauren/lauren/DM-22266/parq
            https://lsst.ncsa.illinois.edu/~lauren/lauren/DM-22266/noParq
            (and can be compared with the most recent RC2 run at https://lsst.ncsa.illinois.edu/~emorgan2/w_2020_42_qaplots/)
            As for performance, the following is output from sacct (prepended with the table format) reveals a typical speed-up by factors ranging from ~1.5 (visitAnalysis) to ~4 (colorAnalysis) and maximum memory usage decreased by factors ranging from ~2 (visitAnalysis) to ~10 (colorAnalysis), so fairly significant wins all around!

            $ sacct -u lauren --units=G --format=jobid,jobname,CPUTime,Elapsed,MaxRSS,State --starttime 2020-11-09T16:26
            

            table JobID JobName CPUTime Elapsed MaxRSS State
            noParq 3258 visitAnal+ 03:51:12 00:09:38   COMPLETED
            noParq 3258.batch batch 03:51:12 00:09:38 2.81G COMPLETED
            parq 3259 visitAnal+ 02:29:36 00:06:14   COMPLETED
            parq 3259.batch batch 02:29:36 00:06:14 1.28G COMPLETED
            noParq 3264 coaddAnal+ 19:33:36 00:48:54   COMPLETED
            noParq 3264.batch batch 19:33:36 00:48:54 75.04G COMPLETED
            parq 3265 coaddAnal+ 11:54:24 00:29:46   COMPLETED
            parq 3265.batch batch 11:54:24 00:29:46 17.81G COMPLETED
            noParq 3268 colorAnal+ 1-05:32:48 01:13:52   COMPLETED
            noParq 3268.batch batch 1-05:32:48 01:13:52 67.03G COMPLETED
            parq 3269 colorAnal+ 07:17:12 00:18:13   COMPLETED
            parq 3269.batch batch 07:17:12 00:18:13 6.66G COMPLETED
            lauren Lauren MacArthur added a comment - There was quite a bit of scope-creep of a sort in this ticket, but this is now all working.  The creeping mainly involved a lot of restructuring of functions to minimize code duplication, but also a new philosophy for the "match" catalogs.  Rather than just "denormalizing" the persisted match catalog (which includes only reference/source pairs that were actually used in the astrometric calibration during single frame measurement, SFM) I've implemented a new function,  loadReferencesAndMatchToCatalog() which loads in the full reference records for the same search radius used in SFM, but performs a generic match to the source catalog, regardless of whether the objects were used in any calibration step. Culling of the source catalog based on flags can (and is by default) be performed prior to matching with the exception that any source used as a SFM calibration source will be retained for further (sub)selection and analysis. All five scripts have been run with both doReadParquetTables=True and False (the latter intended to be used only for backwards compatibility for repos that do not have the persisted parquet tables) and comparing a large fraction of them show they are indeed identical. One very minor exception is revealed in the comparison plots at the visit level when applying an external photometric calibration. The application of this calibration goes through different paths for the parquet vs no parquet table loading, and there are ~machine epsilon level differences in the photometry (i.e. at the 10^-11 mmag level). The plots also all compare virtually identically with the latest RC2 run ( w_2020_42 ) with the expected exception of the match plots (see comments above). All of the plots can be perused at: https://lsst.ncsa.illinois.edu/~lauren/lauren/DM-22266/parq https://lsst.ncsa.illinois.edu/~lauren/lauren/DM-22266/noParq (and can be compared with the most recent RC2 run at https://lsst.ncsa.illinois.edu/~emorgan2/w_2020_42_qaplots/ ) As for performance, the following is output from sacct  (prepended with the table format) reveals a typical speed-up by factors ranging from ~1.5 (visitAnalysis) to ~4 (colorAnalysis) and maximum memory usage decreased by factors ranging from ~2 (visitAnalysis) to ~10 (colorAnalysis), so fairly significant wins all around! $ sacct - u lauren - - units = G - - format = jobid,jobname,CPUTime,Elapsed,MaxRSS,State - - starttime 2020 - 11 - 09T16 : 26 table JobID JobName CPUTime Elapsed MaxRSS State noParq 3258 visitAnal+ 03:51:12 00:09:38   COMPLETED noParq 3258.batch batch 03:51:12 00:09:38 2.81G COMPLETED parq 3259 visitAnal+ 02:29:36 00:06:14   COMPLETED parq 3259.batch batch 02:29:36 00:06:14 1.28G COMPLETED noParq 3264 coaddAnal+ 19:33:36 00:48:54   COMPLETED noParq 3264.batch batch 19:33:36 00:48:54 75.04G COMPLETED parq 3265 coaddAnal+ 11:54:24 00:29:46   COMPLETED parq 3265.batch batch 11:54:24 00:29:46 17.81G COMPLETED noParq 3268 colorAnal+ 1-05:32:48 01:13:52   COMPLETED noParq 3268.batch batch 1-05:32:48 01:13:52 67.03G COMPLETED parq 3269 colorAnal+ 07:17:12 00:18:13   COMPLETED parq 3269.batch batch 07:17:12 00:18:13 6.66G COMPLETED

            Would you mind giving this a look over?  It's quite a lot of churn, but since it still lives in lsst-dm, you need not feel compelled to look over all changes (but if you choose to, any comments will be most welcomed and addressed!), and you can consider the plots generated as "evidence" of success. I am mostly hoping to have your keen eye on my handling of the pandas DataFrames. I am totally new to them, so may have gone astray in my implementation. The main commits involved are:
            Adapt use of "schemas" to work with DataFrames
            Adapt all functions to work with DataFrames
            Add list of all columns to load from parquet tables
            Add functionality to use obj/source parquet tables
            Adapt colorAnalysis to read in parquet tables

            lauren Lauren MacArthur added a comment - Would you mind giving this a look over?  It's quite a lot of churn, but since it still lives in lsst-dm , you need not feel compelled to look over all changes (but if you choose to, any comments will be most welcomed and addressed!), and you can consider the plots generated as "evidence" of success. I am mostly hoping to have your keen eye on my handling of the pandas DataFrames. I am totally new to them, so may have gone astray in my implementation. The main commits involved are: Adapt use of "schemas" to work with DataFrames Adapt all functions to work with DataFrames Add list of all columns to load from parquet tables Add functionality to use obj/source parquet tables Adapt colorAnalysis to read in parquet tables

            Sorry for taking so long on this. 

            I've made a number of comments on GH after looking at the commits you linked to above.  In addition to a number of minor comments/suggestions, my main question is whether the readParquetTables and/or readCatalogs methods implementation could be de-duplicated between coadd and color.

            tmorton Tim Morton [X] (Inactive) added a comment - Sorry for taking so long on this.  I've made a number of comments on GH after looking at the commits you linked to above.  In addition to a number of minor comments/suggestions, my main question is whether the  readParquetTables and/or readCatalogs methods implementation could be de-duplicated between coadd and color.

            Thanks so much for all the great feedback Tim!  I will definitely go through all of your comments and try to implement your suggestions (I agree that optimization may not be an urgent priority here, but I do think there is much benefit for me to get ever more savvy with my dataFrame-ing).

            As for your main question, it's probably possible, but we've been trying to keep colorAnalysis.py as non-dependent as possible as it is our canary for the Gen3 conversion.

            lauren Lauren MacArthur added a comment - Thanks so much for all the great feedback Tim!  I will definitely go through all of your comments and try to implement your suggestions (I agree that optimization may not be an urgent priority here, but I do think there is much benefit for me to get ever more savvy with my dataFrame-ing). As for your main question, it's probably possible, but we've been trying to keep colorAnalysis.py as non-dependent as possible as it is our canary for the Gen3 conversion.

            Makes sense! If there's a good reason to keep the color implementation separate, then no need to change that; it was just something that stuck out to me.

            tmorton Tim Morton [X] (Inactive) added a comment - Makes sense! If there's a good reason to keep the color implementation separate, then no need to change that; it was just something that stuck out to me.

            Allrighty...I think I've addressed & implemented all your great suggestions & comments.  Thank you so much for such a detailed review of a less-than-fun review request.  I've rerun all the scripts to make sure nothing in the plots changed from their previous iteration and all indeed looks identical.  Merged and done!

            lauren Lauren MacArthur added a comment - Allrighty...I think I've addressed & implemented all your great suggestions & comments.  Thank you so much for such a detailed review of a less-than-fun review request.  I've rerun all the scripts to make sure nothing in the plots changed from their previous iteration and all indeed looks identical.  Merged and done!

            People

              lauren Lauren MacArthur
              lauren Lauren MacArthur
              Tim Morton [X] (Inactive)
              Jim Bosch, John Parejko, Lauren MacArthur, Nate Lust, Sophie Reed, Tim Morton [X] (Inactive), Yusra AlSayyad
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Jenkins

                  No builds found.