Fix Version/s: None
Sprint: DRP F19-6 (Nov), DRP S20-2 (Jan), DRP S20-3 (Feb), DRP S20-5 (Apr), DRP S20-6 (May), DRP F20-1 (June), DRP F20-2 (July), DRP F20-3 (Aug), DRP F20-5 (Oct)
Team: Data Release Production
As a step towards converting the pipe_analysis scripts to Gen3, adapt the scripts in pipe_analysis to work with the parquet tables produced by the pipe_tasks postprocessing.py step. Reading afwTables will eventually be deprecated, but we retain the ability to do so for compatibility with older datasets on which the postprocessing.py step was not run (so no parquet catalog tables exist). As such, the first step here is to convert all functionality to run on pandas DataFrames (so, when reading afwTable catalogs, convert them immediately to pandas DataFrames). Then add a function to read in the *_obj parquet tables and a config parameter to control which tables are read in (defaulting to reading the parquet catalogs).
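Concretely, the conversion replaces afwTable Schema lookups with checks against the DataFrame's column index, and lets us load only a configured subset of columns. A minimal sketch of the idea (column names are illustrative stand-ins, not the actual pipe_analysis column list):

```python
import pandas as pd

# Illustrative stand-in for a catalog loaded from a *_obj parquet
# table (the real column list in pipe_analysis is much longer).
catalog = pd.DataFrame({
    "coord_ra": [1.234, 1.235],
    "coord_dec": [-0.567, -0.568],
    "base_PsfFlux_instFlux": [1500.0, 2300.0],
})


def hasColumns(df, names):
    """Return True if every requested column is present.

    With afwTables, column availability was checked via a Schema
    object; with DataFrames, the column index serves that purpose.
    """
    return all(name in df.columns for name in names)


# Loading/keeping only the columns actually needed keeps memory down.
subset = catalog[["coord_ra", "coord_dec"]]
```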
Would you mind giving this a look over? It's quite a lot of churn, but since it still lives in lsst-dm, you need not feel compelled to review every change (though if you choose to, any comments will be most welcome and addressed!), and you can consider the generated plots as "evidence" of success. I am mostly hoping to have your keen eye on my handling of the pandas DataFrames. I am totally new to them, so I may have gone astray in my implementation. The main commits involved are:
Adapt use of "schemas" to work with DataFrames
Adapt all functions to work with DataFrames
Add list of all columns to load from parquet tables
Add functionality to use obj/source parquet tables
Adapt colorAnalysis to read in parquet tables
Sorry for taking so long on this.
I've made a number of comments on GH after looking at the commits you linked to above. In addition to a number of minor comments/suggestions, my main question is whether the readParquetTables and/or readCatalogs method implementations could be de-duplicated between coadd and color.
Thanks so much for all the great feedback, Tim! I will definitely go through all of your comments and try to implement your suggestions (I agree that optimization may not be an urgent priority here, but I do think there is much benefit in my getting ever more savvy with my DataFrame-ing).
As for your main question: it's probably possible, but we've been trying to keep colorAnalysis.py as free of dependencies as possible, since it is our canary for the Gen3 conversion.
Makes sense! If there's a good reason to keep the color implementation separate, then no need to change that; it was just something that stuck out to me.
Alrighty... I think I've addressed and implemented all your great suggestions and comments. Thank you so much for such a detailed review of a less-than-fun review request. I've rerun all the scripts to make sure nothing in the plots changed from the previous iteration, and indeed everything looks identical. Merged and done!
There was quite a bit of scope creep of a sort on this ticket, but this is now all working. The creep mainly involved restructuring functions to minimize code duplication, but also a new philosophy for the "match" catalogs. Rather than just "denormalizing" the persisted match catalog (which includes only the reference/source pairs that were actually used in the astrometric calibration during single frame measurement, SFM), I've implemented a new function, loadReferencesAndMatchToCatalog(), which loads the full reference records within the same search radius used in SFM but performs a generic match to the source catalog, regardless of whether the objects were used in any calibration step. Culling of the source catalog based on flags can be (and by default is) performed prior to matching, with the exception that any source used as an SFM calibration source is retained for further (sub)selection and analysis.
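The pre-match culling step described above can be sketched roughly as follows in pandas (a hedged sketch, not the actual implementation: the bad-flag list and column names are assumptions following common afw conventions, and loadReferencesAndMatchToCatalog() itself is not reproduced here):

```python
import pandas as pd


def cullSources(sources, badFlags=("base_PixelFlags_flag_saturated",)):
    """Drop sources with any bad flag set prior to matching, but
    always retain sources used in the SFM astrometric calibration
    so they remain available for further (sub)selection/analysis.
    """
    bad = pd.Series(False, index=sources.index)
    for flag in badFlags:
        bad |= sources[flag].astype(bool)
    # Keep unflagged sources, plus any SFM calibration source.
    keep = ~bad | sources["calib_astrometry_used"].astype(bool)
    return sources[keep]
```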
All five scripts have been run with both doReadParquetTables=True and False (the latter intended only for backwards compatibility with repos that do not have the persisted parquet tables), and comparison of a large fraction of the outputs shows they are indeed identical. One very minor exception shows up in the comparison plots at the visit level when applying an external photometric calibration: the application of this calibration goes through different code paths for the parquet vs. non-parquet loading, and there are ~machine-epsilon-level differences in the photometry (i.e. at the 10^-11 mmag level). The plots also all compare virtually identically with the latest RC2 run (w_2020_42), with the expected exception of the match plots (see comments above). All of the plots can be perused at:
(and can be compared with the most recent RC2 run at https://lsst.ncsa.illinois.edu/~emorgan2/w_2020_42_qaplots/)
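The numerical comparison described above amounts to checking that the two code paths agree to well below any scientifically meaningful level. A sketch of that kind of check (the function name and tolerance are illustrative, chosen to comfortably absorb the ~10^-11 mmag round-off differences):

```python
import numpy as np


def photometryMatches(magA, magB, tolMmag=1e-10):
    """Return True if two magnitude arrays agree to within tolMmag
    millimagnitudes, i.e. the parquet and afwTable paths produce
    differences no larger than round-off (~1e-11 mmag here)."""
    diffMmag = 1000.0 * np.abs(np.asarray(magA) - np.asarray(magB))
    return bool(np.all(diffMmag < tolMmag))
```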
As for performance, the following sacct output (prepended with the table format) reveals typical speed-ups by factors ranging from ~1.5 (visitAnalysis) to ~4 (colorAnalysis), and maximum memory usage decreased by factors ranging from ~2 (visitAnalysis) to ~10 (colorAnalysis), so fairly significant wins all around!