Fix Version/s: None
Sprint:DRP S21a (Dec Jan), DRP S21b
Team:Data Release Production
In the gen 2 version of pipe_analysis a number of plots across the scripts plot values at their sky positions. In this ticket we make the ones for color_analysis.
Also included is an example pipeline for testing and demonstration purposes.
A lot of this package is still in the development stages and I am expecting to make changes as more is added to the package to ensure cohesion.
The incantation that runs this:
pipetask run -b /project/hsc/gen3repo/rc2w42_ssw46/butler.yaml -o u/sr525/colorAnalysisTest2 -p skyPlotTest.yaml -d "instrument='HSC' and tract IN (9813) and skymap='hsc_rings_v1'" --register-dataset-types --replace-run --prune-replaced=purge
What worked for me to test this was the following:
pipetask run -b /project/hsc/gen3repo/rc2w02_ssw03/butler.yaml -i u/sr525/colorAnalysisTest3 -o u/tmorton/skyPlotTest -p skyPlotExample.yaml -d "instrument='HSC' and tract=9813 and skymap='hsc_rings_v1' and band='r'" --register-dataset-types
My review feedback here is mostly focused on some high-level, “zoom-out” points. I understand that "code review" is not supposed to be "design review"; nevertheless, my main comments are indeed design-like comments.
- I think the stack shouldn’t have multiple different “functor” implementations. If it is crucial to make functors configurable, I think we should make them configurable in pipe.tasks.functors, rather than create a separate analysis_drp world of functors. This presumably connects to the currently unused configStructField.py and plans for that.
- It would be good to make use of the fact that the dataframe storage format is parquet---that is, not to read in whole tract-level tables of ~100 or more columns for every plot, and then only use a handful of those columns. To do this, you should be able to use the `deferLoad=True` keyword in the Input, and then treat that input connection as a DeferredDatasetHandle, requiring an explicit .get(columns=columns) when you know what columns you need (see details on the GH review comments). This also could play nicely with using pipe.tasks.Functors, which can take DeferredDatasetHandles as arguments, if desired (or the necessary columns can be explicitly collected from the functors before loading, since functors know what columns they will need).
- I like the idea of the “Selector”; I think this should be an extension of pipe.tasks.Functor (again, details and some more specific suggestions in GH review).
- Possibly be wary of “over-configuration”? An example is making the x and y columns able to be arbitrary computations for SkyPlotTask. In what use case would this be helpful? In my expectation, these tasks will mostly be used by making a big standard pipeline to make all the plots we want to typically inspect, and then not really touching that often at all, so it’s not clear to me yet where the balance of allowing-for-user-customization-possibilities vs. direct simplicity should be.
- The analysis_drp plotting tasks should look forward to a world where there also exists on-demand dashboard plotting, drill-down exploration dashboard, etc. To this end, I really like that the actual functions that do the plotting only fundamentally require an input dataframe and some simple arguments. I would actually suggest even further code factorization that puts the plotting functions themselves into separate functions that could be called from outside the task, and have the task methods call those functions. Having experimented with this suggestion already, it seems like this can be done relatively easily after the fact, so no need to do this factorization in this ticket, though perhaps good to keep this idea in mind.
I make a lot of fairly specific suggestions in the GH review comments related to these points. While I suppose it would be OK to forge ahead without hashing out the details of the functor and DeferredDatasetHandle issues, it seems like tackling these ideas early on in this first plotting task is better than trying to go back and change things later if we eventually decide we want to.
Left comments in GH review (and repeating some of my comments here). Generally looks good. Biggest zoom-out point is that a lot of the pandas usage here doesn't really follow pandas idioms, resulting in I think a lot of extra code. I also don't super-like how the stars/galaxies thing is handled with 1's and 2's...I think I understand why you did it, and I don't think there's anything actually problematic with it, but it seems like it could be a lot cleaner and clearer to just have an action that returns a column with 'star' and 'galaxy' labels, and add this column to your catPlot. And then a bunch of smaller things.
The plots replicating the gen 2 color analysis plots are included here. This code is designed to be generic though so the same task can be reused easily in many different pipelines.