My review feedback here is mostly focused on some high-level, “zoom-out” points. I understand that "code review" is not supposed to be "design review"; nevertheless, my main comments are indeed design-like comments.
- I think the stack shouldn’t have multiple different “functor” implementations. If it is crucial to make functors configurable, I think we should make them configurable in pipe.tasks.functors, rather than create a separate analysis_drp world of functors. This presumably connects to the currently unused configStructField.py and plans for that.
- It would be good to make use of the fact that the dataframe storage format is parquet---that is, not to read in whole tract-level tables of ~100 or more columns for every plot, and then only use a handful of those columns. To do this, you should be able to use the `deferLoad=True` keyword in the Input, and then treat that input connection as a DeferredDatasetHandle, requiring an explicit .get(columns=columns) when you know what columns you need (see details on the GH review comments). This also could play nicely with using pipe.tasks.Functors, which can take DeferredDatasetHandles as arguments, if desired (or the necessary columns can be explicitly collected from the functors before loading, since functors know what columns they will need).
- I like the idea of the “Selector”; I think this should be an extension of pipe.tasks.Functor (again, details and some more specific suggestions in GH review).
- Possibly be wary of “over-configuration”? An example is making the x and y columns able to be arbitrary computations for SkyPlotTask. In what use case would this be helpful? In my expectation, these tasks will mostly be used by making a big standard pipeline to make all the plots we want to typically inspect, and then not really touching that often at all, so it’s not clear to me yet where the balance of allowing-for-user-customization-possibilities vs. direct simplicity should be.
- The analysis_drp plotting tasks should look forward to a world where there also exists on-demand dashboard plotting, drill-down exploration dashboard, etc. To this end, I really like that the actual functions that do the plotting only fundamentally require an input dataframe and some simple arguments. I would actually suggest even further code factorization that puts the plotting functions themselves into separate functions that could be called from outside the task, and have the task methods call those functions. Having experimented with this suggestion already, it seems like this can be done relatively easily after the fact, so no need to do this factorization in this ticket, though perhaps good to keep this idea in mind.
I make a lot of fairly specific suggestions in the GH review comments related to these points. While I suppose it would be OK to forge ahead without hashing out the details of the functor and DeferredDatasetHandle issues, it seems like tackling these ideas early on in this first plotting task is better than trying to go back and change things later if we eventually decide we want to.
The plots replicating the gen 2 color analysis plots are included here. This code is designed to be generic though so the same task can be reused easily in many different pipelines.