I've run some tests on both removing DISTINCT and switching to materialized views. All times are relative to the ~17m43s baseline of using the functionality from both
DM-19080 and DM-19851.
- Removing DISTINCT from all views: 33m29s.
- Removing DISTINCT from just exposure_calibration_label_join: 16m28s.
- Removing DISTINCT from just visit_detector_patch_join: 62m59s.
- Removing DISTINCT from exposure_calibration_label_join, materializing all other views (with DISTINCT): 17m11s.
I'm not sure all of the time differences here are significant (I think the uncertainty is at least +/- 30s), but it is clear that removing the DISTINCT from visit_detector_patch_join does cause problems - not in correctness, but in performance. And I think I understand why; without DISTINCT, visit_detector_patch_join has a lot of duplication (4642107 vs. 2957062 rows), and that corresponds pretty directly to a lot of duplicate output rows from the big query - so even if the plan is better, it's a lot more to return (and then process in Python). I also think it's an excellent candidate for a materialized view, as the tables it's based on will change pretty infrequently and in a very predictable way (in our current workflow, they're all entirely fixed after the bootstrapping process). So I'm a bit surprised that materializing it (and the other spatial-join views) didn't produce much in the way of an improvement - but maybe the point is just that this opens the door to other optimizations (more indexes?) we couldn't have performed if they were regular views.
On the other hand, I do think I've convinced myself that the DISTINCT in exposure_calibration_label_join is simply unnecessary, and given how it appears I do agree it'd be best to rewrite the SQL that references it to use the base tables instead. I'll take a look at that this week. But if we can determine independently that removing DISTINCT from that view does let the optimizer rewrite the big query but that doesn't actually help, that'd be good to know early so I don't waste time.
Finally, getting back to the spatial join views like visit_detector_patch_join, after seeing the row counts there, I think what we've really got to do is just shrink that. A back-of-the-envelope calculation says that the actual number of visit+detector and patch overlaps could be as small as 80k-160k (so a factor of 20-40 smaller than even the number of rows with DISTINCT) with different configuration. What's going on here is that we're rasterizing the visit+detector regions and the patch regions onto a common spherical pixelization (a "skypix" is a pixel in that rasterization) in order to identify which regions might overlap (and then filtering down to which actually do in Python when processing result rows). But that rasterization is just too coarse, so we're storing way too many possible overlaps that really aren't. Unfortunately we can't just change that configuration parameter without doing
DM-13990 first, but that might really help - though then I think we would really need a materialized DISTINCT view to take advantage of it, because then there would be an even bigger difference between the DISTINCT view and the view without DISTINCT.
Longer term, there are even more options: we could start using native DB spatial indexes, to the extent those are supported by the DBs we want to support (and regardless that will certainly involve lots of DB-specific solutions), or we could look for ways to populate a truly minimal visit_detector_patch_join table directly from the true overlaps - populating that table directly would be a lot like implementing a materialized view from Python (unless we can get the spherical region predictions into database UDFs, I suppose), and hence would add significant complexity there, but it may be worth it nonetheless.