Fix Version/s: None
Team:Data Release Production
Quoting Christopher Stephens [X] on
By the way, after looking at the quantum SQL a little more closely yesterday, I don't expect (but do hope for) big improvements to execution time until we do one of the following:
- materialize the views used in the SQL
- remove the distinct from the view definition and confirm Oracle properly rewrites the SQL for an efficient plan
- remove references to the views and rewrite the generated SQL to reference the base tables (my preference since it will make further optimizations easier though I understand others wanting to try one of the other approaches first)
Materializing the views is the hardest of these, as it would require much more complex logic in the code that updates the underlying tables.
Removing the DISTINCTs from view definitions is quite easy; I'll do that first on this ticket for some timing tests, and proceed to removing use of the views in the query generation - that's a bit trickier, but certainly still doable. I think some of the DISTINCTs are completely unnecessary (including what may be the most important one, in exposure_calibration_label_join), and ones that aren't actually shouldn't be appearing in the big query we're running at all - and if they appear in similar queries in the future, I think removing the DISTINICTs from them would only change the query results in a way the result-processing code wouldn't notice.
- is contained by
DM-17023 Refactor the Dimensions and query system
I've run some tests on both removing DISTINCT and switching to materialized views. All times are relative to the ~17m43s baseline of using the functionality from both
DM-19080 and DM-19851.
- Removing DISTINCT from all views: 33m29s.
- Removing DISTINCT from just exposure_calibration_label_join: 16m28s.
- Removing DISTINCT from just visit_detector_patch_join: 62m59s.
- Removing DISTINCT from exposure_calibration_label_join, materializing all other views (with DISTINCT): 17m11s.
I'm not sure all of the time differences here are significant (I think the uncertainty is at least +/- 30s), but it is clear that removing the DISTINCT from visit_detector_patch_join does cause problems - not in correctness, but in performance. And I think I understand why; without DISTINCT, visit_detector_patch_join has a lot of duplication (4642107 vs. 2957062 rows), and that corresponds pretty directly to a lot of duplicate output rows from the big query - so even if the plan is better, it's a lot more to return (and then process in Python). I also think it's an excellent candidate for a materialized view, as the tables it's based on will change pretty infrequently and in a very predictable way (in our current workflow, they're all entirely fixed after the bootstrapping process). So I'm a bit surprised that materializing it (and the other spatial-join views) didn't produce much in the way of an improvement - but maybe the point is just that this opens the door to other optimizations (more indexes?) we couldn't have performed if they were regular views.
On the other hand, I do think I've convinced myself that the DISTINCT in exposure_calibration_label_join is simply unnecessary, and given how it appears I do agree it'd be best to rewrite the SQL that references it to use the base tables instead. I'll take a look at that this week. But if we can determine independently that removing DISTINCT from that view does let the optimizer rewrite the big query but that doesn't actually help, that'd be good to know early so I don't waste time.
Finally, getting back to the spatial join views like visit_detector_patch_join, after seeing the row counts there, I think what we've really got to do is just shrink that. A back-of-the-envelope calculation says that the actual number of visit+detector and patch overlaps could be as small as 80k-160k (so a factor of 20-40 smaller than even the number of rows with DISTINCT) with different configuration. What's going on here is that we're rasterizing the visit+detector regions and the patch regions onto a common spherical pixelization (a "skypix" is a pixel in that rasterization) in order to identify which regions might overlap (and then filtering down to which actually do in Python when processing result rows). But that rasterization is just too coarse, so we're storing way too many possible overlaps that really aren't. Unfortunately we can't just change that configuration parameter without doing
DM-13990 first, but that might really help - though then I think we would really need a materialized DISTINCT view to take advantage of it, because then there would be an even bigger difference between the DISTINCT view and the view without DISTINCT.
Longer term, there are even more options: we could start using native DB spatial indexes, to the extent those are supported by the DBs we want to support (and regardless that will certainly involve lots of DB-specific solutions), or we could look for ways to populate a truly minimal visit_detector_patch_join table directly from the true overlaps - populating that table directly would be a lot like implementing a materialized view from Python (unless we can get the spherical region predictions into database UDFs, I suppose), and hence would add significant complexity there, but it may be worth it nonetheless.
Oracle would manage the materialization. "CREATE MATERIALIZED VIEW ... ON COMMIT REFRESH;" vs "CREATE VIEW".
12c also offers "real time" materialized views which allows SQL to make use of a stale materialized view + accounting for any missing/outdated data in base tables. I'm unsure about the requirements for that but it's an option in case ON COMMIT REFRESH is too much overhead but the MV is otherwise very useful.
If the DISTINCTs aren't absolutely necessary, we should get rid of them regardless.