Details
-
Type:
Story
-
Status: Done
-
Resolution: Done
-
Fix Version/s: None
-
Component/s: daf_butler
-
Labels:
-
Story Points:2
-
Epic Link:
-
Team:Data Release Production
-
Urgent?:No
Description
As first reported in DM-29453, an sqlite update to query flattening when subqueries have UNION ALL turned into a giant performance regression, which led to us pinning an old version of sqlite (< 3.35).
From Kian-Tat Lim the "easiest" way to get a repo to exhibit this problem is to run validation_data_hsc via https://github.com/lsst/faro/blob/master/bin/hsc_gen2_to_gen3.sh and then pipetask --long-log --log-level=sqlalchemy.engine=DEBUG run -j 1 -b validation_hsc_gen3/butler.yaml --register-dataset-types -p $FARO_DIR/pipelines/validate_drp_metrics_pipeline.yaml -d "" -o jenkins/validate_drp_metrics_all -i shared/valid_hsc_all,skymaps.
I ran several combinations of this query using sqlite 3.34.0 (our current pin) and sqlite 3.36.0 (the latest version). And as a first (successful!) test I essentially reverted https://github.com/lsst/daf_butler/commit/6c0e9abb3680c38da9f640b9dff38fa444eee1a7 putting back UNION instead of UNION ALL to avoid these terrifying sqlite code paths. A few of the tests I ran a couple times just to make sure it wasn't just load on lsst-devl01 or something. And here's what I found:
3.34 + UNION ALL: 7.98 min
|
3.36 + UNION ALL: 40 min
|
3.34 + UNION: 7.5 min; 7.5 min
|
3.36 + UNION: 7.2 min; 7.3 min
|
The next question was whether this was having an effect on postgres as well. Brock Brendal [X] was having trouble getting the following quantum graph query to finish within a week: pipetask qgraph -d "instrument='HSC' AND skymap='hsc_rings_v1' AND tract IN (9615, 9697, 9813)" -b /repo/main/butler.yaml -i HSC/RC2/defaults --output HSC/runs/RC2/w_2021_30/DM-31182 --output-run HSC/runs/RC2/w_2021_30/DM-31182/20210809T192340Z -p /software/lsstsw/stack_20210520/stack/miniconda3-py38_4.9.2-0.6.0/Linux64/obs_subaru/22.0.1-16-gf4dc08de+cb858f8305/pipelines/DRP.yaml#healSparsePropertyMaps,consolidateObjectTable,diffimDRP -q /scratch/brendal4/bps-gen3-rc2/submit/HSC/runs/RC2/w_2021_30/DM-31182/20210809T192340Z/HSC_runs_RC2_w_2021_30_DM-3118220210809T192340Z.qgraph --qgraph-dot /scratch/brendal4/bps-gen3-rc2/submit/HSC/runs/RC2/w_2021_30/DM-31182/20210809T192340Z/HSC_runs_RC2_w_2021_30DM-31182_20210809T192340Z.qgraph.dot
Turns out that replacing UNION ALL with UNION gets this query to run in 4.9 hours, rather than ... over a week and counting.
It looks like these UNIONs get used in various JOINs in the large query, so it's possible that what I believe to be the original goal of trying to optimize the UNION part by using ALL when we know there are no duplicates (so the database can just copy rows without an explicit de-duplication check or pass) is actually making the JOIN part worse by not allowing the database optimizer to assume that there are no duplicates.