As first reported in
DM-29453, an sqlite update to query flattening when subqueries have UNION ALL turned into a giant performance regression, which led to us pinning an old version of sqlite (< 3.35).
From Kian-Tat Lim the "easiest" way to get a repo to exhibit this problem is to run validation_data_hsc via https://github.com/lsst/faro/blob/master/bin/hsc_gen2_to_gen3.sh and then pipetask --long-log --log-level=sqlalchemy.engine=DEBUG run -j 1 -b validation_hsc_gen3/butler.yaml --register-dataset-types -p $FARO_DIR/pipelines/validate_drp_metrics_pipeline.yaml -d "" -o jenkins/validate_drp_metrics_all -i shared/valid_hsc_all,skymaps.
I ran several combinations of this query using sqlite 3.34.0 (our current pin) and sqlite 3.36.0 (the latest version). And as a first (successful!) test I essentially reverted https://github.com/lsst/daf_butler/commit/6c0e9abb3680c38da9f640b9dff38fa444eee1a7 putting back UNION instead of UNION ALL to avoid these terrifying sqlite code paths. A few of the tests I ran a couple times just to make sure it wasn't just load on lsst-devl01 or something. And here's what I found:
The next question was whether this was having an effect on postgres as well. Brock Brendal [X] was having trouble getting the following quantum graph query to finish within a week: pipetask qgraph -d "instrument='HSC' AND skymap='hsc_rings_v1' AND tract IN (9615, 9697, 9813)" -b /repo/main/butler.yaml -i HSC/RC2/defaults --output HSC/runs/RC2/w_2021_30/
DM-31182 --output-run HSC/runs/RC2/w_2021_30/ DM-31182/20210809T192340Z -p /software/lsstsw/stack_20210520/stack/miniconda3-py38_4.9.2-0.6.0/Linux64/obs_subaru/22.0.1-16-gf4dc08de+cb858f8305/pipelines/DRP.yaml#healSparsePropertyMaps,consolidateObjectTable,diffimDRP -q /scratch/brendal4/bps-gen3-rc2/submit/HSC/runs/RC2/w_2021_30/ DM-31182/20210809T192340Z/HSC_runs_RC2_w_2021_30_ DM-3118220210809T192340Z.qgraph --qgraph-dot /scratch/brendal4/bps-gen3-rc2/submit/HSC/runs/RC2/w_2021_30/ DM-31182/20210809T192340Z/HSC_runs_RC2_w_2021_30 DM-31182_20210809T192340Z.qgraph.dot
Turns out that replacing UNION ALL with UNION gets this query to run in 4.9 hours, rather than ... over a week and counting.