Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-31429

UNION ALL in subqueries is causing catastrophic database pessimizations

    XMLWordPrintable

    Details

      Description

      As first reported in DM-29453, an sqlite update to query flattening when subqueries have UNION ALL turned into a giant performance regression, which led to us pinning an old version of sqlite (< 3.35).

      From Kian-Tat Lim the "easiest" way to get a repo to exhibit this problem is to run validation_data_hsc via https://github.com/lsst/faro/blob/master/bin/hsc_gen2_to_gen3.sh and then pipetask --long-log --log-level=sqlalchemy.engine=DEBUG run -j 1 -b validation_hsc_gen3/butler.yaml --register-dataset-types -p $FARO_DIR/pipelines/validate_drp_metrics_pipeline.yaml -d "" -o jenkins/validate_drp_metrics_all -i shared/valid_hsc_all,skymaps.
      I ran several combinations of this query using sqlite 3.34.0 (our current pin) and sqlite 3.36.0 (the latest version). And as a first (successful!) test I essentially reverted https://github.com/lsst/daf_butler/commit/6c0e9abb3680c38da9f640b9dff38fa444eee1a7 putting back UNION instead of UNION ALL to avoid these terrifying sqlite code paths. A few of the tests I ran a couple times just to make sure it wasn't just load on lsst-devl01 or something. And here's what I found:

      3.34 + UNION ALL: 7.98 min
      3.36 + UNION ALL: 40 min
      3.34 + UNION: 7.5 min; 7.5 min
      3.36 + UNION: 7.2 min; 7.3 min
      

      The next question was whether this was having an effect on postgres as well. Brock Brendal [X] was having trouble getting the following quantum graph query to finish within a week: pipetask qgraph -d "instrument='HSC' AND skymap='hsc_rings_v1' AND tract IN (9615, 9697, 9813)" -b /repo/main/butler.yaml -i HSC/RC2/defaults --output HSC/runs/RC2/w_2021_30/DM-31182 --output-run HSC/runs/RC2/w_2021_30/DM-31182/20210809T192340Z -p /software/lsstsw/stack_20210520/stack/miniconda3-py38_4.9.2-0.6.0/Linux64/obs_subaru/22.0.1-16-gf4dc08de+cb858f8305/pipelines/DRP.yaml#healSparsePropertyMaps,consolidateObjectTable,diffimDRP -q /scratch/brendal4/bps-gen3-rc2/submit/HSC/runs/RC2/w_2021_30/DM-31182/20210809T192340Z/HSC_runs_RC2_w_2021_30_DM-3118220210809T192340Z.qgraph --qgraph-dot /scratch/brendal4/bps-gen3-rc2/submit/HSC/runs/RC2/w_2021_30/DM-31182/20210809T192340Z/HSC_runs_RC2_w_2021_30DM-31182_20210809T192340Z.qgraph.dot

      Turns out that replacing UNION ALL with UNION gets this query to run in 4.9 hours, rather than ... over a week and counting.

        Attachments

          Activity

          Hide
          ktl Kian-Tat Lim added a comment -

          It looks like these UNIONs get used in various JOINs in the large query, so it's possible that what I believe to be the original goal of trying to optimize the UNION part by using ALL when we know there are no duplicates (so the database can just copy rows without an explicit de-duplication check or pass) is actually making the JOIN part worse by not allowing the database optimizer to assume that there are no duplicates.

          Show
          ktl Kian-Tat Lim added a comment - It looks like these UNIONs get used in various JOINs in the large query, so it's possible that what I believe to be the original goal of trying to optimize the UNION part by using ALL when we know there are no duplicates (so the database can just copy rows without an explicit de-duplication check or pass) is actually making the JOIN part worse by not allowing the database optimizer to assume that there are no duplicates.
          Show
          erykoff Eli Rykoff added a comment - Jenkins is here: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/34776/pipeline PR is here: https://github.com/lsst/daf_butler/pull/560
          Hide
          ktl Kian-Tat Lim added a comment -

          The change, comment describing why it's there, and change note look fine. Assuming Jenkins passes, this is OK to merge.

          Show
          ktl Kian-Tat Lim added a comment - The change, comment describing why it's there, and change note look fine. Assuming Jenkins passes, this is OK to merge.

            People

            Assignee:
            erykoff Eli Rykoff
            Reporter:
            erykoff Eli Rykoff
            Reviewers:
            Kian-Tat Lim
            Watchers:
            Eli Rykoff, Kian-Tat Lim
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.