Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-31583

Add interfaces to dataset summary table data and use it to improve query generation

    XMLWordPrintable

    Details

      Description

      Butler has been storing summaries of which dataset types and governor dimension values are present in each collection since DM-24939, which was merged almost a year ago in order to make it into the first stable butler schema.  But we still haven't actually used those summaries as intended.  On this ticket, I'll:

      • Provide some public interfaces for querying these summaries.  These will have both a fast path that will use the summary tables and a slow path that will run a new SELECT DISTINCT or SELECT with GROUP BY; we cannot guarantee that summary entries will be deleted when datasets are deleted, so while inaccuracies are rare, we need to be clear about what they represent (i.e. "this dataset type may be present in this collection").
      • Update the logic that generates subqueries for datasets to skip combinations that are known from the summary tables to have no matches.  This should dramatically reduce the number of UNION clauses we have in subqueries, in most cases.

      Open questions include

      • Where to put the new interfaces.  I'm leaning towards new methods on DatasetQueryResults, so the usage would be something like

      >>> registry.queryDatasets("calexp", collections="HSC/raw/all").any()
      False
      >>> registry.queryDatasets(..., collections="HSC/raw/all", fast=True).dataset_types().names
      {"raw"}

      • When to fetch the summary data (at Butler startup vs. as needed) and whether to cache it.

        Attachments

          Issue Links

            Activity

            Hide
            erykoff Eli Rykoff added a comment - - edited

            I took a look at the query generated by this ticket for my validation_data_hsc sqlite query that started the ball on DM-31548. And I can confirm that (a) the generated queries look a lot simpler with no UNION s (so good!) but (b) without the index that test query is still slow. With the index the query is fast.

            Show
            erykoff Eli Rykoff added a comment - - edited I took a look at the query generated by this ticket for my validation_data_hsc sqlite query that started the ball on DM-31548 . And I can confirm that (a) the generated queries look a lot simpler with no UNION s (so good!) but (b) without the index that test query is still slow. With the index the query is fast.
            Hide
            jbosch Jim Bosch added a comment -

            Thanks for the quick check.  The result that this doesn't remove the need for DM-31548 doesn't surprise me, but I figure it wasn't out of the question (though even if it did fix some such queries, I'd still think that new index was a good idea anyway).

            Merging tonight looks less likely, because some of my changes broke pipe_base, and while I'm pretty sure I could fix that quickly, it's a sign that it's better to take my time and do this well.  In particular, fixing pipe_base naively also removes a bit of diagnostic information that can be useful in debugging empty QGs, and with a bit more work on this ticket I can instead provide a lot more information for debugging empty QGs, so I'm going to do that.

            Show
            jbosch Jim Bosch added a comment - Thanks for the quick check.  The result that this doesn't remove the need for DM-31548 doesn't surprise me, but I figure it wasn't out of the question (though even if it did fix some such queries, I'd still think that new index was a good idea anyway). Merging tonight looks less likely, because some of my changes broke pipe_base, and while I'm pretty sure I could fix that quickly, it's a sign that it's better to take my time and do this well.  In particular, fixing pipe_base naively also removes a bit of diagnostic information that can be useful in debugging empty QGs, and with a bit more work on this ticket I can instead provide a lot more information for debugging empty QGs, so I'm going to do that.
            Hide
            jbosch Jim Bosch added a comment -

            Andy Salnikov, could you review these query-system improvements?

            Hopefully going commit-by-commit in daf_butler will make the story more clear.  Changes in pipe_base and ctrl_mpexec are small.

            There are two known issues here:

            • a YAML lint failure in daf_butler that I need to fix (see #dm-middleware-dev if you're curious, but otherwise don't worry about it - I'll figure out something);
            • the ctrl_mpexec change is undesirable on its own, though I think it's an acceptable tradeoff for the other improvements, especially if we plan to switch everybody to execution butler soon.  I'll raise this at the middleware meeting so more people can weigh in on it.

             

            Show
            jbosch Jim Bosch added a comment - Andy Salnikov , could you review these query-system improvements? Hopefully going commit-by-commit in  daf_butler will make the story more clear.  Changes in  pipe_base and  ctrl_mpexec are small. There are two known issues here: a YAML lint failure in daf_butler that I need to fix (see #dm-middleware-dev if you're curious, but otherwise don't worry about it - I'll figure out something); the ctrl_mpexec change is undesirable on its own, though I think it's an acceptable tradeoff for the other improvements, especially if we plan to switch everybody to execution butler soon.  I'll raise this at the middleware meeting so more people can weigh in on it.  
            Hide
            jbosch Jim Bosch added a comment - - edited

            I believe I've addressed both known issues:

            • I've updated the configuration for our YAML linter.
            • I've removed the ctrl_mpexec branch and PR, and instead rebased-away the daf_butler commit that makes findDataset shortcut via the summary caches.

            I'm still running Jenkins to check ci_hsc_gen3 in particular, which is what caught the problem with stale caches in execution originally: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/34984/pipeline

            Show
            jbosch Jim Bosch added a comment - - edited I believe I've addressed both known issues: I've updated the configuration for our YAML linter. I've removed the ctrl_mpexec branch and PR, and instead rebased-away the daf_butler commit that makes findDataset shortcut via the summary caches. I'm still running Jenkins to check ci_hsc_gen3 in particular, which is what caught the problem with stale caches in execution originally: https://ci.lsst.codes/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/34984/pipeline
            Hide
            salnikov Andy Salnikov added a comment - - edited

            Looks good, though I'm not sure I understood every detail, of course, it's quite broad scope of changes. (and now we are definitely doomed )

            Show
            salnikov Andy Salnikov added a comment - - edited Looks good, though I'm not sure I understood every detail, of course, it's quite broad scope of changes. (and now we are definitely doomed )

              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Andy Salnikov
              Watchers:
              Andy Salnikov, Eli Rykoff, Jim Bosch, Lee Kelvin, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.