Data Management / DM-31769

Add user control over dataset constraints in QG generation queries


    Details

      Description

      We've come across a new QG generation problem that's catastrophically slow, as detailed on this thread.  Unfortunately neither DM-31548 nor DM-31583 resolve it, though they both help a bit, and I believe it's an important one for DRP and maybe DP0.2.

      The slowdown is in the big initial query, which in this case involves both a very constraining tract expression and a very large number of dataset subqueries (exacerbated by the large number of collections being searched); we need the query planner to start with that tract constraint and only check for dataset existence within each tract.  I can manually rewrite the query to force essentially that behavior (or something like it; the 40-minute query I get back is still much slower than I'd like, but it's tolerable for now, and I'm bad at reading query plans).  But doing that kind of rewrite all the time would make other QG generation queries catastrophically slow, because sometimes it's a dataset/collection subquery that has the most constraining power.
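To make the rewrite concrete, here is a toy, self-contained illustration of the difference between a join-heavy query and one driven by a selective constraint with per-row EXISTS checks. The two-table schema and the use of sqlite3 are purely illustrative assumptions, not the real registry schema or backend:

```python
import sqlite3

# Invented toy schema loosely mirroring the situation described above:
# a dimension table of tracts and a table of datasets keyed by tract
# and dataset type.  Not the real butler/registry schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tract (id INTEGER PRIMARY KEY);
    CREATE TABLE dataset (tract_id INTEGER, dataset_type TEXT);
""")
conn.executemany("INSERT INTO tract VALUES (?)", [(i,) for i in range(100)])
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [(i, t) for i in range(100) for t in ("calexp", "skyMap") if i % 2 == 0],
)

# Join-heavy form: every dataset type contributes a subquery/join, and the
# planner may choose to evaluate those before applying the tract cut.
join_form = conn.execute("""
    SELECT DISTINCT tract.id
    FROM tract
    JOIN dataset AS d1 ON d1.tract_id = tract.id AND d1.dataset_type = 'calexp'
    JOIN dataset AS d2 ON d2.tract_id = tract.id AND d2.dataset_type = 'skyMap'
    WHERE tract.id < 10
    ORDER BY tract.id
""").fetchall()

# Rewritten form: start from the very selective tract constraint and only
# check dataset existence within each tract via correlated EXISTS subqueries.
exists_form = conn.execute("""
    SELECT tract.id
    FROM tract
    WHERE tract.id < 10
      AND EXISTS (SELECT 1 FROM dataset
                  WHERE tract_id = tract.id AND dataset_type = 'calexp')
      AND EXISTS (SELECT 1 FROM dataset
                  WHERE tract_id = tract.id AND dataset_type = 'skyMap')
    ORDER BY tract.id
""").fetchall()

assert join_form == exists_form  # same answer, potentially very different plans
```

The two queries are logically equivalent; the point is that the second form hands the planner an obvious starting point, which is exactly the behavior the manual rewrite tries to force.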

      So it seems we really need levers for users to control which dataset types (if any) enter that big initial query.  Because we already perform follow-up queries for all input dataset types, there's no need for query rewriting; we "just" need to stop assuming that those follow-up queries always return results for all data IDs, probably by pruning the QG in Python after constructing it.  After consultation with Nate Lust, we'd like to give that a try, especially because:

      • it means this ticket shouldn't get blocked on butler query improvements that I might promise but never deliver;
      • it should get us some QG pruning support that could prove useful in other ways, such as in the long-promised DM-21904, which our conversation reaffirmed as the long-term goal (assuming I can actually deliver those butler query improvements).
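The pruning idea can be sketched with a tiny stand-alone model. Everything here (the `Quantum` class, `prune_graph`, the dataset-ID strings) is an invented illustration of the cascade, not the real pipe_base QuantumGraph API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantum:
    """Illustrative stand-in for a quantum: a task plus its dataset IDs."""
    task: str
    inputs: frozenset    # dataset IDs this quantum consumes
    outputs: frozenset   # dataset IDs this quantum would produce

def prune_graph(quanta, existing):
    """Drop quanta whose inputs can neither be found in ``existing`` nor be
    produced by a surviving upstream quantum, cascading downstream until a
    fixed point is reached (the graph is assumed to be a DAG)."""
    alive = list(quanta)
    while True:
        producible = set(existing)
        for q in alive:
            producible |= q.outputs
        survivors = [q for q in alive if q.inputs <= producible]
        if len(survivors) == len(alive):
            return survivors
        alive = survivors

# A follow-up existence query found raw.1 but not raw.2, so the quantum
# consuming raw.2 is pruned, and the coadd quantum downstream of it cascades.
existing = {"raw.1"}
q1 = Quantum("isr", frozenset({"raw.1"}), frozenset({"calexp.1"}))
q2 = Quantum("isr", frozenset({"raw.2"}), frozenset({"calexp.2"}))
q3 = Quantum("makeCoadd", frozenset({"calexp.1", "calexp.2"}),
             frozenset({"coadd.1"}))
kept = prune_graph([q1, q2, q3], existing)
assert kept == [q1]
```

The fixed-point loop is the key property: removing one quantum can invalidate the inputs of quanta further downstream, so pruning repeats until the surviving set is self-consistent.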

      Yusra AlSayyad and Tim Jenness, I hope you don't mind Nate Lust working on this next, as I think it's a pretty high priority.  I think it's probably a 2-3 week project (maybe much faster if the pruning is easy, but neither of us remembered the code structure well enough to be confident that it would be).

        Attachments

          Issue Links

            Activity

            jbosch Jim Bosch added a comment -

            We've been iterating on the PR for a while now; there are a couple of small things I'd like to see improved before merge, but I don't think I need to take another look to check that they are, unless you'd like me to.
            jbosch Jim Bosch added a comment -

            Second review pass complete. Minor points in the pipe_base and ctrl_mpexec PRs, but I'm hoping we can get away without replaceStorageClass as a method entirely, even if that punts a real fix to later.
            nlust Nate Lust added a comment -

            Tim Jenness, replying here instead of threading across multiple packages/comments.

            The way we use pipelines in full processing, we will know all the info by the end of pipeline processing. However, there is a public method that says "give me dataset types for this specific connections class", and that method does not know anything about any other context. So by default I made that method fail if you tried to use a component dataset type that the registry knew nothing about. Optionally you can choose to allow a placeholder that is local to that functionality in pipe_base, as it's really the only place where it is relevant.

            This method is called by the full graph builder, and all of those are resolved, but I can't discount that someone would want to use the other method somewhere. So either:

            A) we give up on that public method when a composite is unknown and have the full pipeline do its own thing

            B) use the daf_butler local placeholder but not have a replace method, and just construct a new dataset type

            C) leave it as is

            D) rename the daf_butler method to indicate it should be private and use it here

            E) revert to the old behavior
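The default-fail / opt-in-placeholder behavior described above can be illustrated in miniature. `resolve_dataset_type`, `PLACEHOLDER`, and the string-valued "registry" are invented names for illustration, not the real pipe_base/daf_butler API:

```python
# Sentinel standing in for a component whose parent type is unknown;
# local to this sketch, analogous to the placeholder discussed above.
PLACEHOLDER = object()

def resolve_dataset_type(name, registry_types, allow_placeholder=False):
    """Resolve ``name`` against known registry types.

    A component type like "calexp.wcs" needs its parent ("calexp") to be
    known.  By default, resolution fails for an unknown parent; callers
    with more context (e.g. a full-pipeline graph build, where the parent
    may be produced later) can opt into a placeholder instead.
    """
    parent = name.split(".", 1)[0]
    if parent in registry_types:
        return registry_types[parent]
    if allow_placeholder:
        return PLACEHOLDER
    raise KeyError(f"Component {name!r} has unknown parent {parent!r}")

known = {"calexp": "Exposure"}
assert resolve_dataset_type("calexp.wcs", known) == "Exposure"
assert resolve_dataset_type("deepCoadd.wcs", known,
                            allow_placeholder=True) is PLACEHOLDER
```

Options A-E above are then different answers to who owns `PLACEHOLDER` (pipe_base vs daf_butler) and whether the single-connections-class entry point allows it at all.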
            jbosch Jim Bosch added a comment -

            If you're saying that (A) lets us get rid of the old placeholder and the new placeholder, at the expense of making pipe.base.PipelineDatasetTypes only work when component datasets either already exist in the registry or are components of composites that are also in the pipeline's outputs (i.e. the case that GraphBuilder assumes), I vote for that.
            tjenness Tim Jenness added a comment -

            If you are going to need a placeholder at all, please use the daf_butler placeholder system as is. There is little point removing it from daf_butler but adding it back into pipe_base and loosening the constraints on changing storage classes of dataset types.
            nlust Nate Lust added a comment -

            I restructured things to remove all need for placeholders, a variant on (A). It does mean an extra iteration over the tasks in a pipeline, but that is such a small number that I can't see how it would matter.
            yusra Yusra AlSayyad added a comment -

            Note re the backport request: because this is a big change, let's wait until it has made it through a DC2/RC2 reprocessing before backporting.
            nlust Nate Lust added a comment -

            This feature should not make any difference if you don't use it, but I don't mind either way. It was discussed in the middleware meeting that this ticket would make a good test case for how easy backports would be.

              People

              Assignee:
              nlust Nate Lust
              Reporter:
              jbosch Jim Bosch
              Reviewers:
              Jim Bosch
              Watchers:
              Eli Rykoff, Jim Bosch, Lee Kelvin, Meredith Rawls, Nate Lust, Tim Jenness, Yusra AlSayyad

