Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-17496

QuantumGraph generation hits SQLite join limit

    Details

      Description

      Nate Lust reports that trying to build a single QuantumGraph for all of the PipelineTasks in ci_hsc runs into SQLite's hard-coded maximum of 64 joins a single query.  I can think of several ways to avoid this, many involving temporary tables (I know those work differently in Oracle; I think my ideas are compatible with that).

      For the demo we'll plan on just running the pipelines in three stages, for which QuantumGraph generation works fine.  If everything else is going extremely well early in the week I may take a stab at resolving this, too.

        Attachments

          Issue Links

            Activity

            Hide
            jbosch Jim Bosch added a comment - - edited

            Simplest approach seems to have done the trick: I've removed all output datasets from the big selectDimensions query, since they didn't constrain it anyway, and we were already not relying on that query returning dataset_ids for them.

            After that, some profiling revealed that the slow overall speed of preflight (12 minutes for ci_hsc!) was not being spent in that query - it was in slow, mostly Python data ID manipulations afterwards.  All but the first commit on this branch are optimizations for that.  It's down to 5 minutes now, and while I have ideas on how to go further, I think they're out of scope for this ticket.

            Andy Salnikov, could you look at the first commit?

            Nate Lust, could you look at the rest?

            PR is: https://github.com/lsst/daf_butler/pull/124

             

            Show
            jbosch Jim Bosch added a comment - - edited Simplest approach seems to have done the trick: I've removed all output datasets from the big selectDimensions query, since they didn't constrain it anyway, and we were already not relying on that query returning dataset_ids for them. After that, some profiling revealed that the slow overall speed of preflight (12 minutes for ci_hsc!) was not being spent in that query - it was in slow, mostly Python data ID manipulations afterwards.  All but the first commit on this branch are optimizations for that.  It's down to 5 minutes now, and while I have ideas on how to go further, I think they're out of scope for this ticket. Andy Salnikov , could you look at the first commit? Nate Lust , could you look at the rest? PR is: https://github.com/lsst/daf_butler/pull/124  
            Hide
            salnikov Andy Salnikov added a comment -

            GraphBuilder uses output dataset_ids for two purposes - either filter out quanta with all existing outputs from a graph or validate that outputs do not already exist. Setting all output dataset_ids to None will just disable that functionality, if we want to keep it we'd need to do some extra work in GraphBuilder or in pre-flight (which will probably slow it down again).

            I looked at the preflight commit, looks OK, but I would probably keep _buildDatasetSubquery() unchanged for now, just in case we'd want to reuse it for something else (subsets?).

            Show
            salnikov Andy Salnikov added a comment - GraphBuilder uses output dataset_ids for two purposes - either filter out quanta with all existing outputs from a graph or validate that outputs do not already exist. Setting all output dataset_ids to None will just disable that functionality, if we want to keep it we'd need to do some extra work in GraphBuilder  or in pre-flight (which will probably slow it down again). I looked at the preflight commit, looks OK, but I would probably keep _buildDatasetSubquery() unchanged for now, just in case we'd want to reuse it for something else (subsets?).
            Hide
            jbosch Jim Bosch added a comment -

            Glad you caught that, Andy Salnikov.  I'll look into updating GraphBuilder accordingly and make sure I test that functionality before I merge.  I like the idea of testing for dataset existence late because it should work better for use cases that involve saving the QuantumGraph and reading it back later (not that I expect the database to change frequently in between, but it would be good to at least handle any unrecoverable inconsistencies gracefully).

            Show
            jbosch Jim Bosch added a comment - Glad you caught that, Andy Salnikov .  I'll look into updating GraphBuilder accordingly and make sure I test that functionality before I merge.  I like the idea of testing for dataset existence late because it should work better for use cases that involve saving the QuantumGraph and reading it back later (not that I expect the database to change frequently in between, but it would be good to at least handle any unrecoverable inconsistencies gracefully).
            Hide
            jbosch Jim Bosch added a comment -

            Andy Salnikov, I believe I've addressed your review comments and cleaned up the mess from the flurry of activity last week; please take another look.  In particular:

            • The daf_butler changes are now limited to just the removal of output datasets from the big selectDimensions query, which is now optional and configurable.  Even more importantly, I've updated _convertResultRows to query for output {{dataset_id}}s at that stage when it isn't done in the big query, so now the overall behavior is the same and the configuration option to defer the lookups can be seen as just an alternate implementation.
            • This also made it easy to ensure that the DatasetRefs returned by selectDimensions are always expanded to include components when they represent composites (for both implementation code paths).  I've consequently removed the workaround that previously did this in GraphBuilder in pipe_base instead.
            • The data ID optimization commits in daf_butler that were previously on this branch have been moved to DM-17611, and merged from there to master.  Those also required a change in skymap, which has also been merged to master - GitHub and Jira  present a slightly confusing view of what's happened there; but you can completely ignore the apparently-merged skymap PR for this ticket.
            Show
            jbosch Jim Bosch added a comment - Andy Salnikov , I believe I've addressed your review comments and cleaned up the mess from the flurry of activity last week; please take another look.  In particular: The daf_butler changes are now limited to just the removal of output datasets from the big  selectDimensions query, which is now optional and configurable.  Even more importantly, I've updated _convertResultRows to query for output {{dataset_id}}s at that stage when it isn't done in the big query, so now the overall behavior is the same and the configuration option to defer the lookups can be seen as just an alternate implementation. This also made it easy to ensure that the DatasetRefs returned by selectDimensions are always expanded to include components when they represent composites (for both implementation code paths).  I've consequently removed the workaround that previously did this in GraphBuilder in pipe_base instead. The data ID optimization commits in daf_butler that were previously on this branch have been moved to DM-17611 , and merged from there to master.  Those also required a change in skymap, which has also been merged to master - GitHub and Jira  present a slightly confusing view of what's happened there; but you can completely ignore the apparently-merged skymap PR for this ticket .
            Hide
            salnikov Andy Salnikov added a comment -

            I have approved both PRs, looks good to me, with few minor comments. Indeed new logic looks more straightforward, though performance penalty can be significant but that can be improved later.

            Nate Lust, please mark as reviewed once you are satisfied.

             

            Show
            salnikov Andy Salnikov added a comment - I have approved both PRs, looks good to me, with few minor comments. Indeed new logic looks more straightforward, though performance penalty can be significant but that can be improved later. Nate Lust , please mark as reviewed once you are satisfied.  

              People

              • Assignee:
                jbosch Jim Bosch
                Reporter:
                jbosch Jim Bosch
                Reviewers:
                Andy Salnikov, Nate Lust
                Watchers:
                Andy Salnikov, Jim Bosch, Nate Lust
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Summary Panel