  Data Management / DM-24734

Revive and profile RC2 QuantumGraph generation

    Details

    • Story Points:
      2
    • Epic Link:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      DM-24664 should have provided everything we need in terms of a starting data repository, but since we last ran on RC2 we've switched to YAML pipeline definitions, and neither the generic DRP one in pipe_tasks nor the custom one in ci_hsc_gen3 seems quite right. Ideally we'll be able either to make the pipe_tasks one viable with just instrument overrides or to define a generic HSC+DRP pipeline in RC2, but if necessary we can put one in the Gen3+RC2 script repo.

      After that, time and profile QG generation on RC2 and create tickets to optimize as needed.

        Attachments

          Issue Links

            Activity

            jbosch Jim Bosch added a comment -

            Some timing updates from work done so far:

            • Raw ingest for all 3 tracts, all 5 bands took 2h40m.
            • Define visits took about 10m.
            • Calibration products took about 1h30m (for the full HSC calib repo).
            • brightObjectMask ingestion took about 7m.
            • ref_cat ingestion took seconds.

            This was on SQLite with checksums enabled.  It also failed to put the deepCoadd_skyMap dataset, which then broke QG generation; I think I've fixed the code to do that, but haven't tested it, as I just patched the repo manually instead.
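            (For reference, the manual patch was essentially just a butler.put of the skymap object.  A minimal sketch, not the tested fix - the RUN collection and skymap name here are hypothetical, and it assumes the deepCoadd_skyMap dataset type and the skymap dimension record already exist in the repo:)

            ```python
            from lsst.daf.butler import Butler

            def put_skymap(repo, sky_map, skymap_name="hsc_rings_v1", run="skymaps"):
                """Put an already-constructed lsst.skymap SkyMap into an existing Gen3 repo."""
                butler = Butler(repo, run=run)  # writeable butler pointed at the RUN collection
                butler.put(sky_map, "deepCoadd_skyMap", skymap=skymap_name)
            ```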

            After that fix, my QuantumGraph generation run for 1 tract (all 5 bands) failed due to DM-24817, but at what should be very close to the end of the operation, so I think it's a reasonable approximation of the full time.  The debug logs I enabled were too fine-grained and blew past my screen history limit, so I don't have detailed numbers, but:

            • The initial big, complex registry query appeared to only take about 15m (more than ideal, but acceptable even if we can't improve it, I think - as long as the time is much shorter for much smaller graphs, such as single-patch ones).
            • The total time was 3h50m, and was dominated by follow-up processing that includes both a lot of Python logic and many smaller queries.  I'll need to do an actual profile to say more about the breakdown.

            While the total time is not what I'd consider acceptable, the fact that it's dominated by the follow-up processing is IMO very good news: I think we'll find a lot of opportunities for optimization there once we've profiled.

            My next steps are:

            • tweak the logging so I can get the timing I'm interested in without being swamped by per-data-ID log messages;
            • work around DM-24817, probably just by dropping that (last) problematic Task from the Pipeline for now;
            • run this again with the Python profiler (a rough sketch of the wrapping follows).
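            (The profiler wrapping in that last step is roughly the following sketch; build_quantum_graph is a hypothetical stand-in for the real graph-building call, and cProfile itself adds noticeable overhead.)

            ```python
            import cProfile
            import pstats

            def profile_call(func, output="qgraph.prof", limit=30):
                """Run func under cProfile, dump stats, and print the top cumulative-time entries."""
                profiler = cProfile.Profile()
                profiler.enable()
                result = func()
                profiler.disable()
                profiler.dump_stats(output)
                pstats.Stats(output).sort_stats("cumulative").print_stats(limit)
                return result

            # e.g. profile_call(lambda: build_quantum_graph(...))  # hypothetical entry point
            ```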

            Once I've got a Python profile, I'll switch my environment to the latest weekly and repeat the experiment with Oracle (which will also test my fixes to the bootstrap script).

            One solid lesson-learned so far: we'll probably want to implement DM-24432 (calibration collections) before doing too much more in the way of profiling/optimization, because it's already clear that calibration lookups are too slow, and I don't want to spend time optimizing code paths that need to be rewritten for other reasons anyway.

            jbosch Jim Bosch added a comment -

            I've finished a run with profiling enabled (and with the bug that killed the last one avoided by disabling ForcedPhotCcdTask, the last step in the pipeline and one often skipped for other reasons).

            First off, the profiling overheads appear to be significant, unless something else (load, caching) changed under me: the last run took just under 4 hours total, while this one took about 6 hours.

            A breakdown from status logging (see the timing sketch after this list):

            • 23m: executing and iterating over rows from the big initial query
            • 10m: resolving (finding dataset_ids and runs) for input datasets (almost entirely raws)
            • 1h20m: assembling "isr" quanta (includes calibration dataset lookups)
            • 1h10m: assembling "charImage" quanta
            • 2h: assembling "calibrate" quanta (includes refcat lookups)
            • 30m: assembling "makeWarp" quanta
            • 5m: assembling quanta for all  other tasks combined
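            (Not the actual status-logging code, but a per-stage breakdown like the one above only needs a simple wall-clock timer around each phase; a minimal sketch:)

            ```python
            import logging
            import time
            from contextlib import contextmanager

            _LOG = logging.getLogger("qgraph.timing")

            @contextmanager
            def stage(name):
                """Log the wall-clock time spent inside a named phase."""
                start = time.time()
                try:
                    yield
                finally:
                    _LOG.info("%s took %.1f s", name, time.time() - start)

            # e.g. with stage("assembling 'isr' quanta"): ...
            ```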

            And the big news is that the profile (to the extent it can be trusted, given the overheads) says almost all of the time is going into Registry.relateDataIds, which we use to test each Quantum + DatasetRef pair for compatibility while assembling the Quantum.  I think the vastly different amounts of time spent on different tasks just correspond to the number of such combinations for those tasks (given their DatasetTypes and the number of data IDs for them), and now that I think about it, there's no way that testing scales linearly with the size of the graph - it's got to be more like O(N^2), because the number of tests scales as the product of the number of Quanta and the number of DatasetRefs.

            The good news is that I think I see a way to restructure things to avoid calling Registry.relateDataIds (and this O(N^2) behavior) entirely, because the rows of the big initial query tell us everything we need about Quantum + DatasetRef relationships.  But that will take some restructuring, because we've thrown those rows away by the time we assemble the quanta themselves.
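            To make that concrete, here's a rough sketch of the restructuring idea (all names are hypothetical stand-ins for the real pipe_base structures, not the eventual implementation): the relationships come straight from co-appearance in a query row, so one pass over the rows replaces the per-pair compatibility tests.

            ```python
            from collections import defaultdict

            def relate_from_rows(rows, tasks):
                """Group input-dataset data IDs under the quantum data IDs they share a row with.

                ``rows`` are the data IDs yielded by the big initial query; each task is assumed
                to know the dimensions of its quanta and of its input dataset types.
                """
                quantum_inputs = defaultdict(lambda: defaultdict(set))
                for row in rows:
                    for task in tasks:
                        quantum_id = row.subset(task.quantum_dimensions)
                        for dataset_type in task.input_dataset_types:
                            dataset_id = row.subset(dataset_type.dimensions)
                            # The relationship is implied by the row itself, so no
                            # Registry.relateDataIds call (and no O(N^2) pairing) is needed.
                            quantum_inputs[task.label, quantum_id][dataset_type.name].add(dataset_id)
                return quantum_inputs
            ```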

            An aside for Nate Lust: I'm a bit worried the design you're implementing may run into the same scaling problem, or something very much like it.  It would be worthwhile to do the thought experiment of walking through all of the steps, looking explicitly for any whose number scales as the product of the numbers of two kinds of data IDs.

            jbosch Jim Bosch added a comment -

            Good news: after fixing the O(N^2) logic as described above, the total time is down to 1h23m (it was a lot easier than I thought).  This was once again with 1 tract, all 5 bands, with profiling enabled, on SQLite.

            The breakdown now is:

            1. 59%: ref_cat and (especially) calibration product lookups.  The latter is slated to be rewritten on DM-24432 anyway (both interface and implementation), and I'll work on it there.  About 1/3 of this time is just constructing essentially the same query over and over again (for different dataset types and data IDs), and that's very useful information going into the rewrite.
            2. 17%: follow-up small queries to expand data IDs.  I think almost all of this can be avoided with some Python-side caching of dimension records (see the sketch after this list) - we already cache (and never expire) dimensions for which we expect the total number to be small (instruments, detectors, filters), but we'd need to implement some expiration system to safely extend this to other dimensions.
            3. 12%: raw dataset_id and run lookups.  This is running the same query over and over, but it's simple enough in this case that essentially all the time is spent in SQLAlchemy executing the query or fetching rows.  A big question is whether there is significant room for improvement on the database side, or if we're just running it so many times that we get killed by the overheads (which will be worse with Oracle).  If it's the latter, we may be able to vectorize it somehow, or even merge it with the big initial query (what we used to do, and what led to the super-complex subqueries we used to have).
            4. ~10%: pure Python logic and built-in container operations (dicts, sets) involved in processing data ID results from the big initial query.  There may be some common-case shortcuts that could be improved to sidestep some of this, but for the most part I think this is the cost of doing business in Python.  A C/C++[/Rust?] DataCoordinate class would probably crush this to nothing, and might be a project for a rainy day when the important stuff is long behind me.
            5. 1.6%: executing and fetching rows from the big initial query itself.  I am very happy that this fraction is so small, because it's the part that I think would be hardest to optimize (for me, at least).
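            A hedged sketch of what the Python-side caching in (2) could look like; the lookup call below is a hypothetical stand-in for the real Registry method, and the lru_cache maxsize stands in for a proper expiration policy:

            ```python
            from functools import lru_cache

            def make_record_fetcher(registry, maxsize=100_000):
                """Return a memoized dimension-record lookup keyed by (element, data ID values)."""

                @lru_cache(maxsize=maxsize)  # crude stand-in for a real expiration policy
                def fetch(element_name, data_id_values):
                    # data_id_values must be hashable, e.g. a tuple of required dimension values;
                    # the call below is a hypothetical stand-in for the real Registry API.
                    return registry.fetchDimensionRecord(element_name, data_id_values)

                return fetch
            ```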

            A few more thoughts on optimization that aren't as directly tied to any of the above:

            • I think I'd like to start "remembering" which dataset types are in at least some collections, so we can simplify a lot of the queries we run (reducing and probably mostly eliminating UNION ALL subqueries).  There are a few ways to approach this (a rough sketch of one follows this list), and I'll create a ticket for it later today.  This could affect (1), (3), and (5) above.
            • We should try to shrink the padding added to the HSC raw regions (DM-24024).  That could reduce the number of patch/detector overlaps and hence the number of initial query result rows significantly, which should speed up both (2) and (4) just by having that code run a smaller number of times.
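            For the "remembering" idea in the first bullet, one possible shape (purely illustrative, all names hypothetical) is a per-collection summary consulted before building the per-collection subqueries, so collections that cannot contain a dataset type are skipped instead of contributing a UNION ALL term:

            ```python
            class CollectionSummary:
                """Track which dataset types each collection is known to contain."""

                def __init__(self):
                    self._dataset_types = {}  # collection name -> set of dataset type names

                def record(self, collection, dataset_type):
                    self._dataset_types.setdefault(collection, set()).add(dataset_type)

                def prune(self, dataset_type, collections):
                    """Drop collections known not to contain dataset_type; keep unknown ones."""
                    return [c for c in collections
                            if dataset_type in self._dataset_types.get(c, {dataset_type})]
            ```
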
            jbosch Jim Bosch added a comment -

            I've finally fixed the silly bugs that prevented me from getting Oracle runs through.

            We'll start with bootstrap timings. Total time was much longer: 10h15m. That breaks down into:

            • 2h50m ingesting raws (10m slower than SQLite)
            • 2h40m defining visits (2h30m slower than SQLite).
            • 3h7m expanding data IDs for brightObjectMask: I didn't even bother to write down how much time this took in SQLite, because it was tiny. Christopher Stephens spotted this as a problem in the Oracle logs before I even finished the profile: it's a ton of tiny, repetitive fetches from the patch table, and I've already opened a ticket to fix it.
            • 1h20m ingesting calibration products (10m faster than SQLite).

            So, most of the difference from SQLite (expanding data IDs) is well understood and definitely fixable, though it may take a bit of restructuring - not only are we pulling down many of the same rows over and over again, we're pulling down rows that another part of the code just inserted. So at worst we just need to make those different parts of the code communicate better.
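            A hedged sketch of the "communicate better" direction for the brightObjectMask case: fetch (or hold on to) all the patch records for the skymap up front and expand mask data IDs against that in-memory map, instead of issuing one tiny query per data ID.  The query method is a hypothetical stand-in for the real Registry call.

            ```python
            def prefetch_patches(registry, skymap_name):
                """Cache every patch record for a skymap with a single query (hypothetical API shape)."""
                return {
                    (rec.tract, rec.patch): rec
                    for rec in registry.queryDimensionRecords("patch", skymap=skymap_name)
                }
            ```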

            The difference in visit definition doesn't have as much of a smoking gun, but it both inserts a lot of fairly big rows (regions) into the database and does some of its own possibly-repetitive fetching. I've created DM-24968 for this.

            jbosch Jim Bosch added a comment - - edited

            And now, the Oracle QuantumGraph-generation timing report. Once again, things are significantly slower, almost a factor of two: the same test (1 tract, 5 bands, profiling enabled) that took 1h23m on SQLite took 2h18m on Oracle. Here's the breakdown and comparison:

            1. 39%/53m: calibration product lookups (59%/49m in SQLite)
            2. 35%/47m: expanding data IDs (17%/14m in SQLite)
            3. 13%/18m: raw dataset_id and run lookups (12%/10m in SQLite)
            4. 11%/16m: pure-Python data ID manipulations (10%/8m in SQLite)
            5. 1%/80s: executing and fetching rows from the big initial query (1.6%/80s in SQLite)

            For the most part, the conclusions about what the problems are are the same, but even stronger: it's the thousands of tiny follow-up queries for dimension records (in expanding data IDs) and for looking up datasets that are killing us. It seems probable that the most important difference from SQLite is just the addition of network latency to the mix, but fixing the problems we saw in SQLite should also take care of the worst problems in Oracle, and of most of what seems to be making Oracle slower than SQLite3.
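            A back-of-envelope check of the latency hypothesis (both numbers below are hypothetical, not measured):

            ```python
            n_queries = 50_000  # hypothetical count of tiny follow-up queries in a run
            round_trip = 0.02   # hypothetical 20 ms client <-> Oracle round trip, in seconds
            print(f"latency alone: ~{n_queries * round_trip / 60:.0f} minutes")  # ~17 minutes
            ```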

            But there are also seeming inconsistencies here between the (wall-clock) times I reported overall and the profile, and I suspect the profile breakdown might not actually show some wall-clock time being spent in the database (which may also explain why the pure-Python chunk I called out above seemingly took longer with Oracle). I'll look into that later if necessary once the more obvious problems are dealt with. For now, I'd just like to get the improvements I already have made on this ticket out for review.

            jbosch Jim Bosch added a comment -

            Nate Lust, I'm afraid you're the logical reviewer for this one. There are many branches, but most now exist just to appease Jenkins and have been reset to master; the only ones that won't be deleted before merge are:

            • pipe_base (https://github.com/lsst/pipe_base/pull/128): optimization of QuantumGraph generation by relating dataset and quantum data IDs immediately when processing results from the big initial query;
            • gen3-hsc-rc2 (https://github.com/lsst-dm/gen3-hsc-rc2/pull/4): this is in lsst-dm, so it doesn't need a formal review, but I wanted to keep you apprised of both where the pipeline YAML definition I've been using is (here!) and the workaround for making sure we've got the skymap dataset in the repo. Note that I decided not to try to normalize the pipeline definitions across pipe_tasks, ci_hsc_gen3, [obs_subaru], and here, as in the original ticket description - I think it makes sense to do DM-24817 first.
            jbosch Jim Bosch added a comment -

            Nate Lust, did you want to take another look?  If not, please hit "Reviewed".


              People

              • Assignee:
                jbosch Jim Bosch
                Reporter:
                jbosch Jim Bosch
                Reviewers:
                Nate Lust
                Watchers:
                Christopher Stephens, Jim Bosch, Nate Lust, Tim Jenness
              • Votes:
                0
                Watchers:
                4
