Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-24938

Add system for pre-fetching, caching, and in-memory joins of dimension information

    Details

    • Story Points:
      15
    • Epic Link:
    • Team:
      Data Release Production
    • Urgent?:
      No

      Description

      DM-24734 revealed that repeated lookups for the same dimension records make up a significant fraction of both Gen3 RC2 bootstrap (~47% in Oracle) and QuantumGraph generation (17% in SQLite). This could be at least mostly addressed by caching more of those lookups, but we need to define interfaces that allow user code to make it clear when caches can be expired, and it would be even better to be able to do bulk fetches of these records to populate those caches up-front.

      At the same time, discussions with Nate Lust about DM-21904 have revealed a very similar use case for pre-fetching and then making simple relationship queries (e.g. "give me the tracts related to this visit") against an in-memory data structure.

      I have an idea for how to deal with both of these together, and I think I just need to work out the interfaces in more detail with Nate to ensure it will meet his needs.

        Attachments

          Issue Links

            Activity

            Hide
            jbosch Jim Bosch added a comment -
            Show
            jbosch Jim Bosch added a comment - Current design sketch: https://gist.github.com/TallJimbo/1a2262743e5bfc34a8c1dd6100ea98d9
            Hide
            jbosch Jim Bosch added a comment -

            This has moved away from the original ticket description, in the direction of putting original query results into a temporary table so we can do the follow-up joins and other queries in the database without having to repeat the original big query.  The functional implementation is done, but it needs lots of documentation and commit-cleanup, and I need to check the test coverage (for the most part, the new functionality is covered by one mega-test I added at the end).

            Show
            jbosch Jim Bosch added a comment - This has moved away from the original ticket description, in the direction of putting original query results into a temporary table so we can do the follow-up joins and other queries in the database without having to repeat the original big query.  The functional implementation is done, but it needs lots of documentation and commit-cleanup, and I need to check the test coverage (for the most part, the new functionality is covered by one mega-test I added at the end).
            Hide
            jbosch Jim Bosch added a comment -

            I've just finished the first profiles for QuantumGraph generation with this branch.  Let's start by recapping where we were when we started - basically the status at the end of DM-24734, but I've re-run those profiles against the w_2020_25 and am reporting those more recent results to make sure it's a fair comparison for this branch.

            Test is building a single-tract quantum graph for the full DRP Gen3 pipeline, with the database loaded with all three RC2 tracts, in a SQLite database on /project.

            On w_2020_25 (should be the same as master), total time was 57m54s:

            • 26m37s: executing the big initial query, expanding its data IDs, and associating quanta and dataset data IDs.  These steps are not easily separated in the logs on master, because they're interleaved.  The profile lets us break them down, but the times are measured slightly differently so this is approximate:
              • 2m0s: executing the big initial query
              • 16m40s: expanding data IDs
              • 7m95s: associating quantum and dataset data IDs (probably slightly overestimated relative to the others)
            • 1m6s: follow-up queries for regular inputs (raw and deepCoadd_skyMap)
            • 27m28s: follow-up queries for ISR prerequisites (calibrations)
            • 1m9s: follow-up queries for calibrate prerequisites (refcats)

            On this branch, total time was 54m51s:

            • 8m13s: executing the big initial query, expanding its data IDs, and associating quanta and dataset data IDs.  These steps are now separable in the logs, yielding:
              • 1m25s: executing the big initial query
              • 0m2s: expanding data IDs
              • 6m46s: associating quantum and dataset data IDs
            • 0m4s: follow-up queries for regular inputs (raw and deepCoadd_skyMap)
            • 42m57s: follow-up queries for ISR prerequisites (calibrations)
            • 1m21s: follow-up queries for calibrate prerequisites (refcats)

            At first glance, it looks like things barely improved, because some steps got faster while the single slowest step got slower, but if you connect this to what the changes were supposed to do, it's really promising.  My goals on this ticket were:

            • Speed up data ID expansion: absolutely crushed this, going from ~17m to 2s.
            • Speed up "normal" (non-prerequisite) dataset lookups: crushed this, going from 1m down to 4s.
            • Set the stage for DM-24432, where I'd work on speeding up calibration lookups.  That's what's gotten slower here, and it's because I didn't try to replace this loop over (of order) 20-50k complex, single-row queries with something vectorized; instead I made it a bit slower to construct each of those queries because the normal dataset lookup path (which I did vectorize) uses some of the same code.

            So, I think this ticket has accomplished its goals (assuming these results hold up in PostgreSQL), and it's time to clean it up and move on to DM-24432.  I'm pretty confident I can get that ~40m block down to <5m at the worst, and possibly <1m.

            The part of these profiles that's going to be hardest to improve is probably the ~7m we take associating quantum and dataset data IDs, as that's pure Python with no database stuff involved (note that there's been no real attempt to optimize the database indices or structure to improve these queries, aside from the big-picture changes we've made since the last round of RC2+Gen3 work over a year ago).  Dropping spurious overlaps (DM-24024, for HSC, at least) should help by reducing the amount of work to actually do there, but I don't really know how much.  There might be a way to use the database more in that step, especially now that the thing we iterate over there is already materialized in a temporary table, and it's probable that DM-21904 would move us in that direction anyway.  Something to ponder once DM-24432 is done and we've got QG generation down from almost an hour to (hopefully) 10-12 minutes.

             

            Show
            jbosch Jim Bosch added a comment - I've just finished the first profiles for QuantumGraph generation with this branch.  Let's start by recapping where we were when we started - basically the status at the end of DM-24734 , but I've re-run those profiles against the w_2020_25 and am reporting those more recent results to make sure it's a fair comparison for this branch. Test is building a single-tract quantum graph for the full DRP Gen3 pipeline, with the database loaded with all three RC2 tracts, in a SQLite database on /project. On w_2020_25 (should be the same as master), total time was 57m54s: 26m37s: executing the big initial query, expanding its data IDs, and associating quanta and dataset data IDs.  These steps are not easily separated in the logs on master, because they're interleaved.  The profile lets us break them down, but the times are measured slightly differently so this is approximate: 2m0s: executing the big initial query 16m40s: expanding data IDs 7m95s: associating quantum and dataset data IDs (probably slightly overestimated relative to the others) 1m6s: follow-up queries for regular inputs (raw and deepCoadd_skyMap) 27m28s: follow-up queries for ISR prerequisites (calibrations) 1m9s: follow-up queries for calibrate prerequisites (refcats) On this branch, total time was 54m51s: 8m13s: executing the big initial query, expanding its data IDs, and associating quanta and dataset data IDs.  These steps are now separable in the logs, yielding: 1m25s: executing the big initial query 0m2s: expanding data IDs 6m46s: associating quantum and dataset data IDs 0m4s: follow-up queries for regular inputs (raw and deepCoadd_skyMap) 42m57s: follow-up queries for ISR prerequisites (calibrations) 1m21s: follow-up queries for calibrate prerequisites (refcats) At first glance, it looks like things barely improved, because some steps got faster while the single slowest step got slower, but if you connect this to what the changes were supposed to do, it's really promising.  My goals on this ticket were: Speed up data ID expansion: absolutely crushed this, going from ~17m to 2s. Speed up "normal" (non-prerequisite) dataset lookups: crushed this, going from 1m down to 4s. Set the stage for DM-24432 , where I'd work on speeding up calibration lookups.  That's what's gotten slower here, and it's because I didn't try to replace this loop over (of order) 20-50k complex, single-row queries with something vectorized; instead I made it a bit slower to construct each of those queries because the normal dataset lookup path (which I did vectorize) uses some of the same code. So, I think this ticket has accomplished its goals (assuming these results hold up in PostgreSQL), and it's time to clean it up and move on to DM-24432 .  I'm pretty confident I can get that ~40m block down to <5m at the worst, and possibly <1m. The part of these profiles that's going to be hardest to improve is probably the ~7m we take associating quantum and dataset data IDs, as that's pure Python with no database stuff involved (note that there's been no real attempt to optimize the database indices or structure to improve these queries, aside from the big-picture changes we've made since the last round of RC2+Gen3 work over a year ago).  Dropping spurious overlaps ( DM-24024 , for HSC, at least) should help by reducing the amount of work to actually do there, but I don't really know how much.  There might be a way to use the database more in that step, especially now that the thing we iterate over there is already materialized in a temporary table, and it's probable that DM-21904 would move us in that direction anyway.  Something to ponder once DM-24432 is done and we've got QG generation down from almost an hour to (hopefully) 10-12 minutes.  
            Hide
            jbosch Jim Bosch added a comment -

            Timing results for PostgreSQL, performing the same test with the same branches.

            Here's w_2020_25, which was almost twice as slow as the same test on SQLite at 1h55m23s:

            • 1h3m12s: executing the big initial query, expanding its data IDs, and associating quanta and dataset data IDs
              • 46m30s: expanding data IDs
            • 0m54s: follow-up queries for regular inputs (raw and deepCoadd_skyMap)
            • 47m56s: follow-up queries for ISR prerequisites (calibrations)
            • 2m0s: follow-up queries for calibrate prerequisites (refcats)

            and on this ticket branch:

            • 0m26s: executing the big initial query
            • 0m2s: expanding data IDs
            • 6m1s: associating quantum and dataset data IDs
            • 0m4s: follow-up queries for regular inputs (raw and deepCoadd_skyMap)
            • 47m42s: follow-up queries for ISR prerequisites (calibrations)
            • 2m26s: follow-up queries for calibrate prerequisites (refcats)

            So, the previous performance was much worse for PostgreSQL, and this was due in large part to data ID expansion being much slower.  As hoped, this ticket fixed that, bringing PostgreSQL performance in line with SQLite for essentially all steps.

            Show
            jbosch Jim Bosch added a comment - Timing results for PostgreSQL, performing the same test with the same branches. Here's w_2020_25, which was almost twice as slow as the same test on SQLite at 1h55m23s: 1h3m12s: executing the big initial query, expanding its data IDs, and associating quanta and dataset data IDs 46m30s: expanding data IDs 0m54s: follow-up queries for regular inputs (raw and deepCoadd_skyMap) 47m56s: follow-up queries for ISR prerequisites (calibrations) 2m0s: follow-up queries for calibrate prerequisites (refcats) and on this ticket branch: 0m26s: executing the big initial query 0m2s: expanding data IDs 6m1s: associating quantum and dataset data IDs 0m4s: follow-up queries for regular inputs (raw and deepCoadd_skyMap) 47m42s: follow-up queries for ISR prerequisites (calibrations) 2m26s: follow-up queries for calibrate prerequisites (refcats) So, the previous performance was much worse for PostgreSQL, and this was due in large part to data ID expansion being much slower.  As hoped, this ticket fixed that, bringing PostgreSQL performance in line with SQLite for essentially all steps.
            Hide
            jbosch Jim Bosch added a comment -

            All code for this ticket has either been merged to master (via DM-25776) or is in review on DM-25919, and I've deleted the original branches with this ticket number. I'm keeping the SP estimate here (and bumping it up quite a bit to reflect how long it actually took, though that's still just an estimate). I'll close this ticket when DM-25919 is merged.

            Show
            jbosch Jim Bosch added a comment - All code for this ticket has either been merged to master (via DM-25776 ) or is in review on DM-25919 , and I've deleted the original branches with this ticket number. I'm keeping the SP estimate here (and bumping it up quite a bit to reflect how long it actually took, though that's still just an estimate). I'll close this ticket when DM-25919 is merged.

              People

              • Assignee:
                jbosch Jim Bosch
                Reporter:
                jbosch Jim Bosch
                Watchers:
                Dino Bektesevic, Jim Bosch, Tim Jenness
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Summary Panel