Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-15476

Optimize DiaSource access in l1dbproto

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: L1 Database
    • Labels:

      Description

      DiaSource table takes larges fraction of time when reading data, even at 1-month history depth, this will definitely get worse as we scale it to 12 months.

      Couple oif things to investigate in this ticket:

      • Check if region-based select can work better than ID-based select
      • Look at the indexing options for DiaSource to see if current indexing can be improved (without bit redesign of the whole thing).

      Potentially look also at other options:

      • partitioning of DiaSource
      • pre-fetching of (a part of a) DiaSource history
      • in-memory database table for DiaSource history

        Attachments

          Issue Links

            Activity

            Hide
            salnikov Andy Salnikov added a comment - - edited

            Started working on understanding/improving DiaSource select performance. Notebook: https://github.com/lsst-dm/l1dbproto-notebooks/blob/master/DM-15476%20Optimize%20DiaSource%20access.ipynb

            Also ideas are on page 16 of my slides from PCW: https://project.lsst.org/meetings/lsst2018/sites/lsst.org.meetings.lsst2018/files/l1dbproto.pdf

            First thing to try is region-based select. I do not actually expect it to be faster than ID-based select, but it would allow us to run DiaSource select without waiting for DiaObject query to finish, basically try to increase parallelism (though more parallelism may not be an option because of bottleneck on Oracle side).

            [Copy&paste from notebook]
            In previous tests DiaSource was read using DiaObject IDs, I think in my very early tests with MySQL/Postgres I compared ID-based select with region-based and ID-based one was faster (but apparently I have no record of that in notebooks). I want to repeat comparison with Oracle to quantify the effect.

            Naively it would seem that both types of selects should performs similarly:

            • DiaSource PK is diaSourceId
            • access based on DiaObject ID should be using index on diaObjectId column
            • similarly region-based access should be using index on pixelId column
            • difference in SELECT is equality match WHERE diaObjectId IN (:id1, :id2, :id3) vs. WHERE pixelId BETWEEN :p1 AND :p2 OR pixelId BETWEEN :p3 and :p4 ..., range-based select may be mode complicated for optimizer (but should not be)
            • another difference is that region-based select will return more data that needed as region envelope will always be larger, how many more depends on few factors, but typically will be 50-100% more (which may be important as we include history of all selected sources).

             For reference this is how DisSource select performed in previous ticket where is was done by diaObjectId:

             
            And here is how it works with spacial selection:

            Spacial SELECT seems to run approximately 50% slower than ID-based select with the current settings.

            It could be that we are selecting too many Sources, but I doubt that it would cost us extra 50% in time. Still I can try to increase granularity of the spacial select to see its effect on that time.

            Show
            salnikov Andy Salnikov added a comment - - edited Started working on understanding/improving DiaSource select performance. Notebook: https://github.com/lsst-dm/l1dbproto-notebooks/blob/master/DM-15476%20Optimize%20DiaSource%20access.ipynb Also ideas are on page 16 of my slides from PCW: https://project.lsst.org/meetings/lsst2018/sites/lsst.org.meetings.lsst2018/files/l1dbproto.pdf First thing to try is region-based select. I do not actually expect it to be faster than ID-based select, but it would allow us to run DiaSource select without waiting for DiaObject query to finish, basically try to increase parallelism (though more parallelism may not be an option because of bottleneck on Oracle side). [Copy&paste from notebook] In previous tests DiaSource was read using DiaObject IDs, I think in my very early tests with MySQL/Postgres I compared ID-based select with region-based and ID-based one was faster (but apparently I have no record of that in notebooks). I want to repeat comparison with Oracle to quantify the effect. Naively it would seem that both types of selects should performs similarly: DiaSource PK is diaSourceId access based on DiaObject ID should be using index on diaObjectId column similarly region-based access should be using index on pixelId column difference in SELECT is equality match  WHERE diaObjectId IN (:id1, :id2, :id3)  vs.  WHERE pixelId BETWEEN :p1 AND :p2 OR pixelId BETWEEN :p3 and :p4 ... , range-based select may be mode complicated for optimizer (but should not be) another difference is that region-based select will return more data that needed as region envelope will always be larger, how many more depends on few factors, but typically will be 50-100% more (which may be important as we include history of all selected sources).  For reference this is how DisSource select performed in previous ticket where is was done by diaObjectId:   And here is how it works with spacial selection: Spacial SELECT seems to run approximately 50% slower than ID-based select with the current settings. It could be that we are selecting too many Sources, but I doubt that it would cost us extra 50% in time. Still I can try to increase granularity of the spacial select to see its effect on that time.
            Hide
            salnikov Andy Salnikov added a comment -

            Returning to this ticket for few more quick tests. I have just tried to increase the granularity of the regions (make them smaller) when doing spatial selection, for both DiaObjectLast and DiaSource. This had very negative effect on Oracle, bloody thing just froze and needed 100 times longer to run those queries. Restoring the granularity back to previous setting made it to behave reasonably again.

            Few more details - the granularity setting works by specifying number of ranges when calculating the envelope for a spacial region:

                pixelator = HtmPixelization(HTM_LEVEL)
                pixel_ranges = pixelator.envelope(region, NUM_RANGES)
            

            The NUM_RANGES above is what determines maximum number of index ranges on the output and as the result it can be used to control granularity.

            The query is generated from that set of pixel ranges would look like (in meta-language):

            SELECT *
            FROM DiaObjectLast
            WHERE
                pixelId BETWEEN :pixel_ranges[0][0] AND :pixel_ranges[0][1] OR
                pixelId BETWEEN :pixel_ranges[1][0] AND :pixel_ranges[1][1] OR
                pixelId BETWEEN :pixel_ranges[2][0] AND :pixel_ranges[2][1] OR
                ... and so on
            

            All my previous tests were done with NUM_RANGES=64, here I tried to use NUM_RANGES=256 and that causes Oracle meltdown, likely an optimizer has difficult time to do something with WHERE containing 256 sub-expressions.

            Show
            salnikov Andy Salnikov added a comment - Returning to this ticket for few more quick tests. I have just tried to increase the granularity of the regions (make them smaller) when doing spatial selection, for both DiaObjectLast and DiaSource. This had very negative effect on Oracle, bloody thing just froze and needed 100 times longer to run those queries. Restoring the granularity back to previous setting made it to behave reasonably again. Few more details - the granularity setting works by specifying number of ranges when calculating the envelope for a spacial region: pixelator = HtmPixelization(HTM_LEVEL) pixel_ranges = pixelator.envelope(region, NUM_RANGES) The NUM_RANGES above is what determines maximum number of index ranges on the output and as the result it can be used to control granularity. The query is generated from that set of pixel ranges would look like (in meta-language): SELECT * FROM DiaObjectLast WHERE pixelId BETWEEN :pixel_ranges[0][0] AND :pixel_ranges[0][1] OR pixelId BETWEEN :pixel_ranges[1][0] AND :pixel_ranges[1][1] OR pixelId BETWEEN :pixel_ranges[2][0] AND :pixel_ranges[2][1] OR ... and so on All my previous tests were done with NUM_RANGES=64, here I tried to use NUM_RANGES=256 and that causes Oracle meltdown, likely an optimizer has difficult time to do something with WHERE containing 256 sub-expressions.
            Hide
            salnikov Andy Salnikov added a comment -

            I ran ap_proto with few different values of NUM_RANGES to understand how it affects selection time and selection efficiency, results are in the notebook (link is at the top).

            I tried three values for HTM_MAX_RANGES parameter- 64, 32, and 128 (all previous measurements were done with 64). Here are few observations

            Selection efficiency

            Measured as a fraction of DiaObjects that fall into a CCD region compared to number of objects returned from database query. Query envelope should be larger in general but we want it as close as possible to actual region. Here is the measured efficiency for different settings:

             32: 0.68
             64: 0.79
            128: 0.87
            

            The overhead is not too terrible, for our default 64 regions it's about 20%.

            Here is the plot which shows this efficiency (colors correspond to different ranges settings):

            Time DiaSource select (that we want to optimize) shows some dependency on ranges setting but it's not very prominent (32 is worse than 64, but 128 is not too much better than 64)

             And total visit time show some weird behavior for 128:

            This is probably the same effect as I saw in previous comment with 256, though 128 is not too bad compared to 256. Interestingly DiaSource select does not show any bad behavior, but DiaObject select has similar picture for 128 though it's not as prominent. This is probably has something to do with concurrency (or lack of it).

            Show
            salnikov Andy Salnikov added a comment - I ran ap_proto with few different values of NUM_RANGES to understand how it affects selection time and selection efficiency, results are in the notebook (link is at the top). I tried three values for HTM_MAX_RANGES parameter- 64, 32, and 128 (all previous measurements were done with 64). Here are few observations Selection efficiency Measured as a fraction of DiaObjects that fall into a CCD region compared to number of objects returned from database query. Query envelope should be larger in general but we want it as close as possible to actual region. Here is the measured efficiency for different settings: 32: 0.68 64: 0.79 128: 0.87 The overhead is not too terrible, for our default 64 regions it's about 20%. Here is the plot which shows this efficiency (colors correspond to different ranges settings): Time DiaSource select (that we want to optimize) shows some dependency on ranges setting but it's not very prominent (32 is worse than 64, but 128 is not too much better than 64)  And total visit time show some weird behavior for 128: This is probably the same effect as I saw in previous comment with 256, though 128 is not too bad compared to 256. Interestingly DiaSource select does not show any bad behavior, but DiaObject select has similar picture for 128 though it's not as prominent. This is probably has something to do with concurrency (or lack of it).
            Hide
            salnikov Andy Salnikov added a comment -

            I think it is time to close this ticket and declare it a partial failure (or partial success).

            Summary of things that I learned:

            • region-based select does not improve performance for DiaSource, instead it makes it somewhat slower
            • for region-based select for DiaObjects current setting of number of ranges in query (64) seems to be close to optimal

            Before trying other strategies for speeding up DiaSource access I'd like to re-architect my ap_proto script a little and make it run on many hosts (could also help us with cloudy things). After that I can try to implement partitioning and/or pre-fetching to see how that can help us. Ultimately I'd like to try in-memory tables but it may be not trivial with Oracle.

            Show
            salnikov Andy Salnikov added a comment - I think it is time to close this ticket and declare it a partial failure (or partial success). Summary of things that I learned: region-based select does not improve performance for DiaSource, instead it makes it somewhat slower for region-based select for DiaObjects current setting of number of ranges in query (64) seems to be close to optimal Before trying other strategies for speeding up DiaSource access I'd like to re-architect my ap_proto script a little and make it run on many hosts (could also help us with cloudy things). After that I can try to implement partitioning and/or pre-fetching to see how that can help us. Ultimately I'd like to try in-memory tables but it may be not trivial with Oracle.
            Hide
            salnikov Andy Salnikov added a comment -

            Nothing to review here, closing.

            Show
            salnikov Andy Salnikov added a comment - Nothing to review here, closing.

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Watchers:
              Andy Salnikov, Colin Slater, Fritz Mueller, Vaikunth Thukral
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.