Data Management / DM-14712

PPDB Performance test on NCSA Oracle instances


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: L1 Database
    • Labels:
      None

      Description

      Run ap_proto against Oracle instance at NCSA and extract some performance numbers from these tests.

        Attachments

          Issue Links

            Activity

            vaikunth Vaikunth Thukral created issue -
            vaikunth Vaikunth Thukral made changes -
            Field Original Value New Value
            Epic Link DM-14711 [ 113996 ]
            vaikunth Vaikunth Thukral made changes -
            Risk Score 0
            salnikov Andy Salnikov made changes -
            Status To Do [ 10001 ] In Progress [ 3 ]
            salnikov Andy Salnikov made changes -
            Description Run ap_proto against Oracle instance at NCSA and extract some performance numbers from these tests.
            salnikov Andy Salnikov added a comment -

One potential issue that may be important for these tests is the quality of the cx_Oracle implementation (this is the driver used by sqlalchemy). A quick look at the code shows one potentially problematic thing:

• cx_Oracle ( https://oracle.github.io/python-cx_Oracle/ ) uses the Oracle ODPI-C library ( https://oracle.github.io/odpi/ ), which is a "simplified" wrapper for the complicated OCI library.
• One issue with OCI that I know of is inadequate defaults for SELECTs returning many (thousands of) rows. Oracle has a "prefetch buffer" for returned rows, but its default size is very low (2), meaning that many round trips to the server are needed to retrieve all rows.
• In OCI the size of the prefetch buffer can be adjusted using the OCI_ATTR_PREFETCH_ROWS and OCI_ATTR_PREFETCH_MEMORY statement attributes.
• ODPI does not have a way to make the same adjustments; in fact it explicitly sets the prefetch buffer size to 2 rows: https://github.com/oracle/odpi/blob/37a93bfeeb3ea8a2dcab333acab52fe0f7d58080/src/dpiStmt.c#L559-L569

This implies that queries that return many rows in ap_proto will not run efficiently. One potential way to fix it is to patch ODPI and rebuild it (likely together with cx_Oracle).
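For completeness, here is a minimal sketch of the client-side knobs that do exist (this is not ap_proto code; the connection string and table/column names are placeholders). The ODPI-C prefetch buffer itself is not adjustable, but cx_Oracle's separate fetch array size can be raised, which at least reduces the number of round trips at the fetch-call level, and sqlalchemy's cx_oracle dialect exposes the same setting:

```python
import cx_Oracle
from sqlalchemy import create_engine, text

# Plain cx_Oracle: arraysize controls how many rows each internal fetch call pulls.
conn = cx_Oracle.connect("user/password@dbhost/service")   # placeholder credentials
cur = conn.cursor()
cur.arraysize = 10000                                       # default is 100
cur.execute('SELECT "diaObjectId" FROM "DiaObjectLast"')    # hypothetical query
rows = cur.fetchall()

# The sqlalchemy cx_oracle dialect accepts the same knob at engine creation time.
engine = create_engine(
    "oracle+cx_oracle://user:password@dbhost/?service_name=service",
    arraysize=10000,
)
with engine.connect() as sa_conn:
    rows = sa_conn.execute(text('SELECT "diaObjectId" FROM "DiaObjectLast"')).fetchall()
```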

            salnikov Andy Salnikov made changes -
            Attachment ora-1-def-5x5.png [ 33093 ]
            salnikov Andy Salnikov made changes -
            Attachment ora-2-ssd-1x1.png [ 33094 ]
            salnikov Andy Salnikov made changes -
            Attachment ora-3-ssd-5x5.png [ 33095 ]
            salnikov Andy Salnikov made changes -
            Attachment ora-4-spin-5x5.png [ 33096 ]
            salnikov Andy Salnikov made changes -
            Attachment pg-ssd-1x1.png [ 33099 ]
            salnikov Andy Salnikov made changes -
            Attachment pg-ssd-5x5.png [ 33100 ]
            salnikov Andy Salnikov added a comment - - edited

I ran a small bunch of tests against the Oracle instance; here are some details with plots.

The Oracle instance is configured with two tablespaces, LSST_ALERTS_SSD and LSST_ALERTS_SPIN; the default tablespace is LSST_ALERTS_SPIN.

I ran my first test with the default tablespace (at the time I thought it should be LSST_ALERTS_SSD, but it was LSST_ALERTS_SPIN). The test was mostly to check that things look reasonable after all the changes I needed to make to ap_proto for Oracle compatibility. The test was done with 5x5 tiling, meaning there are 25 processes reading/writing in parallel. Here is the plot for per-visit total processing time:

The numbers look surprisingly low for spinning disks. There are occasional outliers, but not too many of them. Not bad for a very first run.

These are the times when the tests were run (with some interruptions); all times are in the NCSA timezone.

First visit  Last Visit  Start time           End time
1            395         Unknown              Logs lost
396          1395        2018-06-01 13:26:17  2018-06-01 14:31:00
1396         2395        2018-06-08 18:46:47  2018-06-08 20:05:23
2396         4395        2018-06-10 01:10:51  2018-06-10 05:01:00
4396         9395        2018-06-10 10:19:04  2018-06-11 03:59:23

For the next test I switched to SSD, specifying the LSST_ALERTS_SSD tablespace explicitly for all tables/indices. I also decided to run this test with a single "tile" to compare it to the numbers that we got from Postgres. Here is the plot, together with an earlier plot from Postgres (the visit range is different):

Oracle seems to show worse performance here than what we got from Postgres, though the range of visits is very short, so it is hard to extrapolate to larger scales.

Times for the second series of tests:

First visit  Last Visit  Start time           End time
1            4078        2018-06-11 17:13:52  2018-06-12 22:33:58

The third test was 5x5 tiling on SSD, again to compare directly against Postgres (and against the first Oracle test on spinning disks). Here again are the Oracle plot and the Postgres plot:

And for this test Oracle performance does not look good. There are also regions of significantly worse performance.

Times for the third series of tests:

First visit  Last Visit  Start time           End time
1            1000        2018-06-13 00:57:26  2018-06-13 06:21:11
1001         2000        2018-06-13 08:42:31  2018-06-13 20:17:10

The last test switched back to spinning disks with 5x5 tiling, to see if I could reproduce my first test:

And this shows some weird behavior in the initial region (plus a few groups of outliers).

            Times for the last series of tests:

First visit  Last Visit  Start time           End time
1            1000        2018-06-14 00:41:42  2018-06-14 06:24:09
1001         2000        2018-06-14 12:26:31  2018-06-14 21:48:34
2001         3000        2018-06-15 11:46:55  2018-06-15 13:31:55
3001         6000        2018-06-15 15:18:16  2018-06-15 23:22:17
6001         14000       2018-06-16 01:02:28  2018-06-17 15:37:08

             

So, interesting questions about this:

• why SSD performance looks worse than spinning disks
• why spinning disk performance looks too good
• why we see those bunches of outliers (could it be because the client machine gets overloaded?)

I'll try to make more detailed plots (for individual types of queries) to see if there is anything interesting there. We also need to get more details about the Oracle setup, so it's time to talk to Chris.

            cs2018 Christopher Stephens made changes -
            Attachment lsst_ap_awr_reports.tar.gz [ 33130 ]
            cs2018 Christopher Stephens added a comment - - edited

            I've uploaded a set of AWR reports that capture each test run listed above with start and end snapshot times as close to the test start and end times as possible. We take AWR snapshots hourly so they don't match up very well but there is no other activity on the system right now so it shouldn't be a big deal. I've named them according to disk type (ssd vs spin), tiling, and start visit value.

            All of the reports show relatively idle CPU for each host ("OS Statistics By Instance" section) in the RAC cluster. The tests performed on spinning disks show I/O to be a significant contributor to overall response time ("Top Timed Events" -> db file sequential read event) while I/O isn't a significant contributor to response time for tests performed on SSD disks. We ran benchmarking software prior to opening up the environment for testing and the I/O times match closely with what we saw then.

            On one of the SSD tests "cursor: pin S wait on X" accounted for roughly 30% of response time. That will require further investigation.

Overall, CPU accounts for most of the response time for the SSD runs. Since there appears to have been plenty of idle CPU on the hosts during the tests, my initial impression is that we should also look at other areas to try to identify the reason for the poor results. That is not to say there aren't significant optimization opportunities in the database.

            Some other quick comments:

It looks like database connections are not being reused, which can cause unnecessary delays (and worse in the case of Oracle 12.2). We need to explore connection pooling.
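For illustration, a minimal client-side pooling sketch using cx_Oracle's SessionPool (host, service and credentials are placeholders, not the actual test setup); sqlalchemy's built-in connection pool would be an alternative, as long as worker processes live long enough to reuse it:

```python
import cx_Oracle

# One pool per worker process; sessions are created once and then borrowed per query.
pool = cx_Oracle.SessionPool(user="ap_user", password="secret",
                             dsn="dbhost/service", min=2, max=25, increment=1)

def run_query(sql, params=None):
    """Borrow a pooled session, run one statement, return the session to the pool."""
    conn = pool.acquire()
    try:
        cur = conn.cursor()
        cur.execute(sql, params or {})
        return cur.fetchall()
    finally:
        pool.release(conn)
```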

The SQL uses bind variables, but it looks like the names of those variables change, so we still end up with unnecessary hard parsing.
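A sketch of the general idea, with illustrative table/column names: keep the statement text and bind names identical across executions so Oracle can soft-parse instead of hard-parsing each variant:

```python
from sqlalchemy import text

# Anti-pattern: a literal IN-list rebuilt per call changes the SQL text every time,
# so each variant gets hard-parsed.
def select_literal(conn, ids):
    id_list = ", ".join(str(int(i)) for i in ids)
    sql = 'SELECT "ra", "decl" FROM "DiaObjectLast" WHERE "diaObjectId" IN (%s)' % id_list
    return conn.execute(text(sql)).fetchall()

# Better: one fixed statement with fixed bind names, executed many times with
# different values (chunked to a fixed-size IN list so the text never changes).
_STMT = text('SELECT "ra", "decl" FROM "DiaObjectLast" '
             'WHERE "diaObjectId" IN (:id0, :id1, :id2, :id3)')

def select_chunk(conn, four_ids):
    params = {"id%d" % i: v for i, v in enumerate(four_ids)}   # exactly four values
    return conn.execute(_STMT, params).fetchall()
```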

            We see slow inserts on tables with BLOBs. We need to make sure those are all necessary for the critical alert processing and explore optimizations.

Object names are created with quotes, resulting in case-sensitive table names. In my experience that gets frustrating to deal with. Hopefully that can be changed without much effort.

            I think the fact that the software worked as expected on first series of tests is fantastic. I think we can do things to speed this up significantly but we need a strategy with more testing and much more time analyzing the results.

            A final comment: This was a rushed review in an attempt to post something useful prior to our call June 21.  Apologies for any mistakes or misleading comments.

            salnikov Andy Salnikov added a comment -

A few things that we discussed at today's meeting, just to keep a record here:

• Sorry I did not answer your question above earlier, too many other things got in the way; here it is anyway:
  • Case-sensitivity of tables/columns/databases - all names are defined in a centrally-managed schema (our cat package) and I prefer not to mess with renaming, at least for now, as it may cause issues for other users of the same schema. I know it's a complicated topic, but let's try to stay backend-independent for now; if it proves to be an issue we'll need to talk to everyone involved and decide what to do at that point.
  • Session/connection reuse - this is a feature of my current prototype implementation, which forks a number of workers for each visit, and the workers terminate after finishing processing. OTOH I think that this is how the actual production system is going to work as well. If we could improve performance significantly by reusing connections on the client side then we should definitely look into that, but we need some numbers first.
• Before the meeting we saw a significant improvement in performance after Chris replaced the primary key index on the DiaObjectLast table with a wider index which also includes all columns from the SELECT clause. The explanation for this improvement is that the query can then be satisfied from index data only and does not need an extra read from the table segment. To me this sounds a bit fragile for cases when we need to select more columns (or all columns); see the sketch after this list.
  • After the meeting I was thinking about that and was wondering if IOT could help us here. I know that the table is probably too wide for the typical IOT use case, but we can at least test it.
• I'll run the prototype to generate many more visits to see how it scales further, and will add more plots.
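The sketch mentioned above: a rough illustration of the "wider" covering-index idea, with hypothetical key/column names and connection string (the actual index Chris created may differ):

```python
from sqlalchemy import create_engine, text

engine = create_engine("oracle+cx_oracle://user:password@dbhost/?service_name=service")

# Key columns first, then every column the association SELECT reads, so the query
# can be answered from index blocks alone without visiting the table segment.
covering_index = text('''
    CREATE INDEX "IDX_DiaObjectLast_cover"
        ON "DiaObjectLast" ("pixelId", "diaObjectId", "ra", "decl")
''')

with engine.begin() as conn:
    conn.execute(covering_index)
```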

             

            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov added a comment -

I ran a bunch of tests on Oracle in different configurations after moving the client application to the verification cluster. Here is a brief summary of what we have seen so far. There is also some info in my notebook, which has plenty of graphs and details (and is regularly updated): https://github.com/lsst-dm/l1dbproto-notebooks/blob/master/DM-14712%20Oracle%20initial%20test.ipynb

We were seeing some significant spikes in visit processing time; one type of spike happened regularly when a job starts and lasts for a few tens of visits. They look like short spikes in the 25-50 second range on this plot:

Chris says that they are likely related to unexpected messages observed on the Oracle side (cursor: pin S wait on X) which originate from queries on table/column metadata. These seem to be generated by sqlalchemy when it loads table metadata from the database. To suppress those queries I have replaced the metadata reading from the database with metadata built from configuration. This improved the situation dramatically; that kind of spike has disappeared entirely.
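The general technique looks roughly like this (illustrative column set, not the actual cat schema): declare the table metadata in code instead of reflecting it, so sqlalchemy never issues data-dictionary queries at connection time:

```python
from sqlalchemy import MetaData, Table, Column, BigInteger, Float

metadata = MetaData()

# Reflection would be Table("DiaObjectLast", metadata, autoload_with=engine);
# it is those reflection queries that showed up as "cursor: pin S wait on X".
dia_object_last = Table(
    "DiaObjectLast", metadata,
    Column("diaObjectId", BigInteger, primary_key=True),
    Column("pixelId", BigInteger, index=True),
    Column("ra", Float),
    Column("decl", Float),
)
```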

There is still another interesting effect that was observed in the two latest runs with 15x15 tiling (which makes about 200 concurrent clients). For the first several hundred visits performance drops very quickly, and then it almost suddenly improves:

For that run I had also taken snapshots of the Grafana load plot, and it reflects very closely what happens with performance:

Plots for network traffic show very similar behavior. For some reason performance degrades very quickly and then decides to improve itself. This is not likely to be related to client-side issues, as it repeated in a very similar pattern for two separate runs. It could be related to the query optimizer, which needs to collect some statistics to learn how to build an optimal plan. If that is true, we could try to give it a hint.

There are also a few occasional outliers on the plot above, in the range above 60 sec. It would be interesting to understand what causes those, or at least whether they are due to the client side or Oracle.

            vaikunth Vaikunth Thukral made changes -
            Sprint DB_F18_06 [ 744 ] DB_F18_06, DB_F18_07 [ 744, 758 ]
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov made changes -
            Attachment image-2018-07-09-10-56-11-980.png [ 33355 ]
            salnikov Andy Salnikov made changes -
            Attachment image-2018-07-09-10-56-11-980.png [ 33355 ]
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov added a comment - - edited

Test with parallel processing using spinning disks

I wanted to re-check how Oracle would behave in the case of multiple parallel jobs with the spinning-disk setup. Naturally I expected (much) worse performance compared to SSD, but the results look very different from my expectations, so I'm not really sure what I am measuring.

Here is the plot of visit time as a function of visit number:

The initial section of the plot shows a steep increase in processing time, which is somewhat expected. There was an Oracle (or networking) intervention after visit 765, for which ap_proto was stopped. After the intervention finished and the job was restarted, things magically started running much faster.

Here is the Grafana node load plot, which covers the whole period this job was running (times on this plot are in Pacific time):

There is a clear initial period (7/1 22:48 to 7/2 12:36) with very low load, which corresponds to the initial section of the above plot. Two other periods (7/2 15:16 to 7/2 16:32 and 7/2 23:15 to 7/3 8:13) correspond to better performance and show significantly higher load. There is also some strange oscillation effect observed after 7/3 3:00 on the plot.

Here, for completeness, is the outbound traffic from those three periods:

            cs2018 Christopher Stephens added a comment -

Current Oracle RAC configuration hardware details:

3 Dual Intel Xeon E5-2650 compute nodes

• 128GB memory
• 10Gbps network connectivity to both public and private interconnect networks
• 16Gbps fibre channel card for storage connectivity

1 Netapp storage array

• 16Gbps fibre channel connectivity
• 50 8TB HDDs
• 10 1.5TB SSDs
• RAID 10

1 16Gbps Cisco fibre channel switch

• all nodes and the Netapp array connected through this switch

2 10Gbps Juniper network switches

• one for public and one for private interconnect network
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov made changes -
            Risk Score 0 1
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov added a comment -

A few more tests were done to understand how we can fix the bootstrap slowdown that appears on all graphs.

Apparently this slowdown is caused by the query optimizer choosing an incorrect execution plan for the query on the DiaObjectLast table. With time it gets enough stats and switches to a better plan; this is what we see on the graphs.

The first thing that I tried was to give the optimizer a hint in the query itself to use the PK index, e.g. SELECT /*+ index("X15_DiaObjectLast" "X15_PK_DiaObjectLast") */ "X15_DiaObjectLast"."diaObjectId" .... That unfortunately did not have any effect; here is the plot for this attempt:

For the next run Chris dumped the stats table and gave me a SQL script which re-populates the stats, to be run after the tables are re-created. This worked much better; the peak completely disappeared:

The script that pre-loads table stats is not really human-readable, of course; I'm not sure that we can use it as it is, and it might stop working on any change in schema or query. I would still like to have an easier/simpler method of convincing Oracle to do what it is supposed to do.
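For reference, such a script is typically built on the standard DBMS_STATS export/import mechanism; a hedged sketch with placeholder schema/table names (the actual script from Chris may work differently):

```python
from sqlalchemy import create_engine, text

engine = create_engine("oracle+cx_oracle://user:password@dbhost/?service_name=service")

with engine.begin() as conn:
    # One-time: create a staging table and copy the current optimizer stats into it.
    conn.execute(text("""
        BEGIN
          DBMS_STATS.CREATE_STAT_TABLE(ownname => 'AP_SCHEMA', stattab => 'AP_STATS');
          DBMS_STATS.EXPORT_TABLE_STATS(ownname => 'AP_SCHEMA',
                                        tabname => 'DiaObjectLast',
                                        stattab => 'AP_STATS');
        END;
    """))

    # After the tables are dropped and re-created: seed the optimizer from the export.
    conn.execute(text("""
        BEGIN
          DBMS_STATS.IMPORT_TABLE_STATS(ownname => 'AP_SCHEMA',
                                        tabname => 'DiaObjectLast',
                                        stattab => 'AP_STATS');
        END;
    """))
```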

            salnikov Andy Salnikov added a comment -

A couple of observations regarding the execution plan

As Chris already noticed, the initial performance drop is due to the optimizer choosing an INDEX FULL SCAN plan for the DiaObjectLast SELECT over the more efficient INDEX RANGE SCAN (with a UNION over a set of multiple ranges). Pre-loading statistics gathered in previous tests seems to lock the optimizer into the right decision; I tried re-running the test a couple more times with pre-loaded stats and it works consistently. OTOH, I do not think we can use that as a production tool, as there is no guarantee this will keep working, e.g. with different indices.

I have also tried re-running the test without pre-loading stats, and in one case the optimizer could not switch to a better plan even after a very long run (usually it switches after ~1500 visits, but in this case it continued using the bad plan after 3000+ visits). So even gathering stats that can be pre-loaded later may be problematic.

            salnikov Andy Salnikov made changes -
            Risk Score 1 0
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov added a comment -

Index-Organized Tables

One more test that I have done is to try an IOT for the DiaObjectLast table. We have already seen that the data has to be in the index to get reasonable performance from the SELECT on that table. If we have to select many more columns from the table, it effectively means that most or all columns have to be stored in the index, and an IOT should be the most natural approach in this case.

Unfortunately performance with the IOT drops to an unacceptable level. Here is the plot of total visit processing time with the IOT:

The performance on the plot above varies between bad and very bad. I was checking the optimizer info, and I think that the very bad performance is again due to INDEX FULL SCAN, and the slightly better but still inferior performance (the lower band on the plot) is when the optimizer switches to the INDEX RANGE SCAN plan.

It would be interesting to understand how performance scales with the number of columns in the index; the plots above (with ~good performance) were obtained by selecting just 8 columns from the table. That may or may not be sufficient for Object/Source association; I'll need to get more details from the AP people. I'll try to re-run the tests without IOT but with a larger number of columns in the SELECT and in the index.

            salnikov Andy Salnikov added a comment -

Christopher Stephens, thanks for the machine/storage specs! Could you give a bit more info about the Netapp storage, in particular whether it has any kind of internal cache (SDRAM/NVMe), the cache size, etc.? Or maybe just give me a model number and I'll try to look it up on the Netapp site.

            cs2018 Christopher Stephens added a comment - - edited

            The performance issues surrounding unstable plans for SQL against DiaObjectLast were originally the result of Oracle not having a sufficiently efficient access path available to it. We solved that with a better index.

After that, Andy consistently noticed slow initial performance which switched to much better performance over the course of a test run. This occurs because the database schema used by the software is dropped and recreated for each test, which means that Oracle doesn't have any way to compare possible execution plans for the various SQLs. Since ~version 8i, Oracle switched the default optimizer from rule-based to cost-based, and that switch mandates that representative statistics be collected so Oracle knows the shape of the data it's being asked to manage.

I'm unsure how big a problem missing statistics actually is. It's my understanding that this database will only run in the consolidated database, so we don't have to worry too much about newly provisioned environments. Regardless, the Oracle database will always need representative statistics in order to come up with efficient execution plans.

Andy has expressed some concern that importing statistics as part of provisioning a new environment isn't a long-term solution, since schema edits will occur as development continues and the statistics script will have to be regularly regenerated. I understand the hesitation, since the script's contents are very Oracle-centric and not very accessible to reading. However, generating the script is very easy, and importing statistics into an empty table / schema / database is a commonly used approach. I think that, over time, everyone will see this is an essential step given how Oracle works.

Andy is right that seeding statistics doesn't lock in any particular execution plan, and he is concerned that the performance issue might return over time. That's true of all SQL hitting the database and is a very good reason for developing a solid statistics gathering and SQL plan management strategy. As Andy mentioned above, he did explore the use of SQL hints to try to convince Oracle to use the "right" plan without access to statistics. That can sometimes work, but hints never guarantee an execution plan for anything but the simplest SQL. To guarantee a plan you need a full set of hints that disable all other possible execution plans. Taking this approach is impractical: the necessary hints would change with each new available access path, software patch, upgrade, etc.

If plan stability is a concern (and it should be), Oracle has a SQL Plan Management feature that will allow us to lock down performance and only evolve SQL execution plans when we have verified that new plans perform better. This is another Oracle-centric approach that we will likely have to take to set our performance apart from other database products that lack some of these features. As the schema evolves and SQL changes, we will have to keep up with those changes, but that is all accounted for in the SPM functionality.
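For reference, the manual SPM workflow looks roughly like this (a hedged sketch of the feature described above, not something we have run here; the sql_id and connection string are placeholders): capture a known-good plan from the cursor cache as an accepted baseline so later parses cannot silently regress to a worse plan.

```python
from sqlalchemy import create_engine, text

engine = create_engine("oracle+cx_oracle://user:password@dbhost/?service_name=service")

with engine.begin() as conn:
    conn.execute(text("""
        DECLARE
          n PLS_INTEGER;
        BEGIN
          -- Load the currently cached plan(s) for one statement into a SQL plan baseline.
          n := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(sql_id => 'abcd1234efgh5');
        END;
    """))
```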

            cs2018 Christopher Stephens added a comment -

            To summarize (the previous post was a little long-winded):

We don't think the initial slowness is an issue outside of the way we are testing the AP software. If the initial performance of the system, prior to the database collecting its own set of statistics and altering SQL execution plans accordingly, is a concern, then we need to give Oracle what it needs by importing representative statistics.

            Outside of initial slowness, we think performance has been relatively good and stable. We'd like to get confirmation on that.

            We feel strongly that embracing Oracle specific features when they are the best available solution to any given problem is the right approach.

            salnikov Andy Salnikov added a comment -

Chris, thanks a lot for explaining the Oracle behavior. I understand that it can be fixed by providing representative statistics; my concern is that statistics collection can take longer than people would normally expect, and that under these conditions database initialization and statistics gathering (or re-initialization after a schema change) becomes a long and unpredictable process.

In any case, this ticket is just for fact gathering and understanding the behavior, not for making any decisions. There are still many tests that I want to do, in particular a long-running test to see how it behaves with many months' worth of data, to decide how we want to partition it.

Regarding comparison with other databases (Chris asked me in a private message) - if we ignore that initialization step for a moment, then Oracle performance seems to be on par with what I have seen with Postgres. A direct comparison does not make sense, of course, as the Postgres numbers were obtained on different hardware. Again, the goal of this ticket is to collect as much Oracle-related info as we can; all decisions and comparisons will be made later.

             

            cpond Christopher Pond added a comment -

Andy, do you know the specs of the hardware you ran the Postgres tests on? I know you've got some info on our hardware, but it might make more sense for Chris S and me to make a hardware comparison from our DB-centric point of view.

            salnikov Andy Salnikov added a comment - - edited

For the Postgres tests we used a single node at CC-IN2P3 (kindly made available for our tests on a temporary basis). The hardware specs for that node were:

• CPU Intel(R) Xeon(R) CPU E5-2680 v4, 2.4GHz
• 2x14 physical cores, 56 total threads
• 512GB RAM
• 2TB NVMe and 3TB SATA SSD
• 7TB of spinning disks in a RAID6 array

The same node was used for hosting both the database server and my prototype client (far from ideal, but we did not have other options).

            vaikunth Vaikunth Thukral made changes -
            Story Points 10 16
            vaikunth Vaikunth Thukral made changes -
            Sprint DB_F18_06, DB_F18_07 [ 744, 758 ] DB_F18_06, DB_F18_07, DB_F18_08 [ 744, 758, 771 ]
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov made changes -
            salnikov Andy Salnikov added a comment -

            Testing narrow DiaObjectLast table with IOT

In previous tests the DiaObjectLast table schema was essentially the same as the DiaObject schema. In principle we don't need all that info for association; only a relatively small subset of columns may be needed for it. After talking to Colin we agreed that this subset may be sufficient (even if it needs to be extended, it should not grow too large):

             

             - ra
             - decl
             - raSigma
             - declSigma
             - ra_decl_Cov
             - radecTai
             - pmRa
             - pmRaSigma
             - pmDecl
             - pmDeclSigma
             - parallax
             - parallaxSigma
             - pmRa_pmDecl_Cov
             - pmRa_parallax_Cov
             - pmDecl_parallax_Cov
            

Having a much narrower table should also help with IOT, which is very desirable; we do want to avoid building additional indices.
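A minimal DDL sketch of what the narrow, index-organized DiaObjectLast table could look like; the key columns ("pixelId", "diaObjectId"), the column types, and the connection string are assumptions, only the non-key column subset comes from the list above:

```python
from sqlalchemy import create_engine, text

engine = create_engine("oracle+cx_oracle://user:password@dbhost/?service_name=service")

ddl = text("""
    CREATE TABLE "DiaObjectLast" (
        "pixelId"             NUMBER        NOT NULL,
        "diaObjectId"         NUMBER        NOT NULL,
        "ra"                  BINARY_DOUBLE NOT NULL,
        "decl"                BINARY_DOUBLE NOT NULL,
        "raSigma"             BINARY_FLOAT,
        "declSigma"           BINARY_FLOAT,
        "ra_decl_Cov"         BINARY_FLOAT,
        "radecTai"            BINARY_DOUBLE,
        "pmRa"                BINARY_FLOAT,
        "pmRaSigma"           BINARY_FLOAT,
        "pmDecl"              BINARY_FLOAT,
        "pmDeclSigma"         BINARY_FLOAT,
        "parallax"            BINARY_FLOAT,
        "parallaxSigma"       BINARY_FLOAT,
        "pmRa_pmDecl_Cov"     BINARY_FLOAT,
        "pmRa_parallax_Cov"   BINARY_FLOAT,
        "pmDecl_parallax_Cov" BINARY_FLOAT,
        CONSTRAINT "PK_DiaObjectLast" PRIMARY KEY ("pixelId", "diaObjectId")
    ) ORGANIZATION INDEX
""")

with engine.begin() as conn:
    conn.execute(ddl)   # whole row is stored in the primary-key B-tree
```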

So I have implemented the new schema for the DiaObjectLast table and made it an IOT. Here are the results for a relatively long run (30k visits), with some visits excluded from the plots to make the fits nicer. The excluded regions were range(10, 1283) and range(29352, 29517) (the latter is when Oracle stats were dropped for a few hours and then re-enabled).

Here is the per-visit time, and for comparison the time for a similar configuration with Postgres:

And the same plot with a linear fit:

The fits are reasonably close; Oracle has a slightly higher constant factor, but at this scale it does not look too bad (though we should expect a lot more problems as we extend the time scale).

Here is also how the timing for individual queries scales with the number of visits (these times are averages for an individual per-CCD process; they do not add up to the times above):

Query                   Time @ visit=100  Time @ visit=30k
Select DiaObjectLast    0.1               0.3
Select DiaSource        0.43              4.2
Select DiaForcedSource  0.01              1.0
Insert DiaObjectLast    0.02              0.03
Insert DiaObject        0.02              0.2
Insert DiaSource        0.025             0.025
Insert DiaForcedSource  0.02              0.1

It is clear that the DiaSource select dominates as we go further. One option to explore here is whether a region-based select could work better for Oracle than the ID-based select (we do a somewhat crazy WHERE objectId IN (...) with many literals in the IN clause, which Oracle may not like at all).
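A hedged sketch contrasting the two query shapes (the "pixelId" spatial column and the two-range form are assumptions for illustration, not the actual ap_proto schema):

```python
from sqlalchemy import text

# Current shape: one huge IN list of literal object IDs, which Oracle has to parse
# and optimize anew whenever the list changes.
def select_sources_by_ids(conn, object_ids):
    id_list = ", ".join(str(int(i)) for i in object_ids)
    sql = 'SELECT * FROM "DiaSource" WHERE "diaObjectId" IN (%s)' % id_list
    return conn.execute(text(sql)).fetchall()

# Region-based alternative: a fixed statement over a small number of pixel ranges,
# using bind variables so the SQL text (and its plan) stays stable.
_REGION_STMT = text("""
    SELECT * FROM "DiaSource"
    WHERE ("pixelId" BETWEEN :lo0 AND :hi0)
       OR ("pixelId" BETWEEN :lo1 AND :hi1)
""")

def select_sources_by_region(conn, ranges):
    (lo0, hi0), (lo1, hi1) = ranges   # assume exactly two ranges for this sketch
    return conn.execute(_REGION_STMT,
                        {"lo0": lo0, "hi0": hi0, "lo1": lo1, "hi1": hi1}).fetchall()
```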

             

            salnikov Andy Salnikov added a comment - - edited

More tests with hints

I tried once again to use hints with the DiaObjectLast select to force it to do an INDEX RANGE scan instead of a FULL INDEX scan during the bootstrap stage. Here are the variations that I tried:

            • SELECT /*+ INDEX_RS_ASC("X15_DiaObjectLast" "X15_PK_DiaObjectLast") */ ...
            • SELECT /*+ INDEX_RS("X15_DiaObjectLast" "X15_PK_DiaObjectLast") */ ...
            • SELECT /*+ INDEX_RS_ASC(@SEL$1 "X15_DiaObjectLast" "X15_PK_DiaObjectLast") */ ... (SEL$1 comes from analyzer when it chooses FULL INDEX)

Nothing has worked so far; Oracle looks unconvinced, or I'm doing it wrong.

What is also interesting to me is that the INDEX_RS hint does not appear in the official Oracle docs, but there are some articles on the Internet which say that it should work. I'm not convinced that INDEX_RS is a thing.

            salnikov Andy Salnikov added a comment -

Christopher Stephens, I asked above for details about the Netapp storage array that you have in the RAC. Do you have access to that info, and can you share it with us?

             

            salnikov Andy Salnikov added a comment -

For the record - Chris suggested a couple of other hints to try with the DiaObjectLast table:

• /*+ dynamic_sampling("X15_DiaObjectLast" 4) */
            • /*+ cardinality("X15_DiaObjectLast" 10000000) */

Neither of these changed the optimizer behavior. For the latter we also tried a variation with a larger number of records:

            • /*+ cardinality("X15_DiaObjectLast" 1000000000) */

and an extra hint:

• /*+ FIRST_ROWS_1 cardinality("X15_DiaObjectLast" 1000000000) */

            without any improvement.

            salnikov Andy Salnikov added a comment -

To summarize the current situation with the DiaObjectLast table:

• With a freshly initialized schema the optimizer prefers a (FAST) FULL INDEX SCAN, which is significantly worse than the INDEX RANGE SCAN plan.
• It looks like the optimizer needs a significant volume of data in the table before it switches to a more efficient plan; I estimate something like 10-20 million rows. I think stats collection has to be enabled for that too.
• We failed to find a way to force Oracle to lock into a better plan using query hints.
• IOT works reasonably well if the table has a small number of columns; I think this is what we want for production.

I think I'm going to close the ticket for now and switch to studying DiaSource performance on a new ticket.

Christopher Pond, if you manage to find specs for the storage array, please add them here even if the ticket is closed.

            salnikov Andy Salnikov made changes -
            Story Points 16 20
            salnikov Andy Salnikov added a comment -

            Closing. Nothing to review, but please add anything relevant if you think it belongs here.

            salnikov Andy Salnikov made changes -
            Resolution Done [ 10000 ]
            Status In Progress [ 3 ] Done [ 10002 ]
            salnikov Andy Salnikov made changes -
            Link This issue relates to DM-15476 [ DM-15476 ]

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              vaikunth Vaikunth Thukral
              Watchers:
              Andy Salnikov, Christopher Pond, Christopher Stephens, Colin Slater, Fritz Mueller, Vaikunth Thukral
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.