Data Management / DM-23604

Implement cassandra monitoring for APDB tests


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Story Points: 12
    • Sprint: DB_S20_02
    • Team: Data Access and Database
    • Urgent?: No

      Description

      We need to understand what is happening on the server side in Cassandra when it serves APDB queries. Cassandra provides a wide range of monitoring metrics (http://cassandra.apache.org/doc/latest/operating/metrics.html), and they are accessible with standard Java JMX tools and with Cassandra-specific tools (the nodetool command uses JMX to query/control Cassandra nodes).

      Some research is needed to understand whether there is an existing tool that we can use to gather the statistics we need, or whether we need to implement something Java-based ourselves.

        Attachments

          Issue Links

            Activity

            salnikov Andy Salnikov added a comment -

            I restarted testing after implementing everything in the list (the 1-month forced photometry cutoff was already there, actually) and using a more realistic camera geometry with 189 tiles, which means that I only need eight 24-core machines. Collecting statistics now; I will look at the results after a couple of months of data has been collected. I also decided for this round of tests to spread the load equally between the three machines; disk space on master02 should be enough for maybe 6 months or longer (without replication).

            salnikov Andy Salnikov added a comment -

            After restarting my test I see some random timeouts on the Cassandra side even with a small number of visits. I believe these are due to scaling issues that result in a significantly larger number of queries in my current setup:

            • the number of clients grew from 81 to 189, which is a significant increase
            • probably more important is that I now have a table per month for the DiaSource and DiaForcedSource tables, which means that I have to run 12 times more queries to collect 12 months of data
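
            Taken together (a rough back-of-the-envelope combination of the two factors above, not a measured number): 189/81 ≈ 2.3x more clients times ~12x more per-month-table queries is roughly a 28x increase in read-query volume against the DiaSource and DiaForcedSource tables.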

            Even for the DiaObjectLast table the number of select queries per client/visit is significant, at the level of ~50 on average. This is mostly because I try to minimize the amount of data returned by these queries, and for that I have to run a number of high-granularity queries. This is how it works today:

            • the tables have pixelId (=htm20) as the primary index
            • they are also partitioned based on the htm8 index
            • I have to write queries that limit the number of partitions, and in addition, from each partition I want to select only the records that fall into a CCD region, not the whole partition
            • to do that I represent the CCD region on the sky as a set of htm20 intervals enveloping the CCD; the size of that set is limited to 64 intervals, but the actual number can be lower
            • for each interval in the set I find the interval of partitioning (htm8) indices overlapping it, then for each partition I add a query "SELECT ... FROM Table WHERE apdb_part = :htm8 AND pixelId >= :htm20_start AND pixelId <= :htm20_end" (see the sketch below)
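
            A minimal sketch of that fan-out, assuming the standard hierarchical HTM numbering (the level-8 ancestor of a level-20 trixel is its index shifted right by 2 * 12 bits); the table name and the interval list are illustrative stand-ins, not the actual ap_proto code:

            # htm20 -> htm8: each HTM level adds two bits to the index
            LEVEL_DIFF_BITS = 2 * (20 - 8)

            def queries_for_ccd(htm20_intervals, table="DiaObjectLast"):
                """Build one range query per (htm8 partition, htm20 interval) pair.

                ``htm20_intervals`` is the list (at most 64 entries) of inclusive
                (start, end) htm20 index pairs enveloping the CCD region.
                """
                queries = []
                for lo, hi in htm20_intervals:
                    # partitioning (htm8) indices overlapped by this htm20 interval
                    for part in range(lo >> LEVEL_DIFF_BITS, (hi >> LEVEL_DIFF_BITS) + 1):
                        queries.append((
                            f"SELECT * FROM {table}"
                            " WHERE apdb_part = %s AND pixelId >= %s AND pixelId <= %s",
                            (part, lo, hi),
                        ))
                return queries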

            There may be small optimizations possible here, but I also want to try a different approach:

            • the htm triangle shape does not match square CCDs very well; we may want to consider q3c for partitioning
            • to reduce the number of queries drastically I could just retrieve all records from all partitions overlapping a CCD; that reduces the number of queries to the number of partitions involved (and it can also be run as one query using the "apdb_part IN (:part1, :part2, ...)" syntax, which may have the same cost on the server side but is still one query for the client; see the sketch after this list)
            • depending on the relative sizes of the CCD and a partition this can return much more data than needed; with smaller partitions the overhead will be lower, and I need to check typical numbers. I think 2x or 3x overhead is still acceptable if performance does not suffer
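
            For comparison, a sketch of this coarser alternative: one query per CCD pulling whole partitions, with records outside the CCD dropped on the client afterwards (again just an illustration with a placeholder table name, not the ap_proto implementation):

            def whole_partition_query(partitions, table="DiaObjectLast"):
                """Build a single query selecting every record from the partitions
                whose pixels overlap the CCD region; records outside the CCD then
                have to be filtered out on the client side."""
                placeholders = ", ".join(["%s"] * len(partitions))
                return (f"SELECT * FROM {table} WHERE apdb_part IN ({placeholders})",
                        tuple(partitions))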

            I want to run my partitioning simulator again and add a partition area calculator to the mix to see what the optimal values could be.

            salnikov Andy Salnikov added a comment - edited

            I have re-run my partitioning simulator, now also calculating pixel area. More precisely, for a CCD-sized square region located randomly on the sky sphere I calculate two values:

            • the number of pixels that overlap this CCD region
            • total area of those pixels

            This is done for the three pixelization schemes that exist now in sphgeom and for a number of pixelization levels. Here are the distributions for these two values (# pixels on the left, area on the right):

            (also https://github.com/lsst-dm/l1dbproto-notebooks/blob/master/DM-23604%20Optimize%20partitioning.ipynb)

            My original idea was that we want to reduce the number of pixels per tile/CCD while also trying to map different tiles to different partitions, so previously I decided on htm8 for partitioning. Because the area in that case would be large (it is not included on the above plot, but the average is about 0.38 deg^2) I also had to do additional filtering based on pixelId/htm20, and that caused the scaling issues. The new idea is to avoid the htm20 filtering and select whole partitions, but to make the partitions smaller while keeping the number of partitions reasonable. From the above plots I think that level 10 is probably the best compromise between overhead (area is ~0.10 deg^2, or about 2x the CCD size) and the number of pixels (around 20 on average). From the distributions I think that mq3c should work better than htm, so this is probably what I want to use for the next round of tests.
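
            To make the two quantities concrete, here is a minimal sketch of what is computed for one randomly placed region, using lsst.sphgeom. It is only an approximation of the simulator: a circle of CCD-equivalent area stands in for the square CCD, and the total area is estimated from the average pixel area at the chosen level rather than from the actual pixel shapes; the 0.05 deg^2 CCD footprint is an assumed value.

            import math
            import random

            from lsst.sphgeom import Angle, Circle, LonLat, Mq3cPixelization, UnitVector3d

            CCD_AREA_DEG2 = 0.05   # rough CCD footprint on the sky (assumed)
            LEVEL = 10             # candidate partitioning level

            # circle of CCD-equivalent area at a point uniformly distributed on the sphere
            ra = random.uniform(0.0, 360.0)
            dec = math.degrees(math.asin(random.uniform(-1.0, 1.0)))
            radius_deg = math.sqrt(CCD_AREA_DEG2 / math.pi)
            region = Circle(UnitVector3d(LonLat.fromDegrees(ra, dec)),
                            Angle.fromDegrees(radius_deg))

            pixelization = Mq3cPixelization(LEVEL)
            # envelope() returns a (slightly conservative) set of pixel index ranges
            n_pixels = sum(end - begin for begin, end in pixelization.envelope(region))

            # mq3c has 6 * 4**LEVEL pixels covering the whole sky (~41253 deg^2)
            avg_pixel_area = 4.0 * math.pi * (180.0 / math.pi) ** 2 / (6 * 4 ** LEVEL)
            print(n_pixels, "pixels, ~%.3f deg^2" % (n_pixels * avg_pixel_area))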

            salnikov Andy Salnikov added a comment -

            Before I move on to reimplementing ap_proto with the ideas above, I want to summarize the last round of tests so that we can compare things.

            Here is the setup:

            • all Dia tables are partitioned based on the htm8 index
            • the DiaSource and DiaForcedSource tables are "manually partitioned" into 30-day intervals; there is a separate table (e.g. DiaSource_608) for each interval, and when querying we need to send queries to the 12-13 most recent tables
            • the geometry has 189 CCDs
            • I do not make DiaObjects any more for forced photometry on missing DiaSources
            • the cutoff for forced photometry is 30 days since the last observation of a DiaSource
            • I did not do manual compaction for this test; it takes a long time and I did not see any change due to compaction in previous tests
            • about 35k visits were generated in this round
            • ap_proto assumes on average 10 hours of observation time per night with 45 seconds per visit, making 800 visits per night. The exact number mostly does not matter, as most results so far are presented as a function of the number of visits. Some values do depend on that 800-per-night number, e.g. the forced photometry cutoff is at 24,000 visits in ap_proto terms
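
            For reference, the arithmetic behind those numbers: (10 hours × 3600 s) / 45 s per visit = 800 visits per night, and the 30-day forced photometry cutoff is then 30 × 800 = 24,000 visits.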

            As I already mentioned, for this run I observe a significant number of timeouts; these are most likely due to the large number of clients and the number of queries caused by the per-month tables (and the very small number of servers). Timeouts appear randomly, and when there are no timeouts things seem to be working reasonably OK.

            Here is a bunch of plots from Grafana.

            Write latency for each table (all latencies are measured on the server side):

            The interesting thing here is that the average latency for DiaObject is significantly lower than what I saw in the previous test (it was ~200 msec before).

            Read latency for each table:

            An interesting observation here is that the master02 latency is higher than for the other two nodes, and there are also a lot of fluctuations on that node.

            Timing for inserts (timing is measured on the client side):

            and for selects:

            Select time for the DiaSource tables looks worse than in previous tests; it rises to ~4 sec after 30k visits, which does not look good (though some fraction of that time is CPU time on the client).

            Counters for the number of queries sent by each of the 189 clients:

            Number of records retrieved from database for each client:

            Number of records stored (per client/CCD):

            On the last plot the number of forced sources levels off after one month (24k visits).

            On these plots things are generally smoothed by Grafana; looking at narrower intervals there is a lot more variation in latencies, for example.

            salnikov Andy Salnikov added a comment -

            I think it's time to switch to a new ticket for the next round of tests; I'm closing this one and opening DM-23881.


              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Watchers:
              Andy Hanushevsky, Andy Salnikov, Colin Slater, Fritz Mueller

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.