
DM-23881: Test Cassandra APDB implementation with finer partitioning


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Story Points: 12
    • Sprint: DB_S20_02
    • Team: Data Access and Database
    • Urgent?: No

      Description

      This is a continuation of the tests from DM-23604. For the next round of tests I want to try the approach outlined in that ticket:

      • use finer partitioning; MQ3C level=10 could be optimal
      • do not filter on pixelId in queries; all filtering is done on the client side

      This should reduce the number of queries that we send to Cassandra and hopefully solve the scaling issues that we saw. A sketch of the intended query pattern is below.
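
      A minimal sketch of what this could look like on the client side, assuming the lsst.sphgeom Mq3cPixelization API and the DataStax Python driver; the table and column names ("DiaObject", "apdb_part", ra/decl), the keyspace, and the contact point are placeholders, not necessarily what ap_proto uses:

      # Compute the set of level-10 MQ3C partitions overlapping a region and read
      # whole partitions from Cassandra; the spatial cut is applied on the client.
      import lsst.sphgeom as sphgeom
      from cassandra.cluster import Cluster

      pixelization = sphgeom.Mq3cPixelization(10)   # MQ3C, level 10

      def region_partitions(region):
          # All MQ3C level-10 indices whose pixels overlap the region.
          ranges = pixelization.envelope(region)
          return [idx for begin, end in ranges for idx in range(begin, end)]

      def select_objects(session, region):
          # One query per partition, selecting the whole partition; no pixelId
          # predicate in the CQL, filtering happens here on the client.
          query = 'SELECT * FROM "DiaObject" WHERE "apdb_part" = %s'
          rows = []
          for part in region_partitions(region):
              for row in session.execute(query, (part,)):
                  v = sphgeom.UnitVector3d(sphgeom.LonLat.fromDegrees(row.ra, row.decl))
                  if region.contains(v):
                      rows.append(row)
          return rows

      cluster = Cluster(["10.0.2.125"])   # placeholder contact point
      session = cluster.connect("apdb")   # placeholder keyspace
      fov = sphgeom.Circle(sphgeom.UnitVector3d(sphgeom.LonLat.fromDegrees(35.0, -4.5)),
                           sphgeom.Angle.fromDegrees(1.75))
      objects = select_objects(session, fov)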

        Attachments

        1. apdb-mpi7-counter-select.png (45 kB)
        2. apdb-mpi7-counter-store.png (35 kB)
        3. apdb-mpi7-gc-time.png (111 kB)
        4. apdb-mpi7-nb-time-select-fit.png (89 kB)
        5. apdb-mpi7-nb-time-select-fsrc.png (68 kB)
        6. apdb-mpi7-nb-time-select-obj.png (52 kB)
        7. apdb-mpi7-nb-time-select-src.png (65 kB)
        8. apdb-mpi7-nb-time-store-fit.png (90 kB)
        9. apdb-mpi7-time-select.png (63 kB)
        10. apdb-mpi7-time-store.png (53 kB)
        11. apdb-mpi8-counter-select.png (60 kB)
        12. apdb-mpi8-counter-store.png (42 kB)
        13. apdb-mpi8-gc-time.png (136 kB)
        14. apdb-mpi8-nb-time-select-fit.png (84 kB)
        15. apdb-mpi8-nb-time-select-fsrc.png (54 kB)
        16. apdb-mpi8-nb-time-select-obj.png (57 kB)
        17. apdb-mpi8-nb-time-select-src.png (55 kB)
        18. apdb-mpi8-nb-time-store-fit.png (88 kB)
        19. apdb-mpi8-time-select.png (62 kB)
        20. apdb-mpi8-time-store.png (81 kB)
        21. apdb-mpi9-gc-time.png (245 kB)
        22. apdb-mpi9-nb-time-select-fit.png (86 kB)
        23. apdb-mpi9-nb-time-select-fsrc.png (54 kB)
        24. apdb-mpi9-nb-time-select-obj.png (57 kB)
        25. apdb-mpi9-nb-time-select-src.png (50 kB)
        26. apdb-mpi9-nb-time-store-fit.png (78 kB)
        27. apdb-mpi9-time-select.png (71 kB)
        28. apdb-mpi9-time-store.png (72 kB)


            Activity

            Andy Salnikov added a comment (edited)

            Before trying to run multiple Cassandra instances per host (with Docker) I decided to re-run the test with a reduced JVM memory allocation (96GB instead of 160GB, still with the G1 GC) and replication factor 2, which means twice as much data per node and twice as much load when storing the data. Some Cassandra options were changed too, to account for the reduced JVM memory. I only ran it for 100k visits, but in general this configuration behaved better: I did not see any timeouts at all and GC collection times were more reasonable.

            Here are some plots from Grafana:

            Number of records stored and selected; these numbers are from the client side (ap_proto) and they are consistent with the above plots:

            A more interesting Grafana plot, the time to store data per CCD:

            The numbers are ~50% higher compared to the previous run, likely due to replication; still, they look significantly lower than the read time.

            Time to read data (per CCD):

            These look consistent with the previous run. For reading I used consistency level ONE, meaning that a response from only one replica was requested, so it is probably similar to the single-replica case (but this is not what we should do in production).

            And garbage collection time:

            There are spikes here too, but on a much smaller scale.

            And the plots from my notebook, with visit number on the X axis.

            Read time for three tables:



            There are fewer outliers here compared to the previous case.

            Combined fits for read time:

            And combined fit for store time:

            (Peculiar behavior: it seems to improve with visit number, though I don't think it will turn negative.)
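
            For reference, a hedged sketch of how the replication and read-consistency settings described above map onto CQL and the DataStax Python driver (the keyspace, table, column names, and contact points are placeholders):

            from cassandra import ConsistencyLevel
            from cassandra.cluster import Cluster
            from cassandra.query import SimpleStatement

            cluster = Cluster(["10.0.2.125", "10.0.2.127"])   # placeholder contact points
            session = cluster.connect("apdb")                 # placeholder keyspace

            # Replication factor 2: every partition is stored on two instances, hence
            # twice the data per node and twice the write load compared to RF=1.
            session.execute(
                "ALTER KEYSPACE apdb WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 2}"
            )

            # Reads at consistency level ONE: the coordinator returns as soon as a
            # single replica answers, so read latency stays close to the unreplicated
            # case (not what we would run in production).
            stmt = SimpleStatement('SELECT * FROM "DiaObject" WHERE "apdb_part" = %s',
                                   consistency_level=ConsistencyLevel.ONE)
            rows = session.execute(stmt, (12345,))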

            Andy Salnikov added a comment

            For the next step I want to try running multiple Cassandra instances per host and also use more reasonable replication and consistency settings. For replication I think the reasonable setting is 3, and for consistency we want level 2. Ask me if you want me to explain what that means and why.
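
            To spell out the arithmetic: with replication factor 3 a quorum is floor(3/2) + 1 = 2 replicas, so "level 2" corresponds to QUORUM. A minimal sketch with the DataStax Python driver (keyspace name and contact point are placeholders):

            from cassandra import ConsistencyLevel
            from cassandra.cluster import Cluster

            cluster = Cluster(["10.0.2.125"])   # placeholder contact point
            session = cluster.connect()

            # Replication factor 3: every partition is stored on three instances.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS apdb WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 3}"
            )

            # QUORUM with RF=3 waits for 2 of the 3 replicas on both reads and writes,
            # which guarantees that a read overlaps the latest write while still
            # tolerating one unavailable replica.
            session.default_consistency_level = ConsistencyLevel.QUORUM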

            Andy Salnikov added a comment (edited)

            Some numbers from the latest test with Cassandra. The setup:

            • total of 14 Cassandra instances on 3 nodes (4+5+5, one per data disk) running in Docker containers
            • only 3 of those were exposed to the public network and could serve as coordinators; this is not an optimal setup, a better one is when clients can connect to all instances (the limitation in our case is that Cassandra can only run on a fixed port number); see the client sketch after this list
            • replication factor is set to 3 and the client-side consistency level is set to QUORUM, meaning that the coordinator node waits for responses from 2 replicas before returning success to the client
            • I initially allocated 32GB to each Cassandra JVM and switched to 24GB later (after 10 or 15k visits); there is still some paging activity happening on all machines, not terrible but non-zero
            • no timeouts were noticed and it worked reasonably stably; GC timing was significant but without big spikes
            • as in the previous case, 100k visits were generated
            • JVM/Cassandra metrics were only collected from 3 instances (the same instances that were used as coordinators)
            • there was no explicit flush or compaction done at any point
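
            A sketch of what the client configuration for this setup could look like with the DataStax Python driver, pinning the driver to the three publicly exposed coordinators through a whitelist load-balancing policy (the addresses and keyspace name are placeholders):

            from cassandra import ConsistencyLevel
            from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
            from cassandra.policies import WhiteListRoundRobinPolicy

            coordinators = ["10.0.2.125", "10.0.2.126", "10.0.2.127"]   # placeholder addresses

            profile = ExecutionProfile(
                # Only the exposed instances are allowed as coordinators for requests.
                load_balancing_policy=WhiteListRoundRobinPolicy(coordinators),
                # RF=3 + QUORUM: coordinator waits for 2 replicas before acknowledging.
                consistency_level=ConsistencyLevel.QUORUM,
            )
            cluster = Cluster(coordinators,
                              execution_profiles={EXEC_PROFILE_DEFAULT: profile})
            session = cluster.connect("apdb")   # placeholder keyspace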

            Some plots from Grafana (the counters look exactly the same as before, so no point repeating them):

            Timing for store operations on the client side for each separate table (this is per client/CCD):

            The numbers look somewhat better than in the previous case.

            Timing for select operations per table:

            These look somewhat slower (~50%) than in the previous test.

            Time for garbage collection (these are Cassandra metrics, per instance, only for 3 out of the 14 instances, but they are the busiest instances because they also do coordination):

            Plots from the notebook, same values but plotted against visit number.

            Read time for three tables:


            And combined fits for read/store times:

            From the fit it looks like the total select time is slower by 60% and the select time for sources is 55% slower. This is not terrible given that our setup is very inefficient and clearly needs to be scaled better.

            Andy Salnikov added a comment (edited)

            And just for reference, this is the data load for the cluster after 100k visits; with replication factor 3 the total "Owns" percentage should be 300%. About 420 GiB per instance.

            $ nodetool -u cassandra -pw ****** status
            Datacenter: datacenter1
            =======================
            Status=Up/Down
            |/ State=Normal/Leaving/Joining/Moving
            --  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
            UN  10.0.2.128  430.76 GiB  256          21.8%             8a9cdb8e-6803-4cbf-ac8c-8d4d85ab0514  rack1
            UN  10.0.2.129  419.32 GiB  256          21.2%             e96a645a-49d5-47cb-8b38-f5a8aa863e4a  rack1
            UN  10.0.2.130  419.18 GiB  256          21.2%             2f047aad-311e-4206-82ed-6ebe77c4fda8  rack1
            UN  10.0.2.131  427.85 GiB  256          21.6%             3064065e-fa09-4d54-b9e0-d69556609315  rack1
            UN  10.0.2.132  419.8 GiB  256          21.2%             8249d921-61b7-468e-a39a-68f45ad75269  rack1
            UN  10.0.2.133  412.44 GiB  256          20.8%             504440cd-db85-40dc-93ec-270a5116b4cc  rack1
            UN  10.0.2.134  410.09 GiB  256          20.7%             6e155a84-80f3-490c-b894-3e73ba594429  rack1
            UN  10.0.2.135  428.27 GiB  256          21.6%             3c3f3bbf-118e-451b-9b97-308f6cb40cc9  rack1
            UN  10.0.2.136  438.25 GiB  256          22.1%             b39cf438-2705-4046-ada5-9f95bb6c2711  rack1
            UN  10.0.2.137  444.78 GiB  256          22.5%             8d3b9c58-a3fa-47ae-a3da-8b080a06499c  rack1
            UN  10.0.2.138  414.29 GiB  256          20.9%             83e6a0c3-7465-408b-bae1-9aa638e23510  rack1
            UN  10.0.2.139  429.94 GiB  256          21.6%             0f68c450-2289-4544-a444-8cdba575d982  rack1
            UN  10.0.2.125  431.14 GiB  256          21.7%             28392fb5-0f33-4303-96e7-de433d04c6df  rack1
            UN  10.0.2.127  418.78 GiB  256          21.1%             f3f80675-99ca-43c4-b0f4-cdae05edd00c  rack1
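
            A quick cross-check of these numbers (per-instance loads copied from the nodetool output above):

            # The "Owns" column should sum to ~300% with replication factor 3, and the
            # total on-disk load divided by 3 gives the unique data volume after 100k visits.
            loads_gib = [430.76, 419.32, 419.18, 427.85, 419.80, 412.44, 410.09,
                         428.27, 438.25, 444.78, 414.29, 429.94, 431.14, 418.78]
            total_gib = sum(loads_gib)            # ~5945 GiB on disk across 14 instances
            unique_tib = total_gib / 3 / 1024     # ~1.9 TiB of unique data at RF=3
            print(f"total: {total_gib:.0f} GiB, unique: {unique_tib:.2f} TiB")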
             

             

            Andy Salnikov added a comment (edited)

            I'm going to close this ticket; there is enough data and information here already. In summary, I think Cassandra is behaving reasonably given the tiny scale of our cluster. If we are going to use it in a production setup we have to run one Cassandra instance per host, with all of them on a public (meaning accessible to all clients) network. It looks like the GC issues can be solved by carefully adjusting GC parameters and using the modern G1 GC (and having lots of RAM). NVMe-style SSDs seem to work better, and Cassandra can manage separate disks itself without RAID complications (and I also read it can be configured to handle disk failures ~transparently).

            I'm starting tests with Scylla (a C++ re-implementation of Cassandra); that will go into a separate ticket.


              People

              Assignee: Andy Salnikov
              Reporter: Andy Salnikov
              Watchers: Andy Hanushevsky, Andy Salnikov, Colin Slater, Fritz Mueller
              Votes: 0

