Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-25055

Test Cassandra APDB with three replicas without docker

    XMLWordPrintable

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Story Points:
      6
    • Sprint:
      DB_S20_02, DB_F20_06
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      This is probably the last test that I want to do with Cassandra on PDAC cluster and before we move to something more scalable.

      The only test with three replicas that was done with Cassandra was made with docker which is totally inadequate. I want to repeat this without docker but with just three nodes as it was done with Scylla.

      Need to remove Scylla now, reinstall Cassandra and restore the same config that was used for previous tests. Also I think it's better for Cassandra to split storage into individual disks as before, the issue is that storage on master02 is limited so it needs some care in assigning it.

        Attachments

          Issue Links

            Activity

            Hide
            salnikov Andy Salnikov added a comment -

            It looks like there is also per-table option to use row cache which is disabled by default. I'll try to enable it for existing tables and do another run to see if it changes anything.

            Show
            salnikov Andy Salnikov added a comment - It looks like there is also per-table option to use row cache which is disabled by default. I'll try to enable it for existing tables and do another run to see if it changes anything.
            Hide
            salnikov Andy Salnikov added a comment -

            After enabling row cache an all tables I ran ap_proto with another 10 visits. Metrics show that cache is indeed i use now, but timing on client side did not improve, so I conclude that row caching is not useful for our use case.

            Show
            salnikov Andy Salnikov added a comment - After enabling row cache an all tables I ran ap_proto with another 10 visits. Metrics show that cache is indeed i use now, but timing on client side did not improve, so I conclude that row caching is not useful for our use case.
            Hide
            salnikov Andy Salnikov added a comment -

            Here is the summary for this round of tests.

            Setup:

            • code name: cass3
            • 189k visits generated
            • replication factor 3, with three nodes it means that each node keeps whole set of data
            • it uses separate data disks (4 disks on master02, 5 disks on master03,04)
            • JVM memory limit is 128GiB on every node
            • 256 tokens allocated on each node, meaning equal load for all

            Data sizes:

            $ nodetool status
            Datacenter: datacenter1
            =======================
            Status=Up/Down
            |/ State=Normal/Leaving/Joining/Moving
            --  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
            UN  141.142.181.162  3.86 TiB   256          100.0%            c4a6f3da-2edb-4c3d-9bb0-49e646eabe62  rack1
            UN  141.142.181.163  3.86 TiB   256          100.0%            844a3b1d-ab25-4398-a158-3c1cca785977  rack1
            UN  141.142.181.129  3.86 TiB   256          100.0%            c4989690-010a-40bb-98ad-cb42f9cb3350  rack1
            

            Disk usage on master02:

            $ df -h /local_data/apdb*; du -sh /local_data/apdb*/*
            Filesystem      Size  Used Avail Use% Mounted on
            /dev/sdb1       1.4T  999G  341G  75% /local_data/apdb1
            /dev/sdd1       1.4T 1004G  336G  75% /local_data/apdb2
            /dev/sde1       1.4T  999G  342G  75% /local_data/apdb3
            /dev/sdf1       1.4T  997G  344G  75% /local_data/apdb4
            2.4M    /local_data/apdb1/commitlog
            999G    /local_data/apdb1/data
            0       /local_data/apdb1/hints
            42M     /local_data/apdb1/saved_caches
            1004G   /local_data/apdb2/data
            999G    /local_data/apdb3/data
            997G    /local_data/apdb4/data
            

            Disk usage on master03:

            $ df -h /local_data/apdb*; du -sh /local_data/apdb*/*
            Filesystem      Size  Used Avail Use% Mounted on
            /dev/nvme0n1p1  3.7T   79M  3.7T   1% /local_data/apdb1
            /dev/nvme1n1p1  3.7T  999G  2.7T  27% /local_data/apdb2
            /dev/nvme2n1p1  3.7T 1004G  2.7T  27% /local_data/apdb3
            /dev/nvme3n1p1  3.7T  999G  2.7T  27% /local_data/apdb4
            /dev/nvme4n1p1  3.7T  997G  2.7T  27% /local_data/apdb5
            4.8M    /local_data/apdb1/commitlog
            0       /local_data/apdb1/hints
            42M     /local_data/apdb1/saved_caches
            999G    /local_data/apdb2/data
            1004G   /local_data/apdb3/data
            999G    /local_data/apdb4/data
            997G    /local_data/apdb5/data
            

            Disk usage on master04:

            $ df -h /local_data/apdb*; du -sh /local_data/apdb*/*
            Filesystem      Size  Used Avail Use% Mounted on
            /dev/nvme0n1p1  3.7T   88M  3.7T   1% /local_data/apdb1
            /dev/nvme1n1p1  3.7T  999G  2.7T  27% /local_data/apdb2
            /dev/nvme2n1p1  3.7T 1004G  2.7T  27% /local_data/apdb3
            /dev/nvme3n1p1  3.7T  999G  2.7T  27% /local_data/apdb4
            /dev/nvme4n1p1  3.7T  997G  2.7T  27% /local_data/apdb5
            14M     /local_data/apdb1/commitlog
            0       /local_data/apdb1/hints
            42M     /local_data/apdb1/saved_caches
            999G    /local_data/apdb2/data
            1004G   /local_data/apdb3/data
            999G    /local_data/apdb4/data
            997G    /local_data/apdb5/data
            

            Some metrics from cassandra.

            System load for all three nodes. This shows the period when dynamic snitching was disabled and master02 was locked into serving data requests while two other nodes serving digests. With re-enabled dynamic snitching (on 6/1) load is spread more evenly:

            Read repair rate, mot of repairs happen on DiaObjectLast table, and rate drops with time. I think that this is related to the visit processing time, the longer it takes to read the data the longer is interval between reads and writes and this reduces chances of overlapping reads/writes:

            OTOH the metrics for active read repair tasks count doe not reduce with time, and this is something I do not quite understand:

            Here are standard insert/select timing from client. Select time for forced sources was severely affected by disabled dynamic snitching, it was growing faster than select time for sources. After re-enabling dynamic snitching picture reverted to a familiar ratio seen in other tests:

            And here are standard plots from notebook.

            Select time for each table:


            And combined fits (fit itself is wrong of course, don't look at it):

            Comparing this to numbers/plots from other tickets:

            • I think the numbers are consistent with Cassandra Docker test where we ran 14 instances
            • Performance of Scylla seems to be significantly better, approximately 1.5 faster in reading sources, and about twice as fast for forced sources

            Regarding Docker comparison - I thought that our Docker setup was very sub-optimal for client/server communication, apparently that was not a bottleneck. Scylla comparison is surprising too, the difference in setup is that for Scylla we combined disks into a single array. OTOH it may be just Scylla is indeed much better. I think I can do another quick tests by joining disks together again to see if it makes a difference for Cassandra.

            Show
            salnikov Andy Salnikov added a comment - Here is the summary for this round of tests. Setup: code name: cass3 189k visits generated replication factor 3, with three nodes it means that each node keeps whole set of data it uses separate data disks (4 disks on master02, 5 disks on master03,04) JVM memory limit is 128GiB on every node 256 tokens allocated on each node, meaning equal load for all Data sizes: $ nodetool status Datacenter: datacenter1 ======================= Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 141.142.181.162 3.86 TiB 256 100.0% c4a6f3da-2edb-4c3d-9bb0-49e646eabe62 rack1 UN 141.142.181.163 3.86 TiB 256 100.0% 844a3b1d-ab25-4398-a158-3c1cca785977 rack1 UN 141.142.181.129 3.86 TiB 256 100.0% c4989690-010a-40bb-98ad-cb42f9cb3350 rack1 Disk usage on master02: $ df -h /local_data/apdb*; du -sh /local_data/apdb*/* Filesystem Size Used Avail Use% Mounted on /dev/sdb1 1.4T 999G 341G 75% /local_data/apdb1 /dev/sdd1 1.4T 1004G 336G 75% /local_data/apdb2 /dev/sde1 1.4T 999G 342G 75% /local_data/apdb3 /dev/sdf1 1.4T 997G 344G 75% /local_data/apdb4 2.4M /local_data/apdb1/commitlog 999G /local_data/apdb1/data 0 /local_data/apdb1/hints 42M /local_data/apdb1/saved_caches 1004G /local_data/apdb2/data 999G /local_data/apdb3/data 997G /local_data/apdb4/data Disk usage on master03: $ df -h /local_data/apdb*; du -sh /local_data/apdb*/* Filesystem Size Used Avail Use% Mounted on /dev/nvme0n1p1 3.7T 79M 3.7T 1% /local_data/apdb1 /dev/nvme1n1p1 3.7T 999G 2.7T 27% /local_data/apdb2 /dev/nvme2n1p1 3.7T 1004G 2.7T 27% /local_data/apdb3 /dev/nvme3n1p1 3.7T 999G 2.7T 27% /local_data/apdb4 /dev/nvme4n1p1 3.7T 997G 2.7T 27% /local_data/apdb5 4.8M /local_data/apdb1/commitlog 0 /local_data/apdb1/hints 42M /local_data/apdb1/saved_caches 999G /local_data/apdb2/data 1004G /local_data/apdb3/data 999G /local_data/apdb4/data 997G /local_data/apdb5/data Disk usage on master04: $ df -h /local_data/apdb*; du -sh /local_data/apdb*/* Filesystem Size Used Avail Use% Mounted on /dev/nvme0n1p1 3.7T 88M 3.7T 1% /local_data/apdb1 /dev/nvme1n1p1 3.7T 999G 2.7T 27% /local_data/apdb2 /dev/nvme2n1p1 3.7T 1004G 2.7T 27% /local_data/apdb3 /dev/nvme3n1p1 3.7T 999G 2.7T 27% /local_data/apdb4 /dev/nvme4n1p1 3.7T 997G 2.7T 27% /local_data/apdb5 14M /local_data/apdb1/commitlog 0 /local_data/apdb1/hints 42M /local_data/apdb1/saved_caches 999G /local_data/apdb2/data 1004G /local_data/apdb3/data 999G /local_data/apdb4/data 997G /local_data/apdb5/data Some metrics from cassandra. System load for all three nodes. This shows the period when dynamic snitching was disabled and master02 was locked into serving data requests while two other nodes serving digests. With re-enabled dynamic snitching (on 6/1) load is spread more evenly: Read repair rate, mot of repairs happen on DiaObjectLast table, and rate drops with time. I think that this is related to the visit processing time, the longer it takes to read the data the longer is interval between reads and writes and this reduces chances of overlapping reads/writes: OTOH the metrics for active read repair tasks count doe not reduce with time, and this is something I do not quite understand: Here are standard insert/select timing from client. Select time for forced sources was severely affected by disabled dynamic snitching, it was growing faster than select time for sources. After re-enabling dynamic snitching picture reverted to a familiar ratio seen in other tests: And here are standard plots from notebook. Select time for each table: And combined fits (fit itself is wrong of course, don't look at it): Comparing this to numbers/plots from other tickets: I think the numbers are consistent with Cassandra Docker test where we ran 14 instances Performance of Scylla seems to be significantly better, approximately 1.5 faster in reading sources, and about twice as fast for forced sources Regarding Docker comparison - I thought that our Docker setup was very sub-optimal for client/server communication, apparently that was not a bottleneck. Scylla comparison is surprising too, the difference in setup is that for Scylla we combined disks into a single array. OTOH it may be just Scylla is indeed much better. I think I can do another quick tests by joining disks together again to see if it makes a difference for Cassandra.
            Hide
            salnikov Andy Salnikov added a comment -

            Summary of Cassandra test with 3 replicas and raid0:

            • code name: cass4
            • replication factor 3
            • 190k visits generated,
              • out of these 180k visit with read consistency QUORUM
              • last 10k visits with read consistency ONE
            • all data dists combined into single LVM raid0 volume
            • JVM memory limit is 128GiB on every node
            • 256 tokens allocated on each node, meaning equal load for all

            Data size:

            $ nodetool status
            Datacenter: datacenter1
            =======================
            Status=Up/Down
            |/ State=Normal/Leaving/Joining/Moving
            --  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
            UN  141.142.181.162  3.7 TiB    256          100.0%            84788dbf-1c78-4de2-b709-405af3dc39bd  rack1
            UN  141.142.181.163  3.69 TiB   256          100.0%            316eeb71-8a5c-4d33-808b-7f9dfef867c0  rack1
            UN  141.142.181.129  3.69 TiB   256          100.0%            5aff767a-22c7-491c-ad2d-aa488cdca6e7  rack1
            

            master02:

            # df -h /local_data/apdb1; du -sh /local_data/apdb1/*
            Filesystem                 Size  Used Avail Use% Mounted on
            /dev/mapper/VGapdb-LVapdb  5.3T  3.7T  1.6T  71% /local_data/apdb1
            4.5M    /local_data/apdb1/commitlog
            3.7T    /local_data/apdb1/data
            0       /local_data/apdb1/hints
            40M     /local_data/apdb1/saved_caches
            

            master03:

            # df -h /local_data/apdb1; du -sh /local_data/apdb1/*
            Filesystem                 Size  Used Avail Use% Mounted on
            /dev/mapper/VGapdb-LVapdb   19T  3.7T   15T  21% /local_data/apdb1
            8.6M    /local_data/apdb1/commitlog
            3.7T    /local_data/apdb1/data
            0       /local_data/apdb1/hints
            38M     /local_data/apdb1/saved_caches
            

            master04:

            # df -h /local_data/apdb1; du -sh /local_data/apdb1/*
            Filesystem                 Size  Used Avail Use% Mounted on
            /dev/mapper/VGapdb-LVapdb   19T  3.7T   15T  21% /local_data/apdb1
            17M     /local_data/apdb1/commitlog
            3.7T    /local_data/apdb1/data
            0       /local_data/apdb1/hints
            37M     /local_data/apdb1/saved_caches
            

            Grafana plots, as I mentioned above last 10k visits were run with consistency level "ONE" for reading (and "QUORUM" for writing), the effect is clearly visible on these plots, reading is significantly faster:

            Notebook plots, read time for each table, again the effect is clear:


            Fit of the read times for initial 180k events (with QUORUM reads):

            and for the last 10k events for ONE reads:

            I have also tried to look at query tracing information, which is extremely difficult to interpret due to very large number of clients. But I think there is an indication that master02 is a botlleneck when reading data, in some queries that I saw it responds much slower than other two machines. This is probably expected due to much slower storage on master02. And I do not know if there is a way to quantify that, other numbers like latencies do not show big discrepancy.

            Overall conclusion for this round of test is that performance did not change when switching from individual disks to RAID0 setup.

            Show
            salnikov Andy Salnikov added a comment - Summary of Cassandra test with 3 replicas and raid0: code name: cass4 replication factor 3 190k visits generated, out of these 180k visit with read consistency QUORUM last 10k visits with read consistency ONE all data dists combined into single LVM raid0 volume JVM memory limit is 128GiB on every node 256 tokens allocated on each node, meaning equal load for all Data size: $ nodetool status Datacenter: datacenter1 ======================= Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 141.142.181.162 3.7 TiB 256 100.0% 84788dbf-1c78-4de2-b709-405af3dc39bd rack1 UN 141.142.181.163 3.69 TiB 256 100.0% 316eeb71-8a5c-4d33-808b-7f9dfef867c0 rack1 UN 141.142.181.129 3.69 TiB 256 100.0% 5aff767a-22c7-491c-ad2d-aa488cdca6e7 rack1 master02: # df -h /local_data/apdb1; du -sh /local_data/apdb1/* Filesystem Size Used Avail Use% Mounted on /dev/mapper/VGapdb-LVapdb 5.3T 3.7T 1.6T 71% /local_data/apdb1 4.5M /local_data/apdb1/commitlog 3.7T /local_data/apdb1/data 0 /local_data/apdb1/hints 40M /local_data/apdb1/saved_caches master03: # df -h /local_data/apdb1; du -sh /local_data/apdb1/* Filesystem Size Used Avail Use% Mounted on /dev/mapper/VGapdb-LVapdb 19T 3.7T 15T 21% /local_data/apdb1 8.6M /local_data/apdb1/commitlog 3.7T /local_data/apdb1/data 0 /local_data/apdb1/hints 38M /local_data/apdb1/saved_caches master04: # df -h /local_data/apdb1; du -sh /local_data/apdb1/* Filesystem Size Used Avail Use% Mounted on /dev/mapper/VGapdb-LVapdb 19T 3.7T 15T 21% /local_data/apdb1 17M /local_data/apdb1/commitlog 3.7T /local_data/apdb1/data 0 /local_data/apdb1/hints 37M /local_data/apdb1/saved_caches Grafana plots, as I mentioned above last 10k visits were run with consistency level "ONE" for reading (and "QUORUM" for writing), the effect is clearly visible on these plots, reading is significantly faster: Notebook plots, read time for each table, again the effect is clear: Fit of the read times for initial 180k events (with QUORUM reads): and for the last 10k events for ONE reads: I have also tried to look at query tracing information, which is extremely difficult to interpret due to very large number of clients. But I think there is an indication that master02 is a botlleneck when reading data, in some queries that I saw it responds much slower than other two machines. This is probably expected due to much slower storage on master02. And I do not know if there is a way to quantify that, other numbers like latencies do not show big discrepancy. Overall conclusion for this round of test is that performance did not change when switching from individual disks to RAID0 setup.
            Hide
            salnikov Andy Salnikov added a comment -

            Closing this as complete, no review necessary.

            Show
            salnikov Andy Salnikov added a comment - Closing this as complete, no review necessary.

              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Watchers:
              Andy Hanushevsky, Andy Salnikov, Colin Slater, Fritz Mueller
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  CI Builds

                  No builds found.