Data Management / DM-28136

Reproduce PDAC APDB tests with GCP


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Story Points:
      4
    • Sprint:
      DB_S21_12
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      The first step in testing APDB on Google Cloud is to reproduce what has been done on PDAC and compare the numbers (and learn other things along the way).

            Activity

            salnikov Andy Salnikov added a comment - edited

            One potential issue is the capacity of the client farm: right now I have 3 VMs with 32 cores each (running 189 processes) and I see ~100% CPU load on the client side. This may be skewing the performance numbers, so it may be worth expanding the client farm with a few more VMs.

            salnikov Andy Salnikov added a comment

            The test started with 3 client machines; after 25k visits I added another 3. CPU load looks better: there are occasional 100% spikes, but otherwise it stays at 50-70%.

            Tests are mostly running smoothly, but I have already seen one crash due to some MPI issue; it does not seem to be related to Cassandra. The crash happened at visit 29241, log file is apdb-gcp-1-20201219T211123.log.

            salnikov Andy Salnikov added a comment

            That crash could have been caused by me doing something stupid. VSCode Remote, for some idiotic reason, forwards all ports from all user applications and even suggests opening those ports in a browser, to which I said "hold my beer". It looks like that port was an MPI connection, and it did not like my poking around with a browser.

            salnikov Andy Salnikov added a comment

            Generated 50k visits in this setup (25k with 3 client nodes and 25k with 6 nodes). About 1TB of data on each of the server nodes:

            $ nodetool status
            Datacenter: datacenter1
            =======================
            Status=Up/Down
            |/ State=Normal/Leaving/Joining/Moving
            --  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
            UN  10.128.0.110  991.32 GiB  256     ?                 6225fe51-f3fe-425a-aa14-e74b70a27533  rack1
            UN  10.128.0.34   990.41 GiB  256     ?                 978ec347-ced4-4b5d-b0ee-879be015defb  rack1
            UN  10.128.0.30   991.21 GiB  256     ?                 a809055b-1525-4524-9420-ef75843c22bd  rack1
            

            apdb-server-1:~/project/apdb-gcloud$ df -h
            Filesystem                Size  Used Avail Use% Mounted on
            /dev/nvme0n1p1            375G  250G  126G  67% /data/apdb1
            /dev/nvme0n2p1            375G  250G  126G  67% /data/apdb2
            /dev/nvme0n3p1            375G  248G  127G  67% /data/apdb3
            /dev/nvme0n4p1            375G  251G  124G  67% /data/apdb4
            

            The general feeling is that performance is about the same as I saw on the PDAC cluster; I will plot some graphs soon.

            salnikov Andy Salnikov added a comment

            JMX monitoring did not quite work: many variables are missing from the output files. I think my monitoring script needs an update. It expects that all variables exist at the time it starts, reads the list of existing variables, and only monitors those; if that list expands later, the script does not notice. I need a periodic check and update of that list.
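
            Something like the loop below is what I have in mind. This is just a sketch of the refresh logic, not the real monitoring script; list_variables() and read_variable() are hypothetical stand-ins for however the script actually talks to the JMX agent.

            import time

            POLL_INTERVAL = 60  # seconds between samples

            def list_variables():
                # Hypothetical stand-in: the real script queries Cassandra's
                # JMX interface; a static set is returned here only to keep
                # the sketch runnable.
                return {"ReadLatency", "WriteLatency"}

            def read_variable(name):
                # Hypothetical stand-in for reading one variable's value.
                return 0.0

            def monitor(out):
                known = set()
                while True:
                    # Re-discover the variable list on every cycle instead of
                    # only once at startup, so variables that appear later are
                    # picked up and monitored too.
                    known |= list_variables()
                    now = time.time()
                    for name in sorted(known):
                        out.write(f"{now} {name} {read_variable(name)}\n")
                    out.flush()
                    time.sleep(POLL_INTERVAL)

            # usage: monitor(open("jmx-metrics.out", "w"))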


            salnikov Andy Salnikov added a comment

            Summary of the results for this round of tests

            Setup:

            • 3 server nodes, 64 GB RAM
            • replication factor 3
            • each node has 4x375GB of local SSD storage
            • 3 or 6 client machines with 32 vCPU each
            • client-side consistency set to QUORUM for both reads and writes (see the driver sketch after this list)
            • 50k visits generated; 25k visits with 3 clients and 25k with 6 clients
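
            For reference, this is roughly how the QUORUM setting looks with the Python cassandra-driver. A minimal sketch, not the actual test harness; the keyspace name "apdb" is an assumption, and the contact points are the server addresses from the nodetool output above.

            from cassandra import ConsistencyLevel
            from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

            # QUORUM for both reads and writes: the consistency level on the
            # default execution profile applies to every statement unless a
            # statement overrides it explicitly.
            profile = ExecutionProfile(consistency_level=ConsistencyLevel.QUORUM)
            cluster = Cluster(
                ["10.128.0.110", "10.128.0.34", "10.128.0.30"],
                execution_profiles={EXEC_PROFILE_DEFAULT: profile},
            )
            session = cluster.connect("apdb")  # keyspace name is an assumption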

            Select time as a function of visit number: [plot attached]

            Store time: [plot attached]
            Observations from the plots:

            • performance is similar to, or somewhat better than, what was seen at PDAC (see DM-25055)
            • reading performance does not improve with an increasing number of clients
            • writing performance dropped slightly when we switched from 3 to 6 client machines; I think this is because the clients can submit more concurrent queries. Overall write throughput still improved, simply because we have more concurrency on the client side. (Writing performance is not critical; it is reading that we need to improve.)

            I think the results look reasonable so far; the baseline is similar to what we had on PDAC. Next I am going to try extending the server cluster to twice the number of nodes.


              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Watchers:
              Andy Salnikov
              Votes:
              0

                Dates

                Created:
                Updated:
                Resolved:
