# Reproduce PDAC APDB tests with GCP


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels:
• Story Points:
4
• Sprint:
DB_S21_12
• Team:
Data Access and Database
• Urgent?:
No

#### Description

First step for testing APDB on google cloud is to reproduce what has been done on PDAC and compare numbers (and learn other things along the way).

#### Attachments

1. apdb-gcp1-nb-time-select-fit.png
90 kB
2. apdb-gcp1-nb-time-store-fit.png
61 kB

#### Activity

Andy Salnikov added a comment - edited

One potential issue is the capacity of the client farm: right now I have 3 VMs with 32 cores each (running 189 processes in total), and I see ~100% CPU load on the client side. This may be skewing the performance numbers, so it may be worth expanding the client farm with a few more VMs.

Andy Salnikov added a comment -

The test started with 3 client machines; after 25k visits I added another 3 machines. CPU load looks better: there are occasional 100% spikes, but otherwise it stays at 50-70%.

The tests are mostly running smoothly, but I have already seen one crash due to some MPI issue; it does not seem to be related to Cassandra. The crash happened at visit 29241, log file is apdb-gcp-1-20201219T211123.log.

Andy Salnikov added a comment -

That crash could have been due to me doing stupid things. VSCode Remote, for some idiotic reason, forwards all ports from all user applications and even suggests opening those ports in a browser, to which I said "hold my beer". It looks like that port was an MPI connection, and it did not like my poking around with a browser.
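A possible mitigation (an assumption on my part, not something verified in this test) is to disable VSCode Remote's automatic port forwarding in the user settings, so the editor never touches ports opened by MPI or Cassandra:

```json
{
  // Stop VSCode Remote from auto-forwarding every listening port
  // it detects on the remote host.
  "remote.autoForwardPorts": false
}
```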

Andy Salnikov added a comment -

Generated 50k visits in this setup (25k with 3 client nodes and 25k with 6 nodes). About 1TB of data on each of the server nodes:

```
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.128.0.110  991.32 GiB  256     ?                 6225fe51-f3fe-425a-aa14-e74b70a27533  rack1
UN  10.128.0.34   990.41 GiB  256     ?                 978ec347-ced4-4b5d-b0ee-879be015defb  rack1
UN  10.128.0.30   991.21 GiB  256     ?                 a809055b-1525-4524-9420-ef75843c22bd  rack1

apdb-server-1:~/project/apdb-gcloud$ df -h
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/nvme0n1p1  375G  250G  126G   67%   /data/apdb1
/dev/nvme0n2p1  375G  250G  126G   67%   /data/apdb2
/dev/nvme0n3p1  375G  248G  127G   67%   /data/apdb3
/dev/nvme0n4p1  375G  251G  124G   67%   /data/apdb4
```

General feeling is that performance is about the same as I saw on PDAC cluster, will plot some graphs soon.

Andy Salnikov added a comment -

JMX monitoring did not quite work; many variables are missing from the output files. I think my monitoring script needs an update: it expects that all variables exist at the time of its start, reads the list of existing variables then, and only monitors those. If that list expands later, the script does not notice. I need a periodic check and update of that list.
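The fix described above can be sketched as a periodic re-scan that merges newly appeared variables into the monitored set. This is only an illustration of the pattern; `discover_metrics` is a hypothetical stand-in for whatever call the real script uses to list the currently exported JMX variables:

```python
def discover_metrics():
    """Stand-in for querying the JMX endpoint for the names of the
    currently exported variables (hypothetical; the real script would
    ask the server)."""
    return discover_metrics.available

# Simulate the server's exported variables; this set can grow over time.
discover_metrics.available = {"heap.used", "read.latency"}

monitored = set()

def refresh_monitored():
    """Merge any newly appeared variables into the monitored set,
    instead of freezing the list once at startup."""
    newly_seen = set(discover_metrics()) - monitored
    monitored.update(newly_seen)
    return newly_seen

# Initial scan picks up the variables that exist at start.
refresh_monitored()

# Later the server starts exporting a new variable; a periodic
# refresh (e.g. once per polling cycle) picks it up.
discover_metrics.available |= {"write.latency"}
new_vars = refresh_monitored()
```

Calling `refresh_monitored()` on every polling cycle is cheap (a set difference) and avoids missing tables or metrics created after the script starts.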

Andy Salnikov added a comment -

Summary of the results for this round of tests

Setup:

• 3 server nodes, 64 GB RAM
• replication factor 3
• each node has 4x375GB local SSD storage
• 3 or 6 client machines with 32 vCPU each
• client-side consistency set to QUORUM for both reads and writes
• 50k visits generated; 25k visits with 3 clients and 25k with 6 clients
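For context on the consistency setting above: Cassandra defines QUORUM as a majority of replicas, so with replication factor 3 each read and write must be acknowledged by 2 replicas, and the cluster tolerates one node being down. A minimal sketch of that arithmetic:

```python
def quorum(replication_factor: int) -> int:
    # Cassandra's QUORUM is floor(RF / 2) + 1 replicas.
    return replication_factor // 2 + 1

rf = 3
q = quorum(rf)       # 2 replicas must acknowledge each read/write
tolerated = rf - q   # 1 node can be down without losing availability

# With QUORUM for both reads and writes, read + write replica counts
# exceed RF (2 + 2 > 3), so every read overlaps the latest write.
overlap = 2 * q > rf
```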

Select time as a function of visit number (attachment apdb-gcp1-nb-time-select-fit.png):

Store time (attachment apdb-gcp1-nb-time-store-fit.png):

Observations from the plots:

• performance is similar or somewhat better than what was seen at PDAC (see DM-25055)
• reading performance does not improve with an increasing number of clients
• per-visit writing performance dropped slightly when we switched from 3 to 6 client machines; I think this is because clients could submit more concurrent queries. Overall write throughput still improved, simply because we have more concurrency on the client side. (Writing performance is not critical; we need to improve reading.)
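The write observation above is just Little's law: aggregate throughput is concurrency divided by per-request latency, so doubling the client farm can raise total throughput even while each individual store gets slightly slower. The numbers below are illustrative only (not measurements from this test), assuming the 63 processes per VM mentioned earlier:

```python
def throughput(concurrent_requests: int, mean_latency_s: float) -> float:
    # Little's law: completed requests per second = concurrency / latency.
    return concurrent_requests / mean_latency_s

# Hypothetical numbers: per-visit store time rises a little when the
# client farm doubles, but aggregate throughput still grows.
t3 = throughput(3 * 63, 10.0)  # 3 client VMs
t6 = throughput(6 * 63, 12.0)  # 6 client VMs, slightly higher latency
```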

I think the results look reasonable so far; the baseline is similar to what we had on PDAC. Next I'm going to try extending the server cluster to twice the number of nodes.


#### People

Assignee:
Andy Salnikov
Reporter:
Andy Salnikov
Watchers:
Andy Salnikov