Here is the summary for this round of tests.
Setup:
- code name: cass3
- 189k visits generated
- replication factor 3; with three nodes this means that each node keeps the whole data set
- separate data disks are used (4 disks on master02, 5 disks on master03 and master04)
- JVM memory limit is 128 GiB on every node
- 256 tokens allocated on each node, meaning roughly equal load for all nodes (a configuration sketch follows this list)
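For reference, a minimal sketch of what this setup corresponds to in Cassandra configuration. The keyspace name, replication strategy, and file paths below are illustrative assumptions, not copied from the actual test configuration:
# cassandra.yaml on every node (illustrative excerpt): num_tokens set to 256 and one
# data_file_directories entry per data disk, e.g. /local_data/apdb1/data ... /local_data/apdbN/data
# JVM memory limit: -Xms128G/-Xmx128G in conf/jvm.options (or cassandra-env.sh, depending on version)
# keyspace with replication factor 3; with three nodes every node then owns 100% of the data
$ cqlsh -e "CREATE KEYSPACE IF NOT EXISTS apdb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"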
Data sizes:
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load      Tokens  Owns (effective)  Host ID                               Rack
UN  141.142.181.162  3.86 TiB  256     100.0%            c4a6f3da-2edb-4c3d-9bb0-49e646eabe62  rack1
UN  141.142.181.163  3.86 TiB  256     100.0%            844a3b1d-ab25-4398-a158-3c1cca785977  rack1
UN  141.142.181.129  3.86 TiB  256     100.0%            c4989690-010a-40bb-98ad-cb42f9cb3350  rack1
|
Disk usage on master02:
$ df -h /local_data/apdb*; du -sh /local_data/apdb*/*
Filesystem      Size   Used  Avail Use% Mounted on
/dev/sdb1       1.4T   999G   341G  75% /local_data/apdb1
/dev/sdd1       1.4T  1004G   336G  75% /local_data/apdb2
/dev/sde1       1.4T   999G   342G  75% /local_data/apdb3
/dev/sdf1       1.4T   997G   344G  75% /local_data/apdb4
2.4M    /local_data/apdb1/commitlog
999G    /local_data/apdb1/data
0       /local_data/apdb1/hints
42M     /local_data/apdb1/saved_caches
1004G   /local_data/apdb2/data
999G    /local_data/apdb3/data
997G    /local_data/apdb4/data
|
Disk usage on master03:
$ df -h /local_data/apdb*; du -sh /local_data/apdb*/*
Filesystem      Size   Used  Avail Use% Mounted on
/dev/nvme0n1p1  3.7T    79M   3.7T   1% /local_data/apdb1
/dev/nvme1n1p1  3.7T   999G   2.7T  27% /local_data/apdb2
/dev/nvme2n1p1  3.7T  1004G   2.7T  27% /local_data/apdb3
/dev/nvme3n1p1  3.7T   999G   2.7T  27% /local_data/apdb4
/dev/nvme4n1p1  3.7T   997G   2.7T  27% /local_data/apdb5
4.8M    /local_data/apdb1/commitlog
0       /local_data/apdb1/hints
42M     /local_data/apdb1/saved_caches
999G    /local_data/apdb2/data
1004G   /local_data/apdb3/data
999G    /local_data/apdb4/data
997G    /local_data/apdb5/data
|
Disk usage on master04:
$ df -h /local_data/apdb*; du -sh /local_data/apdb*/*
Filesystem      Size   Used  Avail Use% Mounted on
/dev/nvme0n1p1  3.7T    88M   3.7T   1% /local_data/apdb1
/dev/nvme1n1p1  3.7T   999G   2.7T  27% /local_data/apdb2
/dev/nvme2n1p1  3.7T  1004G   2.7T  27% /local_data/apdb3
/dev/nvme3n1p1  3.7T   999G   2.7T  27% /local_data/apdb4
/dev/nvme4n1p1  3.7T   997G   2.7T  27% /local_data/apdb5
14M     /local_data/apdb1/commitlog
0       /local_data/apdb1/hints
42M     /local_data/apdb1/saved_caches
999G    /local_data/apdb2/data
1004G   /local_data/apdb3/data
999G    /local_data/apdb4/data
997G    /local_data/apdb5/data
|
Some metrics from Cassandra.
System load for all three nodes. This shows the period when dynamic snitching was disabled and master02 was locked into serving data requests while the two other nodes served digests. After dynamic snitching was re-enabled (on 6/1), the load is spread more evenly:
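For reference, dynamic snitching is controlled by the dynamic_snitch option in cassandra.yaml; a hedged sketch of toggling it (the config path and service name are assumptions about this particular installation):
# check the current setting (dynamic_snitch defaults to true)
$ grep dynamic_snitch /etc/cassandra/cassandra.yaml
# set "dynamic_snitch: false" to disable it, "dynamic_snitch: true" to re-enable it,
# then restart the node for the change to take effect
$ sudo systemctl restart cassandra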
Read repair rate: most of the repairs happen on the DiaObjectLast table, and the rate drops with time. I think this is related to the visit processing time: the longer it takes to read the data, the longer the interval between reads and writes, which reduces the chance of overlapping reads/writes:
OTOH the metric for the active read repair task count does not decrease with time, which is something I do not quite understand:
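A possible way to cross-check these read repair numbers directly on a node, independent of the dashboard metrics (assuming the Cassandra version used here still reports a ReadRepairStage pool in tpstats):
# cumulative read repair counters (attempted / blocking / background mismatches)
$ nodetool netstats | grep -A 3 'Read Repair'
# active/pending tasks of the read repair thread pool
$ nodetool tpstats | grep -i readrepair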
Here are the standard insert/select timings from the client. Select time for forced sources was severely affected by the disabled dynamic snitching; it was growing faster than the select time for sources. After re-enabling dynamic snitching, the picture reverted to the familiar ratio seen in other tests:
And here are the standard plots from the notebook.
Select time for each table:
And the combined fits (the fit itself is wrong, of course, so don't pay attention to it):
Comparing this to numbers/plots from other tickets:
- I think the numbers are consistent with the Cassandra Docker test where we ran 14 instances
- Scylla's performance seems to be significantly better: approximately 1.5 times faster at reading sources, and about twice as fast for forced sources
Regarding the Docker comparison: I thought that our Docker setup was very sub-optimal for client/server communication, but apparently that was not the bottleneck. The Scylla comparison is surprising too; the difference in setup is that for Scylla we combined the disks into a single array. OTOH it may simply be that Scylla is indeed much better. I think I can do another quick test, joining the disks together again, to see whether it makes a difference for Cassandra (a possible recipe is sketched below).
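A sketch of what joining the disks back into a single striped array could look like, using the data-disk partitions on master03/04 as an example; the md device name, filesystem choice, and mount point are illustrative, and the data currently on those partitions would of course be lost:
# build a RAID0 array over the data-disk partitions and put a single filesystem on it
$ sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/nvme1n1p1 /dev/nvme2n1p1 /dev/nvme3n1p1 /dev/nvme4n1p1
$ sudo mkfs.xfs /dev/md0
$ sudo mount /dev/md0 /local_data/apdb
# then point cassandra.yaml at a single data_file_directories entry, e.g. /local_data/apdb/data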
It looks like there is also a per-table option to use the row cache, which is disabled by default. I'll try to enable it for the existing tables and do another run to see if it changes anything (see the sketch below).
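The per-table knob is the caching table option, which defaults to no row caching; a minimal sketch of enabling it (the keyspace name and the rows_per_partition value are illustrative, and the exact cassandra.yaml option name may differ between Cassandra versions):
# enable the row cache for one table; rows_per_partition accepts 'ALL', 'NONE' or a number
$ cqlsh -e "ALTER TABLE apdb.\"DiaObjectLast\" WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}"
# the row cache also needs memory allocated globally in cassandra.yaml, e.g.
#   row_cache_size_in_mb: 2048   (the default of 0 keeps the row cache disabled)
# followed by a node restart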