# Test Cassandra APDB implementation with finer partitioning


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels:
• Story Points: 12
• Sprint: DB_S20_02
• Team: Data Access and Database
• Urgent?: No

#### Description

This is a continuation of the tests from DM-23604. For the next round of tests I want to try the approach outlined in that ticket:

• use finer partitioning; MQ3C level=10 could be optimal
• do not filter on pixelId in queries; all filtering is done on the client side

This should reduce the number of queries that we send to Cassandra and hopefully solve the scaling issues that we saw.
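
A minimal sketch of what this approach could look like on the client side, assuming the cassandra-driver and lsst.sphgeom packages; the table name (DiaSource), partition column (apdb_part), coordinate columns (ra, decl), keyspace name, and contact point are hypothetical placeholders, not the actual ap_proto schema:

```python
# Hedged sketch: schema names, keyspace, and contact point are assumptions.
from cassandra.cluster import Cluster
from lsst.sphgeom import LonLat, Mq3cPixelization, UnitVector3d

pixelization = Mq3cPixelization(10)  # finer partitioning: MQ3C level 10


def select_sources(session, region):
    """Return DiaSource rows overlapping ``region`` (an lsst.sphgeom.Region)."""
    rows = []
    # One simple query per spatial partition; no pixelId filter in the CQL itself.
    for begin, end in pixelization.envelope(region):
        for pixel in range(begin, end):
            rows.extend(session.execute(
                "SELECT * FROM DiaSource WHERE apdb_part = %s", (pixel,)))
    # All fine-grained (exact region) filtering happens on the client side.
    return [r for r in rows
            if region.contains(UnitVector3d(LonLat.fromDegrees(r.ra, r.decl)))]


if __name__ == "__main__":
    session = Cluster(["127.0.0.1"]).connect("apdb")
    # ``region`` would come from the visit/CCD geometry (e.g. an lsst.sphgeom.Circle).
```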

#### Attachments

1. apdb-mpi7-counter-select.png (45 kB)
2. apdb-mpi7-counter-store.png (35 kB)
3. apdb-mpi7-gc-time.png (111 kB)
4. apdb-mpi7-nb-time-select-fit.png (89 kB)
5. apdb-mpi7-nb-time-select-fsrc.png (68 kB)
6. apdb-mpi7-nb-time-select-obj.png (52 kB)
7. apdb-mpi7-nb-time-select-src.png (65 kB)
8. apdb-mpi7-nb-time-store-fit.png (90 kB)
9. apdb-mpi7-time-select.png (63 kB)
10. apdb-mpi7-time-store.png (53 kB)
11. apdb-mpi8-counter-select.png (60 kB)
12. apdb-mpi8-counter-store.png (42 kB)
13. apdb-mpi8-gc-time.png (136 kB)
14. apdb-mpi8-nb-time-select-fit.png (84 kB)
15. apdb-mpi8-nb-time-select-fsrc.png (54 kB)
16. apdb-mpi8-nb-time-select-obj.png (57 kB)
17. apdb-mpi8-nb-time-select-src.png (55 kB)
18. apdb-mpi8-nb-time-store-fit.png (88 kB)
19. apdb-mpi8-time-select.png (62 kB)
20. apdb-mpi8-time-store.png (81 kB)
21. apdb-mpi9-gc-time.png (245 kB)
22. apdb-mpi9-nb-time-select-fit.png (86 kB)
23. apdb-mpi9-nb-time-select-fsrc.png (54 kB)
24. apdb-mpi9-nb-time-select-obj.png (57 kB)
25. apdb-mpi9-nb-time-select-src.png (50 kB)
26. apdb-mpi9-nb-time-store-fit.png (78 kB)
27. apdb-mpi9-time-select.png (71 kB)
28. apdb-mpi9-time-store.png (72 kB)

#### Activity

Andy Salnikov added a comment - - edited

Before trying to run multiple Cassandra instances per host (with Docker) I decided to re-run the test with a reduced JVM memory allocation (96 GB instead of 160 GB, still with the G1 GC) and replication factor 2, which means twice as much data per node and twice as much load when storing the data. Some Cassandra options were changed too, to account for the reduced JVM memory. I only ran it for 100k visits, but in general this configuration behaved better: I did not see any timeouts at all and GC collection times were more reasonable.
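
For context, a sketch of how the replication factor 2 used in this run would be declared for the test keyspace; the keyspace name and contact point are assumptions, not taken from the actual configuration:

```python
# Hedged sketch: keyspace name and contact point are assumed.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS apdb WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 2}"
)
```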

Here are some plots from Grafana:

Number of records stored and selected; these numbers are from the client side (ap_proto) and they are consistent with the above plots:

A more interesting Grafana plot, time to store data per CCD:

The numbers are ~50% higher compared to the previous run, which is likely due to replication; still, they look significantly lower than the read time.

Time to read data (per CCD):

These look consistent with the previous run. For reading I used consistency level ONE, meaning that a response from only one replica was requested, so it is probably similar to the single-replica case (but this is not what we should do in production).
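
A sketch of what consistency level ONE looks like with the Python driver; the table and partition column names are placeholders:

```python
# Hedged sketch: DiaObjectLast / apdb_part are assumed names.
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement


def select_objects(session, pixel):
    # With CL=ONE the coordinator replies as soon as a single replica answers,
    # so read latency should be close to the single-replica case.
    stmt = SimpleStatement(
        "SELECT * FROM DiaObjectLast WHERE apdb_part = %s",
        consistency_level=ConsistencyLevel.ONE,
    )
    return list(session.execute(stmt, (pixel,)))
```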

And garbage collection time:

There are spikes here too, but on a much smaller scale.

And the plots from my notebook, with visit number on the X axis. Read time for three tables:

There are fewer outliers here compared to the previous case.

Combined fits for read time:

And combined fit for store time:

(Peculiar behavior: it seems to improve with the visit number, though I don't think it will turn negative.)

Andy Salnikov added a comment -

For the next step I want to try running multiple Cassandra instances per host and also use more reasonable replication and consistency settings. For replication I think the reasonable setting is 3, and for consistency we want level 2. Ask me if you want me to explain what that means and why.
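
For reference, with replication factor 3 a QUORUM request needs floor(3/2) + 1 = 2 replicas to acknowledge, which is the "level 2" mentioned above. A sketch of those settings with the Python driver; the keyspace name and contact point are assumptions:

```python
# Hedged sketch: keyspace name and contact point are assumed.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Every request made through this session waits for 2 of the 3 replicas.
profile = ExecutionProfile(consistency_level=ConsistencyLevel.QUORUM)
cluster = Cluster(["127.0.0.1"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()
session.execute(
    "ALTER KEYSPACE apdb WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 3}"
)
```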

Andy Salnikov added a comment - - edited

Some numbers from the latest test with Cassandra. The setup:

• a total of 14 Cassandra instances on 3 nodes (4+5+5, one per data disk), running in Docker containers
• only 3 of those were exposed to the public network and could serve as coordinators; this is not an optimal setup, a better one is where clients can connect to all instances (the limitation in our case is that Cassandra can only run on a fixed port number); see the client sketch after this list
• replication factor is set to 3 and the client-side consistency level is set to QUORUM, meaning that the coordinator node waits for responses from 2 instances before returning success to the client
• I initially allocated 32 GB to each Cassandra JVM and switched to 24 GB later (after 10 or 15k visits); there is still some paging activity happening on all machines, not terrible but non-zero
• no timeouts were noticed and it ran reasonably stably; GC time was significant but without big spikes
• as in the previous case, 100k visits were generated
• JVM/Cassandra metrics were only collected from 3 instances (the same instances that were used as coordinators)
• there was no explicit flush or compaction done at any point
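
As mentioned in the list above, clients could only reach three coordinator instances; a sketch of how that restriction might look on the client side with the Python driver (the specific addresses and keyspace name are assumptions for illustration):

```python
# Hedged sketch: coordinator addresses and keyspace name are assumed.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import WhiteListRoundRobinPolicy

coordinators = ["10.0.2.125", "10.0.2.127", "10.0.2.128"]  # the 3 exposed instances (assumed)
profile = ExecutionProfile(
    # Only the publicly reachable instances are used as coordinators.
    load_balancing_policy=WhiteListRoundRobinPolicy(coordinators),
    consistency_level=ConsistencyLevel.QUORUM,
)
cluster = Cluster(coordinators,
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("apdb")
```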

Some plots from Grafana (the counters look exactly the same, no point repeating them):

Timing for store operations on the client side for each separate table (this is per client/CCD):

The numbers look somewhat better than in the previous case.

Timing for select operations per table:

These look somewhat slower (~50%) than in the previous test.

Time for garbage collection (these are Cassandra metrics, per instance, only for 3 out of the 14 instances, but those are the busiest instances because they also do coordination):

Plots from notebooks, same values but plotted against visit number. Read time for three tables:

And combined fits for read/store times:

From the fit it looks like the total select time is slower by 60% and the select time for sources is 55% slower. This is not terrible given that our setup is very inefficient and clearly needs to be scaled better.

Andy Salnikov added a comment - - edited

And just for reference, this is the data load on the cluster after 100k visits; with replication factor 3 the effective ownership should add up to 300% across the cluster (roughly 21.4% per instance for 14 instances). About 420 GB per instance.

```
$ nodetool -u cassandra -pw ****** status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.2.128  430.76 GiB  256     21.8%             8a9cdb8e-6803-4cbf-ac8c-8d4d85ab0514  rack1
UN  10.0.2.129  419.32 GiB  256     21.2%             e96a645a-49d5-47cb-8b38-f5a8aa863e4a  rack1
UN  10.0.2.130  419.18 GiB  256     21.2%             2f047aad-311e-4206-82ed-6ebe77c4fda8  rack1
UN  10.0.2.131  427.85 GiB  256     21.6%             3064065e-fa09-4d54-b9e0-d69556609315  rack1
UN  10.0.2.132  419.8 GiB   256     21.2%             8249d921-61b7-468e-a39a-68f45ad75269  rack1
UN  10.0.2.133  412.44 GiB  256     20.8%             504440cd-db85-40dc-93ec-270a5116b4cc  rack1
UN  10.0.2.134  410.09 GiB  256     20.7%             6e155a84-80f3-490c-b894-3e73ba594429  rack1
UN  10.0.2.135  428.27 GiB  256     21.6%             3c3f3bbf-118e-451b-9b97-308f6cb40cc9  rack1
UN  10.0.2.136  438.25 GiB  256     22.1%             b39cf438-2705-4046-ada5-9f95bb6c2711  rack1
UN  10.0.2.137  444.78 GiB  256     22.5%             8d3b9c58-a3fa-47ae-a3da-8b080a06499c  rack1
UN  10.0.2.138  414.29 GiB  256     20.9%             83e6a0c3-7465-408b-bae1-9aa638e23510  rack1
UN  10.0.2.139  429.94 GiB  256     21.6%             0f68c450-2289-4544-a444-8cdba575d982  rack1
UN  10.0.2.125  431.14 GiB  256     21.7%             28392fb5-0f33-4303-96e7-de433d04c6df  rack1
UN  10.0.2.127  418.78 GiB  256     21.1%             f3f80675-99ca-43c4-b0f4-cdae05edd00c  rack1
```
Andy Salnikov added a comment - - edited

I'm going to close this ticket; there is enough data and information here already. In summary, I think that Cassandra is behaving reasonably given the tiny scale of our cluster. If we are going to use it in a production setup we have to run one Cassandra instance per host, with all of them on a public (meaning accessible to all clients) network. It looks like the GC issues can be solved by carefully adjusting GC parameters and using the modern G1 GC (and having lots of RAM). NVMe-style SSDs seem to work better, and Cassandra can manage separate disks itself without RAID complications (and I also read it can be configured to handle disk failures ~transparently).

I'm starting tests with Scylla (a C++ re-implementation of Cassandra); that will go into a separate ticket.


#### People

Assignee:
Andy Salnikov
Reporter:
Andy Salnikov
Watchers:
Andy Hanushevsky, Andy Salnikov, Colin Slater, Fritz Mueller