# Test more realistic setup of APDB with Cassandra


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Story Points: 15
• Sprint: DB_F19_07, DB_F19_10, DB_S20_12, DB_S20_01, DB_S20_02
• Team: Data Access and Database

#### Description

Initial tests with Cassandra (DM-19536) showed that the Cassandra data model may be usable, but performance on a single-node "cluster" with a spinning disk was very far from the goal numbers. To produce reasonable performance numbers I think we need a more realistic setup with more than one node (or more than 3 if we want to test replication), plenty of memory on each node, and locally-attached SSD storage.

I'm not at all sure where or when we can get access to this sort of hardware. Cloud options seem expensive, and buying this sort of hardware is not in the current purchase plan. So this ticket is blocked for now by hardware (in-)accessibility.

An important point for when we are ready to test: the prototype needs to produce more realistic data for all DIA tables. Cassandra only stores the columns that are actually specified in a query, so we'll need a realistic payload to measure performance.
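A minimal sketch of what "realistic payload" generation could look like. The column names and types below are illustrative stand-ins, not the real APDB schema; the actual table definitions have many more columns.

```python
import random
import string

# Hypothetical column specs; the real DIA table schemas define many more
# columns. The point is that every column gets a non-NULL value, since
# Cassandra stores only the columns actually written.
DIA_OBJECT_COLUMNS = {
    "diaObjectId": "long",
    "ra": "double",
    "decl": "double",
    "psFluxMean": "double",
    "flags": "long",
}

def random_value(col_type: str):
    """Generate a random value of the given (simplified) column type."""
    if col_type == "double":
        return random.uniform(-1e3, 1e3)
    if col_type == "long":
        return random.getrandbits(63)
    if col_type == "blob":
        return bytes(random.getrandbits(8) for _ in range(64))
    return "".join(random.choices(string.ascii_letters, k=8))

def random_row(columns: dict) -> dict:
    """Build one fully-populated row for a table."""
    return {name: random_value(t) for name, t in columns.items()}
```

This fills every column with random data, which is what the prototype ended up doing; note that random payloads compress poorly compared to real data (see the compression remark in the comments below).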

#### Attachments

1. dm-20580-select_real.png (90 kB)
2. dm-20580-store_real.png (89 kB)
3. dm-20580-store_real-select_real.png (100 kB)
4. dm-20580-visit_real.png (77 kB)

#### Activity

Andy Salnikov added a comment -

One clear conclusion from these tests is that we need some serious monitoring of what is happening on the server side. Cassandra exposes a lot of monitoring info via JMX; I need to learn how to use that alien technology.

Andy Salnikov added a comment -

I think this round of tests gave me some ideas for what to do next. Before I close the ticket I want to summarize them:

• there is unexpected/unexplained behavior, with writing taking longer than reading; we need to understand this
• reading time is not terrible for 1 month of data, but scaling it to 12 months will likely need a lot more hardware
• the impact of replication needs to be understood; we want at least two replicas, and we may well need 3 to support smooth running with one replica down (I think this is how Cassandra replication works if you need reliable consistency)
• for all of the above we need to monitor things on the server side, and for that I need to learn the JMX tooling
• I may need to rethink month-based partitioning and switch to month-based tables (a.k.a. manual partitioning) instead
• I'm counting on help from Andy Hanushevsky for some of these tasks
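The "manual partitioning" idea above can be sketched very simply: instead of carrying a month in the partition key, each write is routed to a per-month table. The naming scheme below is an assumption for illustration, not the prototype's actual convention.

```python
from datetime import datetime, timezone

def month_table(base: str, ts: datetime) -> str:
    """Map a timestamp to a per-month table name, e.g. DiaSource_2020_01.

    With one table per month, dropping an expired month becomes a cheap
    DROP TABLE instead of a mass delete inside a partition.
    """
    return f"{base}_{ts.year:04d}_{ts.month:02d}"

# Example: a source observed in January 2020 goes to DiaSource_2020_01.
name = month_table("DiaSource", datetime(2020, 1, 15, tzinfo=timezone.utc))
print(name)  # DiaSource_2020_01
```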
Andy Salnikov added a comment -

One more comment before I close it: from the timing plots I see that saving DiaObject takes much longer than saving DiaSource, and naively this difference does not make much sense. The number of rows stored in each should be about the same, and both tables are rather wide; the differences between them are how they are partitioned and their schemas. DiaObject contains a few BLOBs which DiaSource does not have, and DiaObject has one column which is always NULL (I have not removed validityEnd from the schema yet). We may need to play with the schema a bit to understand this. Another thing to understand is the client CPU time spent on data transformation; I'm sure the way I do things now may not be super-efficient.
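To separate client-side transformation cost from the actual write, one option is simple per-phase timing around the store calls. This is a generic sketch, not the prototype's instrumentation; the phase names are made up for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per phase, in seconds.
timings = defaultdict(float)

@contextmanager
def timed(phase: str):
    """Accumulate elapsed wall-clock time under the given phase name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] += time.perf_counter() - start

# Time the data transformation separately from the database write:
with timed("DiaObject.transform"):
    rows = [{"id": i} for i in range(1000)]  # stand-in for real conversion
with timed("DiaObject.store"):
    pass  # the actual session.execute(...) calls would go here
```

Comparing `DiaObject.transform` against `DiaObject.store` would show whether the client-side conversion is a significant fraction of the total save time.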

Andy Salnikov added a comment -

All code is on the u/andy-slac/cassandra-2 branches; I self-reviewed my small additions on this ticket (filling all columns with random numbers) and merged. Closing now.

Andy Salnikov added a comment - edited

One more data point: the size of the data on disk after 30k visits:

```
[salnikov@lsst-qserv-master04 apdb-pdac]$ nodetool -u cassandra -pw massandra status apdb
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load        Tokens  Owns (effective)  Host ID                               Rack
UN  141.142.181.162  594.74 GiB  256     39.0%             58c53860-a500-4e5a-904e-44d3b94ecbde  rack1
UN  141.142.181.163  693.87 GiB  256     45.7%             cc68614d-3312-4e37-8386-6984b53dbbe7  rack1
UN  141.142.181.129  231.88 GiB  96      15.4%             e11dce7c-b485-4474-af9a-6c349b4f09c7  rack1
```

Which adds up to approximately 1.5 TB of data (Load). Data size on disk after full forced compaction and cleaning snapshots:

```
[salnikov@lsst-qserv-master02 ~]$ du -sk /local_data/apdb1/data/
239074676  /local_data/apdb1/data/

[salnikov@lsst-qserv-master03 ~]$ du -sk /local_data/apdb*/data/apdb
152790456  /local_data/apdb2/data/apdb
153664448  /local_data/apdb3/data/apdb
152353444  /local_data/apdb4/data/apdb
153704892  /local_data/apdb5/data/apdb

[salnikov@lsst-qserv-master04 apdb-pdac]$ du -sk /local_data/apdb*/data/apdb
178115000  /local_data/apdb2/data/apdb
174317276  /local_data/apdb3/data/apdb
179012188  /local_data/apdb4/data/apdb
182579792  /local_data/apdb5/data/apdb
```

Altogether this makes 1.493 TB of data (we only have one replica).
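As a cross-check, summing the du -sk numbers above (which are in KiB) reproduces the quoted total:

```python
# Sizes reported by du -sk above, in KiB, per node and data directory.
du_kib = [
    239074676,                                   # master02, apdb1
    152790456, 153664448, 152353444, 153704892,  # master03, apdb2-apdb5
    178115000, 174317276, 179012188, 182579792,  # master04, apdb2-apdb5
]
total_gib = sum(du_kib) / 1024**2  # KiB -> GiB
print(f"{total_gib:.1f} GiB")      # about 1493 GiB, i.e. the quoted ~1.493 TB
```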

Sizes per table (GiB) reported by nodetool on each host:

```
Table             master02   master03   master04      Total
-----------------------------------------------------------
DiaForcedSource       8.55      21.67      25.43      55.65
DiaObject           180.64     465.15     541.41   1,187.19
DiaObjectLast         2.54       6.52       7.59      16.64
DiaSource            35.72      90.80     106.52     233.04
-----------------------------------------------------------
Total               227.44     584.14     680.95   1,492.52
```

Nodetool reports a ~90% compression ratio for the large tables; this is probably because all the data there is random, and we could probably get better compression with actual data.

Here is the table with the number of records per table and per node:

```
Table               master02       master03       master04          Total
-------------------------------------------------------------------------
DiaForcedSource  286,578,627    726,639,675    852,651,695  1,865,869,997
DiaObject        287,151,641    739,427,854    860,651,678  1,887,231,173
DiaObjectLast     26,801,281     68,897,633     80,246,248    175,945,162
DiaSource         68,432,578    173,964,223    204,076,870    446,473,671
-------------------------------------------------------------------------
Total            668,964,127  1,708,929,385  1,997,626,491  4,375,520,003
```
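The record counts can be cross-checked the same way, and combined with the ~1,492.52 GiB total they give a rough compressed on-disk size per row (a derived estimate, not a number reported by nodetool):

```python
# Per-node record counts from the table above.
counts = {
    "DiaForcedSource": (286_578_627, 726_639_675, 852_651_695),
    "DiaObject":       (287_151_641, 739_427_854, 860_651_678),
    "DiaObjectLast":   (26_801_281, 68_897_633, 80_246_248),
    "DiaSource":       (68_432_578, 173_964_223, 204_076_870),
}
total_rows = sum(sum(per_node) for per_node in counts.values())
print(total_rows)  # 4375520003, matching the table's grand total

# Rough average compressed size per row, using the 1,492.52 GiB total.
bytes_per_row = 1_492.52 * 1024**3 / total_rows
print(round(bytes_per_row))  # roughly 366 bytes per row on disk
```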


#### People

• Assignee: Andy Salnikov
• Reporter: Andy Salnikov
• Watchers: Andy Salnikov, Colin Slater, Fritz Mueller