Data Management / DM-20580

Test more realistic setup of APDB with Cassandra

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: L1 Database
    • Labels:
    • Story Points:
      15
    • Sprint:
      DB_F19_07, DB_F19_10, DB_S20_12, DB_S20_01, DB_S20_02
    • Team:
      Data Access and Database

      Description

      The initial test with Cassandra (DM-19536) showed that the Cassandra data model may be usable, but performance on a single-node "cluster" with spinning disk was very far from the goal numbers. To produce reasonable performance numbers I think we need a more realistic setup with more than one node (or more than 3 if we want to test replication), plenty of memory on each node, and locally attached SSD storage.

      I'm not at all sure where or when we can get access to this sort of hardware. Cloud options seem to be expensive, and buying this sort of hardware is not in the current purchase plan. So this ticket is blocked for now by hardware (in)accessibility.

      An important point for when we are ready to test is that the prototype should produce more realistic data for all DIA tables. Cassandra only stores columns that are specified in a query, so we'll need a realistic payload to measure performance.
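
A minimal sketch of what "realistic payload" could mean in practice: give every column a value, so that the INSERT names all of them and Cassandra actually stores them. The column list and both helper functions below are hypothetical illustrations, not the prototype's real schema or code, and the `?` placeholders assume the statement would be prepared before execution.

```python
import random

# Hypothetical, shortened column list; the real DiaSource table is much wider.
DIA_SOURCE_COLUMNS = ["diaSourceId", "ra", "decl", "psFlux", "psFluxErr"]


def random_row(columns, rng):
    """Return a dict holding a random float for every column name."""
    return {name: rng.random() for name in columns}


def make_insert(table, row):
    """Build an INSERT statement that names every column explicitly.

    Returns the query text (with one '?' placeholder per column) and the
    list of values in the same order as the column names.
    """
    names = ", ".join(f'"{name}"' for name in row)
    placeholders = ", ".join("?" for _ in row)
    query = f'INSERT INTO "{table}" ({names}) VALUES ({placeholders})'
    return query, list(row.values())


rng = random.Random(20580)  # fixed seed so that runs are repeatable
query, params = make_insert("DiaSource", random_row(DIA_SOURCE_COLUMNS, rng))
print(query)
```

Random floats are the simplest possible filler. They also compress poorly, which matters when interpreting on-disk sizes.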

            Activity

            salnikov Andy Salnikov added a comment -

            One clear conclusion from these tests is that we need some serious monitoring of what is happening on the server side. Cassandra exposes a lot of monitoring info via JMX; I need to learn how to use that alien technology.

            salnikov Andy Salnikov added a comment -

            I think this round of tests gave me some ideas for what to do next. Before I close the ticket I want to summarize some of that:

            • there is unexpected/unexplained behavior with writing time being longer than reading time; I need to understand this
            • reading time is not terrible for 1 month of data, but scaling it to 12 months will likely need a lot more hardware
            • replication impact needs to be understood; we want at least two replicas, and we will likely need 3 to support smooth running with one replica down (I think this is how Cassandra replication works if you need reliable consistency)
            • for all of the above we need to monitor things on the server side, and for that I need to learn how to use JMX tooling
            • I may need to rethink month-based partitioning and switch to month-based tables (a.k.a. manual partitioning) instead
            • I'm counting on help from Andy Hanushevsky for some of those tasks
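
The point about needing 3 replicas can be checked with a little arithmetic. This is a sketch that assumes QUORUM consistency for both reads and writes, which the comment above does not state explicitly. A Cassandra quorum is floor(RF / 2) + 1 replicas, so the number of replicas that may be down is RF minus the quorum. The function names are illustrative.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond at QUORUM consistency."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM operations still succeed."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 2, 3):
    print(f"RF={rf}: quorum={quorum(rf)}, "
          f"replicas that may be down={tolerated_failures(rf)}")
```

With two replicas the quorum is also two, so losing one replica blocks QUORUM operations. Three replicas is the smallest setup that tolerates one replica being down.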
            salnikov Andy Salnikov added a comment -

            One more comment before I close it: from the timing plots I see that saving DiaObject takes much longer than saving DiaSource, and naively this difference does not make much sense. There should not be much difference, as the number of rows stored for each of those should be about the same. Both tables are rather wide; the differences between them are their partitioning and their schema. DiaObject contains a few BLOBs, which DiaSource does not have, and DiaObject also has one column which is always NULL (I did not remove validityEnd from the schema yet). We may need to play with the schema a bit to understand that. Another thing to understand is the client CPU time spent on data transformation; I'm sure the way I do things now is not super-efficient.

            salnikov Andy Salnikov added a comment -

            All code is on the u/andy-slac/cassandra-2 branches. I self-reviewed my small additions on this ticket (filling all columns with random numbers) and merged them. Closing now.

            salnikov Andy Salnikov added a comment - - edited

            One more data point - size of the data on disk after 30k visits:

            [salnikov@lsst-qserv-master04 apdb-pdac]$ nodetool -u cassandra -pw massandra status apdb
            Datacenter: datacenter1
            =======================
            Status=Up/Down
            |/ State=Normal/Leaving/Joining/Moving
            --  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
            UN  141.142.181.162  594.74 GiB  256          39.0%             58c53860-a500-4e5a-904e-44d3b94ecbde  rack1
            UN  141.142.181.163  693.87 GiB  256          45.7%             cc68614d-3312-4e37-8386-6984b53dbbe7  rack1
            UN  141.142.181.129  231.88 GiB  96           15.4%             e11dce7c-b485-4474-af9a-6c349b4f09c7  rack1
            

            Which adds up to approximately 1.5 TB of data (Load).

            Data size on disk after a full forced compaction and cleaning of snapshots:

            [salnikov@lsst-qserv-master02 ~]$ du -sk /local_data/apdb1/data/
            239074676       /local_data/apdb1/data/
             
            [salnikov@lsst-qserv-master03 ~]$ du -sk /local_data/apdb*/data/apdb                                                                                                                        
            152790456       /local_data/apdb2/data/apdb
            153664448       /local_data/apdb3/data/apdb
            152353444       /local_data/apdb4/data/apdb
            153704892       /local_data/apdb5/data/apdb
             
            [salnikov@lsst-qserv-master04 apdb-pdac]$ du -sk /local_data/apdb*/data/apdb                                                                                                                
            178115000       /local_data/apdb2/data/apdb
            174317276       /local_data/apdb3/data/apdb
            179012188       /local_data/apdb4/data/apdb
            182579792       /local_data/apdb5/data/apdb
            

            Altogether this makes 1.493 TB of data (we only have one replica).
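
The total can be reproduced from the `du -sk` listings above. The sketch below assumes that `du -sk` reports sizes in KiB (1024-byte blocks) and converts them to GiB. The numbers are copied from the output above; the variable names are illustrative.

```python
# Sizes in KiB, copied from the `du -sk` output above.
DU_KIB = {
    "master02": [239074676],
    "master03": [152790456, 153664448, 152353444, 153704892],
    "master04": [178115000, 174317276, 179012188, 182579792],
}

# "Load" values in GiB from the `nodetool status` output, for comparison.
NODETOOL_LOAD_GIB = [594.74, 693.87, 231.88]

KIB_PER_GIB = 1024 ** 2

per_host_gib = {host: sum(sizes) / KIB_PER_GIB for host, sizes in DU_KIB.items()}
total_gib = sum(per_host_gib.values())

for host, size in per_host_gib.items():
    print(f"{host}: {size:.2f} GiB")
print(f"du total: {total_gib:.2f} GiB")
print(f"nodetool Load total before compaction: {sum(NODETOOL_LOAD_GIB):.2f} GiB")
```

Under that assumption the sum comes to roughly 1,493 GiB, which matches the quoted figure if 1 TB is read as 1000 GiB. The master02 directory is the whole `data/` directory rather than just the `apdb` keyspace, so its number may include a small amount of non-APDB data.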

            Sizes per table reported by nodetool on each host:

            Table           master02  master03  master04     Total
            ------------------------------------------------------
            DiaForcedSource     8.55     21.67     25.43     55.65
            DiaObject         180.64    465.15    541.41  1,187.19
            DiaObjectLast       2.54      6.52      7.59     16.64
            DiaSource          35.72     90.80    106.52    233.04
            ------------------------------------------------------
            Total             227.44    584.14    680.95  1,492.52
            

            Nodetool reports a compression ratio of ~90% for the large tables. This is probably because all the data there is random; we could probably get better compression with actual data.

            Here is the table with the number of records per table and node:

            Table                master02       master03       master04          Total
            --------------------------------------------------------------------------
            DiaForcedSource   286,578,627    726,639,675    852,651,695  1,865,869,997
            DiaObject         287,151,641    739,427,854    860,651,678  1,887,231,173
            DiaObjectLast      26,801,281     68,897,633     80,246,248    175,945,162
            DiaSource          68,432,578    173,964,223    204,076,870    446,473,671
            --------------------------------------------------------------------------
            Total             668,964,127  1,708,929,385  1,997,626,491  4,375,520,003
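
Combining the two tables gives a rough average on-disk size per record for each table. This sketch assumes the per-table sizes are in GiB. The comment does not state the unit, but the per-host totals agree with the `du` figures when read as GiB. The results are compressed on-disk sizes, not raw row sizes.

```python
# Totals across all three hosts, copied from the two tables above.
TOTAL_SIZE_GIB = {
    "DiaForcedSource": 55.65,
    "DiaObject": 1187.19,
    "DiaObjectLast": 16.64,
    "DiaSource": 233.04,
}
TOTAL_RECORDS = {
    "DiaForcedSource": 1_865_869_997,
    "DiaObject": 1_887_231_173,
    "DiaObjectLast": 175_945_162,
    "DiaSource": 446_473_671,
}

BYTES_PER_GIB = 1024 ** 3  # assumption: sizes above are GiB

bytes_per_record = {
    table: TOTAL_SIZE_GIB[table] * BYTES_PER_GIB / TOTAL_RECORDS[table]
    for table in TOTAL_SIZE_GIB
}

for table, size in sorted(bytes_per_record.items()):
    print(f"{table:<16} {size:8.1f} bytes/record")
```

Note that the record counts for DiaObject and DiaSource differ by about a factor of four in these tables, which may be relevant to the DiaObject versus DiaSource timing difference discussed earlier.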
            

             


              People

              Assignee:
              salnikov Andy Salnikov
              Reporter:
              salnikov Andy Salnikov
              Watchers:
              Andy Salnikov, Colin Slater, Fritz Mueller