One more data point - size of the data on disk after 30k visits:
[salnikov@lsst-qserv-master04 apdb-pdac]$ nodetool -u cassandra -pw massandra status apdb
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load        Tokens  Owns (effective)  Host ID                               Rack
UN  141.142.181.162  594.74 GiB  256     39.0%             58c53860-a500-4e5a-904e-44d3b94ecbde  rack1
UN  141.142.181.163  693.87 GiB  256     45.7%             cc68614d-3312-4e37-8386-6984b53dbbe7  rack1
UN  141.142.181.129  231.88 GiB  96      15.4%             e11dce7c-b485-4474-af9a-6c349b4f09c7  rack1
This adds up to approximately 1.5 TB of data (the Load column).
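For the record, the sum can be reproduced with a small script. This is only a sketch: the node lines are pasted from the output above, and the regular expression and the `load_per_node` helper are my own illustration, assuming every node reports its Load in GiB.

```python
import re

# Node lines copied from the `nodetool status` output above.
STATUS_LINES = """\
UN 141.142.181.162 594.74 GiB 256 39.0% 58c53860-a500-4e5a-904e-44d3b94ecbde rack1
UN 141.142.181.163 693.87 GiB 256 45.7% cc68614d-3312-4e37-8386-6984b53dbbe7 rack1
UN 141.142.181.129 231.88 GiB 96 15.4% e11dce7c-b485-4474-af9a-6c349b4f09c7 rack1
"""

# Status/state code, address, then the Load value (assumed to be in GiB).
NODE_RE = re.compile(r"^[UD][NLJM]\s+(\S+)\s+([\d.]+)\s+GiB\b")


def load_per_node(text):
    """Return {address: load in GiB} for every node line in the text."""
    loads = {}
    for line in text.splitlines():
        match = NODE_RE.match(line.strip())
        if match:
            loads[match.group(1)] = float(match.group(2))
    return loads


loads = load_per_node(STATUS_LINES)
total_gib = sum(loads.values())
print(f"total load: {total_gib:.2f} GiB ({total_gib / 1024:.3f} TiB)")
```

The three Load values sum to about 1520 GiB, which is where the "approximately 1.5 TB" comes from.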
Data size on disk after a full forced compaction and cleaning up snapshots:
[salnikov@lsst-qserv-master02 ~]$ du -sk /local_data/apdb1/data/
239074676 /local_data/apdb1/data/
[salnikov@lsst-qserv-master03 ~]$ du -sk /local_data/apdb*/data/apdb
152790456 /local_data/apdb2/data/apdb
153664448 /local_data/apdb3/data/apdb
152353444 /local_data/apdb4/data/apdb
153704892 /local_data/apdb5/data/apdb
[salnikov@lsst-qserv-master04 apdb-pdac]$ du -sk /local_data/apdb*/data/apdb
178115000 /local_data/apdb2/data/apdb
174317276 /local_data/apdb3/data/apdb
179012188 /local_data/apdb4/data/apdb
182579792 /local_data/apdb5/data/apdb
Altogether this makes about 1,493 GiB, or roughly 1.5 TB, of data (we only have one replica).
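The total is just the sum of the `du -sk` numbers above converted from KiB. A minimal sketch of that arithmetic, with the values pasted by hand (the dictionary layout is only for illustration):

```python
# `du -sk` output (in KiB) per data directory, copied from the listings above.
DU_KIB = {
    "master02": [239074676],
    "master03": [152790456, 153664448, 152353444, 153704892],
    "master04": [178115000, 174317276, 179012188, 182579792],
}

KIB_PER_GIB = 1024 * 1024

per_host_gib = {host: sum(sizes) / KIB_PER_GIB for host, sizes in DU_KIB.items()}
total_kib = sum(sum(sizes) for sizes in DU_KIB.values())
total_gib = total_kib / KIB_PER_GIB

for host, gib in per_host_gib.items():
    print(f"{host}: {gib:8.2f} GiB")
print(f"total:    {total_gib:8.2f} GiB")
```

Note that the master02 number covers the whole `data/` directory rather than only `data/apdb`, so it can include a small amount of non-apdb data.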
Sizes per table (in GiB) reported by nodetool on each host:
Table             master02  master03  master04     Total
--------------------------------------------------------
DiaForcedSource       8.55     21.67     25.43     55.65
DiaObject           180.64    465.15    541.41  1,187.19
DiaObjectLast         2.54      6.52      7.59     16.64
DiaSource            35.72     90.80    106.52    233.04
--------------------------------------------------------
Total               227.44    584.14    680.95  1,492.52
Nodetool reports a compression ratio of ~90% for the large tables. This is probably because all the data there is random; we could probably get better compression with actual data.
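To illustrate why random data compresses so poorly, here is a toy comparison. It is a sketch under several assumptions: it uses Python's `zlib` rather than whatever compressor the tables are actually configured with, the "structured" records are a made-up stand-in for real data, and the ratio is taken as compressed size divided by original size.

```python
import random
import struct
import zlib

N_RECORDS = 10_000

# Random payload: 16 random bytes per record, seeded for reproducibility.
rng = random.Random(0)
random_data = bytes(rng.getrandbits(8) for _ in range(16 * N_RECORDS))

# Hypothetical "realistic" payload: a slowly increasing integer plus a
# float that takes only a few distinct values (16 bytes per record).
structured_data = b"".join(
    struct.pack("<qd", 1_600_000_000 + i, 20.0 + (i % 10) * 0.1)
    for i in range(N_RECORDS)
)


def compression_ratio(data):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


print(f"random:     {compression_ratio(random_data):.2f}")
print(f"structured: {compression_ratio(structured_data):.2f}")
```

The random payload stays at essentially its original size, while the structured one shrinks considerably. How much real APDB data would gain is not something this toy example can tell us.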
Here is the table with the number of records per table and node:
Table                master02       master03       master04          Total
--------------------------------------------------------------------------
DiaForcedSource   286,578,627    726,639,675    852,651,695  1,865,869,997
DiaObject         287,151,641    739,427,854    860,651,678  1,887,231,173
DiaObjectLast      26,801,281     68,897,633     80,246,248    175,945,162
DiaSource          68,432,578    173,964,223    204,076,870    446,473,671
--------------------------------------------------------------------------
Total             668,964,127  1,708,929,385  1,997,626,491  4,375,520,003
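Combining the two tables gives a rough average on-disk size per record. This is a back-of-the-envelope sketch only: it assumes the sizes in the first table are in GiB and simply divides the totals, ignoring any per-SSTable overhead.

```python
GIB = 1024 ** 3

# (total size in GiB, total record count), from the two tables above.
TABLES = {
    "DiaForcedSource": (55.65, 1_865_869_997),
    "DiaObject": (1187.19, 1_887_231_173),
    "DiaObjectLast": (16.64, 175_945_162),
    "DiaSource": (233.04, 446_473_671),
}

# Average number of bytes on disk per record, per table.
bytes_per_record = {
    name: size_gib * GIB / count for name, (size_gib, count) in TABLES.items()
}

for name, value in bytes_per_record.items():
    print(f"{name:16s} {value:8.1f} bytes/record")
```

DiaObject and DiaSource come out at several hundred bytes per record, DiaForcedSource at a few tens of bytes, which matches the fact that DiaObject dominates the disk usage despite having about the same record count as DiaForcedSource.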
One clear conclusion from these tests is that we need some serious monitoring of what is happening on the server side. Cassandra exposes a lot of monitoring info via JMX; I need to learn how to use that alien technology.