Details
- Type: Story
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: None
- Labels:
- Story Points: 10
- Epic Link:
- Sprint: DB_S21_12
- Team: Data Access and Database
- Urgent?: No
Description
Time to scale the Cassandra cluster again, I think 10 is a reasonable number. I also want to do a 1-year test; for that we expect 24TB of data for 3 replicas, so with some overhead 3TB per server node is probably OK. But it may be interesting to look at the dynamics after 1 year (if we don't run out of money), so 4TB per node is probably better. I think local SSDs come in multiples of 4: 8*375GiB = 3000GiB, 12*375GiB = 4500GiB. I don't think it's possible to attach more disks without stopping the instance, which destroys the data, so we need to plan capacity ahead.
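Rough capacity math behind these numbers (a quick sketch; the node count and disk options are just the ones discussed above):

```python
DATA_TB_3_REPLICAS = 24     # expected 1-year data volume, 3 replicas included
NODES = 10                  # proposed cluster size

per_node_tb = DATA_TB_3_REPLICAS / NODES
print(per_node_tb)          # 2.4 TB/node before overhead, hence the 3-4 TB/node target

# Local SSD options at 375 GiB per disk:
print(8 * 375, 12 * 375)    # 3000 GiB and 4500 GiB per node
```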
Attachments
Issue Links
- is triggered by: DM-28154 Test APDB Cassandra prototype with 6 server nodes (Done)
- links to
Activity
I managed to make 8 instances with 8 SSDs each; trying to make another instance gives an error:
Quota 'LOCAL_SSD_TOTAL_GB' exceeded. Limit: 25000.0 in region us-central1.
After the quota was increased I made 12 server VMs, Cassandra started OK and I'm running my test now. One odd thing that I noticed - JMX monitoring seems to be missing some info on all instances; in particular, per-table monitoring info is missing (or actually it only exists for the ApdbProtoVisits table, but not for any other table). It looks like an issue on the Cassandra side this time, need to investigate.
After 50k visits at approximately 2021-01-06 19:10Z I have restarted Cassandra on each node and also JMX monitor. Will see if this fixes monitoring issues.
While I'm watching how things progress slowly, here are a couple of ideas to try after this long test:
- in the current setup I do time "partitioning" using a separate table for each month (30 days), so I am running 13 queries in parallel to retrieve DiaSources. The data now seems to say that the time is proportional to the volume of data and not to the number of tables holding it, but it may still be worth trying coarser partitioning, e.g. one table per 60 days. Or try time partitioning without separate tables (tables are only needed if we want to push older data to slower/cheaper storage).
- DiaSource tables are pretty wide but we never query individual columns, so it is possible that we have an overhead storing separate columns which could be avoided if we pack the data on the client side into some BLOB format. Cassandra is probably doing a similar thing already, but we have a larger scale on the client side, so we can parallelize this more efficiently if we do the packing on the client. Most of the columns are never updated, I think the only exception is diaObjectId/ssObjectId, so we could pack all non-mutable columns into a BLOB and store it as a single column. The BLOB format needs to be portable (so that non-Python clients can parse it) but efficient. JSON is portable but wasteful; some binary format should work better (BSON, CBOR, etc.) - see the sketch below.
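A minimal sketch of the packing idea, assuming the cbor2 package; the column list is hypothetical, the real one would come from the DPDD schema:

```python
import cbor2  # any compact, portable binary codec would do (CBOR, BSON, ...)

# Hypothetical set of columns that never change after the first write.
IMMUTABLE_COLUMNS = {"ra", "decl", "psFlux", "psFluxErr", "midPointTai"}

def pack_record(record: dict) -> dict:
    """Split a full DiaSource record into mutable columns plus one packed BLOB."""
    packed = {k: v for k, v in record.items() if k in IMMUTABLE_COLUMNS}
    row = {k: v for k, v in record.items() if k not in IMMUTABLE_COLUMNS}
    row["packed"] = cbor2.dumps(packed)  # goes into a single blob column
    return row

def unpack_record(row: dict) -> dict:
    """Restore the full record on the read path."""
    record = {k: v for k, v in row.items() if k != "packed"}
    record.update(cbor2.loads(row["packed"]))
    return record
```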
It ran without issues for 450k visits; stopping it and getting summaries.
Data size on each node:
apdb-server-1: Filesystem Size Used Avail Use% Mounted on
apdb-server-1: /dev/nvme0n1p1 375G 267G 108G 72% /data/apdb1
apdb-server-1: /dev/nvme0n2p1 375G 264G 111G 71% /data/apdb2
apdb-server-1: /dev/nvme0n3p1 375G 266G 110G 71% /data/apdb3
apdb-server-1: /dev/nvme0n4p1 375G 265G 111G 71% /data/apdb4
apdb-server-1: /dev/nvme0n5p1 375G 263G 113G 70% /data/apdb5
apdb-server-1: /dev/nvme0n6p1 375G 265G 111G 71% /data/apdb6
apdb-server-1: /dev/nvme0n7p1 375G 262G 114G 70% /data/apdb7
apdb-server-1: /dev/nvme0n8p1 375G 270G 106G 72% /data/apdb8
apdb-server-2: Filesystem Size Used Avail Use% Mounted on
apdb-server-2: /dev/nvme0n1p1 375G 284G 92G 76% /data/apdb1
apdb-server-2: /dev/nvme0n2p1 375G 284G 91G 76% /data/apdb2
apdb-server-2: /dev/nvme0n3p1 375G 283G 93G 76% /data/apdb3
apdb-server-2: /dev/nvme0n4p1 375G 287G 88G 77% /data/apdb4
apdb-server-2: /dev/nvme0n5p1 375G 287G 89G 77% /data/apdb5
apdb-server-2: /dev/nvme0n6p1 375G 279G 97G 75% /data/apdb6
apdb-server-2: /dev/nvme0n7p1 375G 286G 90G 77% /data/apdb7
apdb-server-2: /dev/nvme0n8p1 375G 284G 92G 76% /data/apdb8
apdb-server-4: Filesystem Size Used Avail Use% Mounted on
apdb-server-4: /dev/nvme0n1p1 375G 294G 82G 79% /data/apdb1
apdb-server-4: /dev/nvme0n2p1 375G 294G 82G 79% /data/apdb2
apdb-server-4: /dev/nvme0n3p1 375G 294G 81G 79% /data/apdb3
apdb-server-4: /dev/nvme0n4p1 375G 293G 83G 78% /data/apdb4
apdb-server-4: /dev/nvme0n5p1 375G 296G 80G 79% /data/apdb5
apdb-server-4: /dev/nvme0n6p1 375G 294G 81G 79% /data/apdb6
apdb-server-4: /dev/nvme0n7p1 375G 293G 83G 79% /data/apdb7
apdb-server-4: /dev/nvme0n8p1 375G 294G 82G 79% /data/apdb8
apdb-server-9: Filesystem Size Used Avail Use% Mounted on
apdb-server-9: /dev/nvme0n1p1 375G 316G 60G 85% /data/apdb1
apdb-server-9: /dev/nvme0n2p1 375G 311G 65G 83% /data/apdb2
apdb-server-9: /dev/nvme0n3p1 375G 311G 65G 83% /data/apdb3
apdb-server-9: /dev/nvme0n4p1 375G 313G 63G 84% /data/apdb4
apdb-server-9: /dev/nvme0n5p1 375G 313G 63G 84% /data/apdb5
apdb-server-9: /dev/nvme0n6p1 375G 312G 64G 84% /data/apdb6
apdb-server-9: /dev/nvme0n7p1 375G 312G 64G 83% /data/apdb7
apdb-server-9: /dev/nvme0n8p1 375G 308G 68G 83% /data/apdb8
apdb-server-10: Filesystem Size Used Avail Use% Mounted on
apdb-server-10: /dev/nvme0n1p1 375G 288G 88G 77% /data/apdb1
apdb-server-10: /dev/nvme0n2p1 375G 281G 94G 75% /data/apdb2
apdb-server-10: /dev/nvme0n3p1 375G 282G 94G 76% /data/apdb3
apdb-server-10: /dev/nvme0n4p1 375G 279G 97G 75% /data/apdb4
apdb-server-10: /dev/nvme0n5p1 375G 282G 94G 76% /data/apdb5
apdb-server-10: /dev/nvme0n6p1 375G 282G 93G 76% /data/apdb6
apdb-server-10: /dev/nvme0n7p1 375G 284G 91G 76% /data/apdb7
apdb-server-10: /dev/nvme0n8p1 375G 285G 91G 76% /data/apdb8
apdb-server-3: Filesystem Size Used Avail Use% Mounted on
apdb-server-3: /dev/nvme0n1p1 375G 293G 83G 79% /data/apdb1
apdb-server-3: /dev/nvme0n2p1 375G 288G 87G 77% /data/apdb2
apdb-server-3: /dev/nvme0n3p1 375G 287G 89G 77% /data/apdb3
apdb-server-3: /dev/nvme0n4p1 375G 285G 91G 76% /data/apdb4
apdb-server-3: /dev/nvme0n5p1 375G 287G 88G 77% /data/apdb5
apdb-server-3: /dev/nvme0n6p1 375G 288G 88G 77% /data/apdb6
apdb-server-3: /dev/nvme0n7p1 375G 289G 87G 77% /data/apdb7
apdb-server-3: /dev/nvme0n8p1 375G 288G 87G 77% /data/apdb8
apdb-server-8: Filesystem Size Used Avail Use% Mounted on
apdb-server-8: /dev/nvme0n1p1 375G 284G 92G 76% /data/apdb1
apdb-server-8: /dev/nvme0n2p1 375G 287G 89G 77% /data/apdb2
apdb-server-8: /dev/nvme0n3p1 375G 288G 88G 77% /data/apdb3
apdb-server-8: /dev/nvme0n4p1 375G 286G 90G 77% /data/apdb4
apdb-server-8: /dev/nvme0n5p1 375G 283G 93G 76% /data/apdb5
apdb-server-8: /dev/nvme0n6p1 375G 287G 89G 77% /data/apdb6
apdb-server-8: /dev/nvme0n7p1 375G 285G 91G 76% /data/apdb7
apdb-server-8: /dev/nvme0n8p1 375G 292G 83G 78% /data/apdb8
apdb-server-5: Filesystem Size Used Avail Use% Mounted on
apdb-server-5: /dev/nvme0n1p1 375G 276G 100G 74% /data/apdb1
apdb-server-5: /dev/nvme0n2p1 375G 277G 99G 74% /data/apdb2
apdb-server-5: /dev/nvme0n3p1 375G 278G 98G 74% /data/apdb3
apdb-server-5: /dev/nvme0n4p1 375G 271G 105G 73% /data/apdb4
apdb-server-5: /dev/nvme0n5p1 375G 277G 99G 74% /data/apdb5
apdb-server-5: /dev/nvme0n6p1 375G 268G 108G 72% /data/apdb6
apdb-server-5: /dev/nvme0n7p1 375G 273G 102G 73% /data/apdb7
apdb-server-5: /dev/nvme0n8p1 375G 286G 90G 77% /data/apdb8
apdb-server-7: Filesystem Size Used Avail Use% Mounted on
apdb-server-7: /dev/nvme0n1p1 375G 287G 89G 77% /data/apdb1
apdb-server-7: /dev/nvme0n2p1 375G 284G 91G 76% /data/apdb2
apdb-server-7: /dev/nvme0n3p1 375G 285G 91G 76% /data/apdb3
apdb-server-7: /dev/nvme0n4p1 375G 287G 89G 77% /data/apdb4
apdb-server-7: /dev/nvme0n5p1 375G 285G 90G 77% /data/apdb5
apdb-server-7: /dev/nvme0n6p1 375G 287G 89G 77% /data/apdb6
apdb-server-7: /dev/nvme0n7p1 375G 286G 90G 77% /data/apdb7
apdb-server-7: /dev/nvme0n8p1 375G 282G 93G 76% /data/apdb8
apdb-server-11: Filesystem Size Used Avail Use% Mounted on
apdb-server-11: /dev/nvme0n1p1 375G 269G 107G 72% /data/apdb1
apdb-server-11: /dev/nvme0n2p1 375G 269G 106G 72% /data/apdb2
apdb-server-11: /dev/nvme0n3p1 375G 266G 110G 71% /data/apdb3
apdb-server-11: /dev/nvme0n4p1 375G 266G 110G 71% /data/apdb4
apdb-server-11: /dev/nvme0n5p1 375G 268G 108G 72% /data/apdb5
apdb-server-11: /dev/nvme0n6p1 375G 268G 107G 72% /data/apdb6
apdb-server-11: /dev/nvme0n7p1 375G 267G 109G 72% /data/apdb7
apdb-server-11: /dev/nvme0n8p1 375G 268G 107G 72% /data/apdb8
apdb-server-6: Filesystem Size Used Avail Use% Mounted on
apdb-server-6: /dev/nvme0n1p1 375G 270G 106G 72% /data/apdb1
apdb-server-6: /dev/nvme0n2p1 375G 269G 107G 72% /data/apdb2
apdb-server-6: /dev/nvme0n3p1 375G 271G 105G 73% /data/apdb3
apdb-server-6: /dev/nvme0n4p1 375G 264G 111G 71% /data/apdb4
apdb-server-6: /dev/nvme0n5p1 375G 269G 106G 72% /data/apdb5
apdb-server-6: /dev/nvme0n6p1 375G 264G 112G 71% /data/apdb6
apdb-server-6: /dev/nvme0n7p1 375G 271G 105G 73% /data/apdb7
apdb-server-6: /dev/nvme0n8p1 375G 276G 100G 74% /data/apdb8
apdb-server-12: Filesystem Size Used Avail Use% Mounted on
apdb-server-12: /dev/nvme0n1p1 375G 289G 87G 77% /data/apdb1
apdb-server-12: /dev/nvme0n2p1 375G 286G 90G 77% /data/apdb2
apdb-server-12: /dev/nvme0n3p1 375G 289G 87G 77% /data/apdb3
apdb-server-12: /dev/nvme0n4p1 375G 294G 82G 79% /data/apdb4
apdb-server-12: /dev/nvme0n5p1 375G 284G 92G 76% /data/apdb5
apdb-server-12: /dev/nvme0n6p1 375G 285G 91G 76% /data/apdb6
apdb-server-12: /dev/nvme0n7p1 375G 287G 89G 77% /data/apdb7
apdb-server-12: /dev/nvme0n8p1 375G 283G 93G 76% /data/apdb8
Total used size is 28452929528 kiB or 26.5 TiB.
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.128.0.91 2.29 TiB 256 ? fa91ae2e-2cc4-434a-97e1-b6ae7c6c2137 rack1
UN 10.128.0.118 2.22 TiB 256 ? 91128571-d044-4502-b5e4-d463261c70e6 rack1
UN 10.128.0.119 2.23 TiB 256 ? e5ea474a-65e4-4c14-bdbe-b45ab28209c2 rack1
UN 10.128.0.103 2.43 TiB 256 ? 9cc42ef8-f13c-4023-9385-62104c9737a1 rack1
UN 10.128.0.57 2.24 TiB 256 ? 8c1c51c2-bd81-494f-aae2-726c3b8d335d rack1
UN 10.128.0.101 2.1 TiB 256 ? bd368072-f7cf-49e7-af49-4ddf5ec52c57 rack1
UN 10.128.0.123 2.23 TiB 256 ? 06b1e44f-ecee-49cb-bad3-02a7825e92cb rack1
UN 10.128.0.121 2.2 TiB 256 ? fd955a68-4962-4543-8126-6d6ac6d5883f rack1
UN 10.128.0.100 2.15 TiB 256 ? a3206eaf-54e1-4ba1-aa28-66ee55c61a66 rack1
UN 10.128.0.122 2.08 TiB 256 ? 7c9a6d53-f926-4caf-8da7-6148f0d22e25 rack1
UN 10.128.0.52 2.21 TiB 256 ? 07f25954-4374-401a-a9fd-4227e12cdb3a rack1
UN 10.128.0.37 2.06 TiB 256 ? fe45f1b3-dc63-4570-9e30-b36f5cd9b5fb rack1
Total Load size (26.44 TiB) is consistent with df.
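A quick cross-check of the two totals (df total in kiB from the listing above, per-node Load values from nodetool status):

```python
df_total_kib = 28452929528
print(df_total_kib / 2**30)  # ~26.5 TiB (1 TiB = 2**30 kiB)

loads_tib = [2.29, 2.22, 2.23, 2.43, 2.24, 2.1, 2.23, 2.2, 2.15, 2.08, 2.21, 2.06]
print(sum(loads_tib))        # ~26.44 TiB
```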
Summary of the test:
- 12 server nodes, 32 cores, 64GB RAM each
- 8x375GiB local SSD storage on each node
- replication factor 3, QUORUM consistency for reads and writes (driver-side settings sketched after this list)
- 6 client machines, 32 cores each
- 450k visits generated
- no errors observed on client side
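For reference, a minimal sketch of how the QUORUM consistency and a longer request timeout (mentioned in the closing summary) could be configured with the Python cassandra-driver; the contact points, keyspace name, and timeout value are placeholders, not the actual test configuration:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# QUORUM for both reads and writes; request_timeout raised above the 10 s default.
profile = ExecutionProfile(
    consistency_level=ConsistencyLevel.QUORUM,
    request_timeout=120.0,               # placeholder value
)
cluster = Cluster(
    ["apdb-server-1", "apdb-server-2"],  # placeholder contact points
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("apdb")        # placeholder keyspace name
```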
The purpose of the test was to understand what happens after we reach 12 months' worth of visits; I expect read time to grow ~linearly and then plateau at a stable level after 12 months. This is based on what we saw in earlier tests, where time grows linearly with the number of visits for the source and forced source tables. In my setup I generate 800 visits per night (an average 10-hour night with 45 seconds per visit), which is probably an underestimate; I think 900-1000 visits per night may be closer to reality. So in my case 12 months corresponds to 288k visits, and 13 months to 312k visits (as we always read 13 months, the actual plateau depends on where we start writing data relative to time partition boundaries).
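The visit arithmetic behind those numbers:

```python
night_seconds = 10 * 3600       # average 10-hour observing night
visit_seconds = 45              # one visit every 45 seconds
visits_per_night = night_seconds // visit_seconds
print(visits_per_night)                     # 800

visits_12_months = visits_per_night * 30 * 12
visits_13_months = visits_per_night * 30 * 13
print(visits_12_months, visits_13_months)   # 288000 312000
```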
This first plot is for the first 100k visits, just to compare performance with DM-28154:
The growth looks linear again, like in previous tests. Here is the comparison with the previous 6-node test (DM-28154) for select timing at 100k visits:
| | 6 nodes | 12 nodes |
|---|---|---|
| obj_select_real at 100k visits, sec | 0.23 | 0.12 |
| src_select_real at 100k visits, sec | 1.96 | 1.13 |
| fsrc_select_real at 100k visits, sec | 1.07 | 0.40 |
There is a significant improvement - not exactly a factor of two in the case of DiaSource, but closer to it for the other two tables.
More plots to follow...
Here is the scatter plot for select and store times. Select time is the sum of real time for database operations (including client time for converting the data into pandas format). Store time is dominated by client CPU time doing conversions from pandas into queries.
This is the same plot but plotting averages across 5k visits:
Clearly read time plateaued after ~300k visits. Variations in the plateau are due to aliasing with 1-month partitioning (we always read complete partitions). Store time stays low though it seems to grow slowly.
Detailed per-table select time plots.
Here is the real time for selects for individual tables (total time is just the sum of three other values):
There is a visible deviation from linearity for both source tables, though total still looks reasonably linear for the first 12 months. Object select time seems to continue growing slowly even after 12 months.
Here is CPU time on client side, which also contributes to above numbers:
CPU time is still a significant contribution (I think this explains why there is no factor of 2 improvement for the DiaSource table compared to the 6-node cluster). Growth in DiaObject real time is probably explained by the growth of CPU time.
And plots for store times for individual tables:
This is real time; the per-table times only measure the time needed to run the insert queries, not to convert the data from pandas into INSERTs. Total time OTOH includes the client-side conversion time:
Here is per-table CPU time, this again only covers code that sends queries to Cassandra:
And here is the total CPU time (corresponding to the green "total" markers on the real-time plot):
This is all really great to see. One thing that would be useful to record here is some sense of the data volumes involved. E.g., what's the size of the result set for a visit, in bytes and number of records, and the data volume on disk after (roughly) N visits, etc. That would help in estimating things like the effective read bandwidth and potential storage cost trade-offs.
Regarding the number of visits, I checked a handful of different opsim runs and they seem to expect about 200k-240k visits per year, depending on the actual survey strategy chosen. That should be a pretty solid estimate, and the big uncertainties like weather tend to be downside risks. So I think your visit model is pretty reasonable. The larger uncertainty is the DIASource counts.
Colin, I'm going to add more plots from Grafana soon, there is some interesting info there. For data volume, I think we are consistent with an earlier estimate from DMTN-156, which is 2 TiB per 100k visits per replica (27 TiB estimated for 450k visits vs the actual 26.5 TiB from the numbers above). For record size - ap_proto is based on the DPDD schema, so it should be easy to guess the size on the client side, but the representation used for storage or transfer will add some overhead, which can be significant. I'll try to see how to estimate per-table storage from monitoring data (or add it to future tests).
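The DMTN-156 scaling check in numbers:

```python
tib_per_100k_visits_per_replica = 2.0   # DMTN-156 estimate
visits = 450_000
replicas = 3

estimate_tib = tib_per_100k_visits_per_replica * (visits / 100_000) * replicas
print(estimate_tib)   # 27 TiB estimated vs 26.5 TiB measured
```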
Bunch of plots from Grafana, these are all time series covering the whole test. Most of them are based on metrics produced by Cassandra, some come from ap_proto logs (the same data as on the above plots).
Timing of the queries for selecting and storing (real/wall clock time):
Counts of the objects selected per visit and per CCD. The first plot shows DiaSource and DiaObject; for DiaObject two counts are shown - before and after filtering. Before filtering is all records returned from a query, which includes everything covered by the spatial partitions enclosing the CCD region; after filtering is the records inside the CCD region. After 12 months DiaSource counts have plateaued, DiaObject continues to rise. The second plot includes DiaForcedSource:
Counts of objects stored in each table (again per CCD); this is stable (fluctuations are low as these are averaged over CCDs and multiple visits):
Next plots are all from Cassandra metrics.
Read latency for each individual table. Interesting step-like behavior for the source tables. I do not know exactly what this latency includes; the docs say it is "local" read latency, which probably means the latency of reading data on the node where the data is located and does not include result collection by the coordinator. Regular drops in DiaObject latency are probably due to compaction.
Write latency, looks stable over long periods:
Total data volume for each separate table:
Data compression efficiency for each table type:
Total SSTable count; one logical table consists of one or more SSTables per node. Counts fluctuate due to compaction. Source counts grow with time because we have per-month tables in this setup.
Read repair rate, a colorful picture. What is interesting is that it happens mostly in DiaForcedSource tables. I think this can be explained by concurrent reads and writes into the same table, which are sensitive to timing. ap_proto timing is very tight, there is very little delay between reading and writing; in the AP pipeline this will probably be less of an issue. And I don't think this is causing problems at the current repair rate, I do not see indications on other plots that it affects anything.
System load average, I do not see any outstanding nodes like in the previous tests, all looks more or less uniform:
A few more numbers for the data size of each individual table (note that we have 3 replicas, so size on disk should be divided by 3 to get the size of one replica). Cassandra compresses data on disk; ap_proto generates random payloads for many fields, which can affect compression efficiency. Also, the schema has multiple BLOB columns; the calculation below assumes ~20 bytes per BLOB on average.
| Table | Record count | Size on disk | Stored record size | Schema size |
|---|---|---|---|---|
| DiaObject | 6.6E9 | 12.46 TiB | 644 B | 624 B |
| DiaObjectLast | | 746 GiB | | 100 B |
| DiaSource | 6.6E9 | 10.24 TiB | 530 B | 504 B |
| DiaForcedSource | 32.73E9 | 3.01 TiB | 31.4 B | 60 B |
I did not count DiaObjectLast records (and cannot reconstruct the count from the logs).
Stored data sizes seem to be consistent with the compression ratios on the plot above.
One concern that I have is the ever-increasing number of DiaObjects that we read from the database. I guess this is probably expected in the current model, as there is no cutoff on the history of the DiaObjects. If this continues to grow linearly then the timing for that will become important after several years.
Another random thought - the DiaObject and DiaForcedSource tables are write-only, but we are going to update DiaSource (detaching sources from DiaObject and attaching them to SSObject). Cassandra's append-only data model could potentially make this more expensive when reading Sources back (which is already the slowest operation). It may be worth researching an option where the relation is not part of the table itself but a separate, very narrow table (sketched below).
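A hypothetical shape for such a narrow relation table (keyspace, table, and column names are made up for illustration and are not the actual ap_proto schema); the wide DiaSource rows themselves would then never be rewritten:

```python
# CQL sketch: keep the source->object association in its own narrow table so
# that re-associating a DiaSource to an SSObject rewrites one tiny row
# instead of a wide DiaSource row.
RELATION_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS apdb.DiaSourceObject (
    apdb_part int,          -- same spatial partition as DiaSource
    diaSourceId bigint,
    diaObjectId bigint,     -- cleared once re-associated
    ssObjectId bigint,      -- set when re-associated
    PRIMARY KEY (apdb_part, diaSourceId)
)
"""

REASSOCIATE_CQL = """
UPDATE apdb.DiaSourceObject
SET diaObjectId = NULL, ssObjectId = ?
WHERE apdb_part = ? AND diaSourceId = ?
"""
```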
To summarize my findings before I switch to other tests:
- Performance seems to scale reasonably: doubling the cluster size makes things approximately twice as fast (there is client-side overhead which stays constant, of course).
- As expected, select time after 12 months stays constant for the source tables (which are the slowest), but continues to grow for the object table (because we store more and more objects per region).
- Cassandra looks very stable, no errors or timeouts were observed during the whole test, but I had to tweak settings to allow larger limits for timeouts and bulk insert sizes.
Closing this ticket.
Actually, the number of SSDs can be 4, 8, 16, or 24; 12 is not possible. 10 VMs with 16 disks cost about the same as 12 VMs with 8 disks. I need to think about what other factors may be important. Having too many disks is probably not a nice thing for Cassandra. I think I'll probably start with 12 machines with 8 disks each. That should be enough capacity for maybe 15 months.