Fix Version/s: None
Team:Data Access and Database
Time to scale the Cassandra cluster again; I think 10 is a reasonable number. I also want to do a 1-year test; for that we expect 24TB of data for 3 replicas, so with some overhead 3TB per server node is probably OK. But it may be interesting to look at the dynamics after 1 year (if we don't run out of money), so 4TB per node is probably better. I think local SSDs come in multiples of 4 disks of 375GiB each: 8*375GiB = 3000GiB, 12*375GiB = 4500GiB. I don't think it's possible to attach more disks without stopping the instance, which destroys the data, so we need to plan capacity ahead.
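The capacity arithmetic above can be sketched like this (the node count, disk counts, and 24TB figure come from this comment; everything else is plain arithmetic):

```python
# Rough capacity check for the 1-year test.
total_data_tb = 24.0   # expected 1-year volume, all 3 replicas included
nodes = 10

per_node_tb = total_data_tb / nodes   # 2.4 TB -> ~3 TB with some overhead
print(f"per node: {per_node_tb:.1f} TB before overhead")

# Local SSDs are 375 GiB each and attach in fixed counts,
# so usable per-node capacity is quantized:
for n_disks in (8, 12):
    print(f"{n_disks} disks -> {n_disks * 375} GiB")
```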
- is triggered by
DM-28154 Test APDB Cassandra prototype with 6 server nodes
- links to
Colin, I'm going to add more plots from Grafana soon, there are some interesting info there. For data volume, I think we are consistent with an earlier estimate from DMTN-156 which is 2TiB per 100k visits per replica (estimated 27TiB for 450k visits and actual is 26.5TB from above numbers). For record size - ap_proto is based on DPDD schema, so it should be easy to guess what is size on client side, but representation used for storage or transfer will add some overhead which can be significant. I'll try to see how to estimate per-table storage from monitoring data (or add it to future tests).
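The consistency check against the DMTN-156 estimate is simple arithmetic (the 2TiB/100k-visits/replica figure and the visit and replica counts are from the text above):

```python
# DMTN-156 estimate: 2 TiB per 100k visits per replica.
TIB_PER_100K_VISITS_PER_REPLICA = 2.0
visits = 450_000
replicas = 3

estimate_tib = TIB_PER_100K_VISITS_PER_REPLICA * (visits / 100_000) * replicas
print(f"estimated: {estimate_tib:.1f} TiB, measured: ~26.5 TiB")
```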
Bunch of plots from Grafana, these are all time series covering the whole test. Most of them are based on metrics produced by Cassandra, some come from ap_proto logs (the same data as on the above plots).
Timing of the queries for selecting and storing (real/wall clock time):
Counts of the objects selected per visit and per CCD. The first plot shows DiaSource and DiaObject; for DiaObject two curves are shown - before and after filtering. Before filtering is all records returned from a query, which includes everything covered by the spatial partitions enclosing the CCD region. After filtering is the records inside the CCD region. After 12 months DiaSource counts have plateaued, while DiaObject continues to rise. The second plot includes DiaForcedSource:
Counts of objects stored in each table (again per CCD); this is stable (fluctuations are low as these are averaged over CCDs and multiple visits):
Next plots are all from Cassandra metrics.
Read latency for each individual table. There is interesting step-like behavior for the source tables. I do not know exactly what this latency includes; the docs say it's "local" read latency, which probably means the latency of reading data on the node where the data is located, not including result collection by the coordinator. Regular drops in DiaObject latency are probably due to compaction.
Write latency, looks stable over long periods:
Total data volume for each separate table:
Data compression efficiency for each table type:
Total SSTable count; one logical table consists of one or more SSTables per node. Counts fluctuate due to compaction. Source table counts grow with time because we have per-month tables in this setup.
Read repair rate, a colorful picture. What is interesting is that it happens mostly in DiaForcedSource tables. I think this can be explained by concurrent reads and writes into the same table, which is sensitive to timing. ap_proto timing is very tight - there is very little delay between reading and writing; in the AP pipeline this will probably be less of an issue. And I don't think the current level of repair rate is causing problems; I do not see indications on other plots that it affects anything.
System load average; I do not see any outstanding nodes like in the previous tests, everything looks more or less uniform:
A few more numbers for the data size of each individual table (note that we have 3 replicas; size on disk should be divided by 3 to get the one-replica size). Cassandra compresses data on disk, and ap_proto generates random payloads for many fields, which can affect compression efficiency. Also the schema has multiple BLOB columns; the calculation below assumes ~20 bytes per BLOB on average.
||Table||Record count||Size on disk||Stored record size||Schema size||
|DiaObject|6.6E9|12.46 TiB|644 B|624 B|
|DiaObjectLast| |746 GiB| |100 B|
|DiaSource|6.6E9|10.24 TiB|530 B|504 B|
|DiaForcedSource|32.73E9|3.01 TiB|31.4 B|60 B|
I did not count DiaObjectLast records (and cannot reconstruct it from logs).
Stored data sizes seem to be consistent with the compression ratios on the plot above.
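The per-record stored sizes in the table can be re-derived from the on-disk sizes (size on disk, divided by 3 replicas, divided by record count). The record counts above are rounded, so the derived numbers only roughly match the reported ones:

```python
TiB = 1024**4
REPLICAS = 3

# table name: (record count, size on disk in TiB, reported stored record size in B)
tables = {
    "DiaObject":       (6.6e9,   12.46, 644.0),
    "DiaSource":       (6.6e9,   10.24, 530.0),
    "DiaForcedSource": (32.73e9,  3.01, 31.4),
}

for name, (nrec, size_tib, reported) in tables.items():
    derived = size_tib * TiB / REPLICAS / nrec
    print(f"{name}: derived {derived:.0f} B vs reported {reported} B")
```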
One concern that I have is the ever-increasing number of DiaObjects that we read from the database. This is probably expected in the current model, as there is no cutoff on the history of the DiaObjects. If this continues to grow linearly then the timing for that will become important after several years.
Another random thought - DiaObject and DiaForcedSource tables are write-only, but we are going to update DiaSource (detaching records from DiaObject and attaching them to SSObject). Cassandra's append-only data model could potentially make reading Sources back (which is already the slowest operation) more expensive. It may be worth researching an option where the relation is not part of the table itself but a separate, very narrow table.
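A minimal sketch of what such a narrow relation table might look like; the table and column names here are hypothetical illustrations, not part of the actual APDB schema:

```python
# Hypothetical CQL: keep the DiaSource -> SSObject association in its own
# narrow table, so re-association becomes an insert here instead of an
# update of the wide, append-only DiaSource table.
RELATION_DDL = """
CREATE TABLE IF NOT EXISTS "DiaSourceToSSObject" (
    "diaSourceId" bigint,
    "ssObjectId"  bigint,
    PRIMARY KEY ("diaSourceId")
)
"""

# Reading sources back would then fetch the wide rows once and join the
# (much smaller) relation table on the client side.
print(RELATION_DDL.strip().splitlines()[0])
```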
To summarize my findings before I switch to other tests:
- Performance seems to scale reasonably: increasing the cluster size two times makes things approximately twice as fast (there is overhead on the client side which stays constant, of course).
- As expected, select time after 12 months stays constant for the source tables (which are the slowest), but continues to grow for the object table (because we store more and more objects per region).
- Cassandra looks very stable; no errors or timeouts were observed during the whole test, but I had to tweak settings to allow larger limits for timeouts and bulk insert sizes.
Closing this ticket.
This is all really great to see. One thing that would be useful to record here is some sense of the data volumes involved. E.g., what's the size of the result set for a visit, in bytes and number of records, and the data volume on disk after (roughly) N visits, etc. That would help in estimating things like the effective read bandwidth and potential storage cost trade-offs.
Regarding the number of visits, I checked a handful of different opsim runs and they seem to expect about 200k-240k visits per year, depending on the actual survey strategy chosen. That should be a pretty solid estimate, and the big uncertainties like weather tend to be downside risks. So I think your visit model is pretty reasonable. The larger uncertainty is the DIASource counts.