A bunch of plots from Grafana; these are all time series covering the whole test. Most of them are based on metrics produced by Cassandra; some come from ap_proto logs (the same data as in the plots above).
Timing of the queries for selecting and storing (real/wall-clock time):
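Wall-clock timings like these can be collected with a simple timer around each query. A minimal sketch; the timed function here is a placeholder, not actual ap_proto code:

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()  # monotonic wall-clock timer
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Placeholder standing in for a select/store query:
result, elapsed = timed(sum, range(1000))
```

`time.perf_counter()` is preferable to `time.time()` for interval measurements because it is monotonic and has the highest available resolution.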
Counts of the objects selected per visit and per CCD. The first plot shows DiaSource and DiaObject; for DiaObject two curves are shown, before and after filtering. "Before filtering" is all records returned from a query, which includes everything covered by the spatial partitions enclosing the CCD region; "after filtering" is only the records inside the CCD region. After 12 months the DiaSource counts have plateaued, while DiaObject continues to rise. The second plot includes DiaForcedSource:
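The query-then-filter pattern described above (fetch whole partitions overlapping the region, then keep only records truly inside it) can be sketched as follows. This is a toy 1-D grid partitioning, purely illustrative; ap_proto's actual spatial pixelization scheme is different:

```python
# Toy sketch of "select enclosing partitions, then filter to the region".
# A 1-D coordinate and fixed-width partitions stand in for real sky
# pixelization; names and layout are illustrative, not ap_proto's.
import math

PART_SIZE = 10.0  # toy partition width

def partitions_covering(lo, hi):
    """IDs of all partitions overlapping the interval [lo, hi)."""
    return set(range(int(math.floor(lo / PART_SIZE)),
                     int(math.floor(hi / PART_SIZE)) + 1))

def select_objects(records, lo, hi):
    """Return (before_filter, after_filter) record lists for one region."""
    parts = partitions_covering(lo, hi)
    # Coarse step: everything in an overlapping partition
    # (this is what the Cassandra query returns).
    before = [r for r in records if int(r // PART_SIZE) in parts]
    # Exact step: only records actually inside the region.
    after = [r for r in before if lo <= r < hi]
    return before, after

records = [1.0, 5.0, 12.0, 14.0, 19.5, 21.0]
before, after = select_objects(records, 12.0, 18.0)
# "before" spans the whole enclosing partition [10, 20);
# "after" keeps only records inside [12, 18).
```

The gap between the two curves on the plot corresponds to the records discarded in the exact filtering step.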
Counts of objects stored in each table (again per CCD); this is stable (fluctuations are low because these are averaged over CCDs and multiple visits):
The next plots are all from Cassandra metrics.
Read latency for each individual table. There is interesting step-like behavior for the source tables. I do not know exactly what this latency includes; the docs say it is "local" read latency, which probably means the latency of reading data on the node where the data is located, excluding result collection by the coordinator. The regular drops in DiaObject latency are probably due to compaction.
Write latency; it looks stable over long periods:
Total data volume for each separate table:
Data compression efficiency for each table type:
Total SSTable count; one logical table consists of one or more SSTables per node. Counts fluctuate due to compaction. Source table counts grow with time because we have per-month tables in this setup.
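With per-month tables, the table a record goes into is derived from its timestamp, so a new table (and its SSTables) appears every simulated month. A sketch of how such a name might be derived; the naming convention shown is an assumption, not necessarily what ap_proto uses:

```python
from datetime import datetime, timezone

def source_table_name(base, visit_time):
    """Hypothetical per-month table name, e.g. 'DiaSource_2020_07'.
    The real ap_proto naming convention may differ."""
    return f"{base}_{visit_time.year}_{visit_time.month:02d}"

t = datetime(2020, 7, 15, tzinfo=timezone.utc)
name = source_table_name("DiaSource", t)
```

This explains the steady growth on the plot: older monthly tables are never dropped during the test, so their SSTables accumulate.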
Read repair rate, a colorful picture. What is interesting is that it happens mostly in the DiaForcedSource tables. I think this can be explained by concurrent reads and writes to the same table, which is sensitive to timing. ap_proto timing is very tight, with very little delay between reading and writing; in the AP pipeline this will probably be less of an issue. And I don't think the current repair rate is causing problems; I see no indication on the other plots that it affects anything.
System load average; I do not see any outstanding nodes like in the previous tests, and everything looks more or less uniform: