Data Management / DM-28172

Test APDB Cassandra prototype with 12 server nodes


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Story Points: 10
    • Sprint: DB_S21_12
    • Team: Data Access and Database
    • Urgent?: No

      Description

Time to scale the Cassandra cluster again; I think 10 is a reasonable number. I also want to do a 1-year test, for which we expect 24 TB of data with 3 replicas; with some overhead, 3 TB per server node is probably OK. But it may be interesting to look at the dynamics after 1 year (if we don't run out of money), so 4 TB per node is probably better. I think local SSDs come in multiples of 4 disks: 8 × 375 GiB = 3000 GiB, 12 × 375 GiB = 4500 GiB. I don't think it's possible to attach more disks without stopping the instance, which destroys the data, so we need to plan capacity ahead.
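A quick capacity check of the numbers above (a back-of-the-envelope sketch; the node count follows the ticket title and the 20% overhead factor is an assumption):

{code:python}
# Back-of-the-envelope disk capacity check for the 1-year test described above.
GIB = 2**30

total_data = 24e12  # ~24 TB expected for 3 replicas after 1 year
nodes = 12          # server nodes (per the ticket title)
overhead = 1.2      # assumed headroom for compaction, snapshots, etc.

per_node_needed = total_data * overhead / nodes
print(f"needed per node: {per_node_needed / 1e12:.1f} TB")  # ~2.4 TB

# GCP local SSDs are 375 GiB each and come in multiples of 4 disks per instance.
for n_disks in (8, 12):
    capacity = n_disks * 375 * GIB
    verdict = "enough" if capacity >= per_node_needed else "too small"
    print(f"{n_disks} x 375 GiB = {capacity / 1e12:.2f} TB ({verdict})")
{code}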

        Attachments

        1. apdb-gcp4-counter-select-obj-src.png (55 kB)
        2. apdb-gcp4-counter-select-obj-src-fsrc.png (60 kB)
        3. apdb-gcp4-counter-stored.png (40 kB)
        4. apdb-gcp4-latency-read-diaobjectlast.png (129 kB)
        5. apdb-gcp4-latency-read-fsrc.png (150 kB)
        6. apdb-gcp4-latency-read-src.png (98 kB)
        7. apdb-gcp4-latency-write-diaobject.png (95 kB)
        8. apdb-gcp4-latency-write-diaobjectlast.png (101 kB)
        9. apdb-gcp4-latency-write-fsrc.png (90 kB)
        10. apdb-gcp4-latency-write-src.png (84 kB)
        11. apdb-gcp4-nb-time-avg.png (42 kB)
        12. apdb-gcp4-nb-time-scatter.png (81 kB)
        13. apdb-gcp4-nb-time-select-cpu.png (56 kB)
        14. apdb-gcp4-nb-time-select-fit-100k.png (77 kB)
        15. apdb-gcp4-nb-time-select-fit-50k.png (81 kB)
        16. apdb-gcp4-nb-time-select-real.png (57 kB)
        17. apdb-gcp4-nb-time-store-cpu.png (58 kB)
        18. apdb-gcp4-nb-time-store-cpu-total.png (46 kB)
        19. apdb-gcp4-nb-time-store-real.png (52 kB)
        20. apdb-gcp4-read-repair-rate-liny.png (114 kB)
        21. apdb-gcp4-read-repair-rate-logy.png (145 kB)
        22. apdb-gcp4-system-load.png (247 kB)
        23. apdb-gcp4-table-compression.png (49 kB)
        24. apdb-gcp4-table-count.png (59 kB)
        25. apdb-gcp4-table-size.png (54 kB)
        26. apdb-gcp4-timing-select-real.png (65 kB)
        27. apdb-gcp4-timing-store-real.png (146 kB)

          Issue Links

            Activity

            ctslater Colin Slater added a comment -

            This is all really great to see. One thing that would be useful to record here is some sense of the data volumes involved. E.g., what's the size of the result set for a visit, in bytes and number of records, and the data volume on disk after (roughly) N visits, etc. That would help in estimating things like the effective read bandwidth and potential storage cost trade-offs.

            Regarding the number of visits, I checked a handful of different opsim runs and they seem to expect about 200k-240k visits per year, depending on the actual survey strategy chosen. That should be a pretty solid estimate, and the big uncertainties like weather tend to be downside risks. So I think your visit model is pretty reasonable. The larger uncertainty is the DIASource counts.

            salnikov Andy Salnikov added a comment - edited

            Colin, I'm going to add more plots from Grafana soon; there is some interesting info there. For data volume, I think we are consistent with the earlier estimate from DMTN-156, which is 2 TiB per 100k visits per replica (27 TiB estimated for 450k visits, while the actual number is 26.5 TB from the numbers above). For record size, ap_proto is based on the DPDD schema, so it should be easy to guess the size on the client side, but the representation used for storage or transfer will add some overhead, which can be significant. I'll try to see how to estimate per-table storage from monitoring data (or add it to future tests).
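            For reference, the scaling quoted above works out as follows (plain arithmetic restating the numbers in this comment, nothing new):

{code:python}
# DMTN-156 estimate quoted above: 2 TiB per 100k visits per replica,
# for 450k visits and 3 replicas.
tib_per_100k_visits_per_replica = 2
visits = 450_000
replicas = 3

estimated_tib = tib_per_100k_visits_per_replica * (visits / 100_000) * replicas
print(f"estimated total on disk: {estimated_tib:.0f} TiB")  # 27 TiB, vs ~26.5 observed
{code}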

             

            salnikov Andy Salnikov added a comment - edited

            A bunch of plots from Grafana; these are all time series covering the whole test. Most of them are based on metrics produced by Cassandra, and some come from ap_proto logs (the same data as in the plots above).

            Timing of the queries for selecting and storing (real/wall-clock time):

            Counts of the objects selected per visit and per CCD. The first plot shows DiaSource and DiaObject; for DiaObject the count is shown twice: before and after filtering. Before filtering it is all records returned from the query, which covers everything in the spatial partitions enclosing the CCD region; after filtering it is only the records inside the CCD region. After 12 months the DiaSource counts have plateaued, while DiaObject continues to rise. The second plot includes DiaForcedSource:

            Counts of objects stored in each table (again per CCD); these are stable (fluctuations are low as the values are averaged over CCDs and multiple visits):

            The next plots are all from Cassandra metrics.

            Read latency for each individual table. There is interesting step-like behavior for the source tables. I do not know exactly what this latency includes; the docs say it is "local" read latency, which probably means the latency of reading data on the node where the data is located, not including result collection by the coordinator. The regular drops in DiaObject latency are probably due to compaction.


            Write latency, looks stable over long periods:



            Total data volume for each separate table:

            Data compression efficiency for each table type:

            Total SSTable count; one logical table consists of one or more SSTables per node. The counts fluctuate due to compaction. The source table counts grow with time because we have per-month tables in this setup.

            Read repair rate, a colorful picture. What is interesting is that it happens mostly in the DiaForcedSource tables. I think this can be explained by concurrent reads and writes into the same table, which is sensitive to timing. ap_proto timing is very tight, with very little delay between reading and writing; in the AP pipeline this will probably be less of an issue. I don't think the current level of read repair is causing problems, as I do not see indications on the other plots that it affects anything.

            System load average; I do not see any outlier nodes like in the previous tests, everything looks more or less uniform:

            salnikov Andy Salnikov added a comment -

            A few more numbers on data size for each individual table (note that we have 3 replicas, so size on disk should be divided by 3 to get the size of one replica). Cassandra compresses data on disk, and ap_proto generates random payload for many fields, which can affect compression efficiency. Also, the schema has multiple BLOB columns; the calculation below assumes ~20 bytes per BLOB on average.

            | Table           | Record count | Size on disk | Stored record size | Schema size |
            | DiaObject       | 6.6E9        | 12.46 TiB    | 644 B              | 624 B       |
            | DiaObjectLast   |              | 746 GiB      |                    | 100 B       |
            | DiaSource       | 6.6E9        | 10.24 TiB    | 530 B              | 504 B       |
            | DiaForcedSource | 32.73E9      | 3.01 TiB     | 31.4 B             | 60 B        |
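            As a quick cross-check of the 26.5 TB figure quoted earlier, summing the "Size on disk" column (simple arithmetic over the table above; it assumes, as noted, that these sizes include all three replicas):

{code:python}
# Sum the per-table "Size on disk" values above and derive the per-replica total.
sizes_tib = {
    "DiaObject": 12.46,
    "DiaObjectLast": 746 / 1024,  # 746 GiB
    "DiaSource": 10.24,
    "DiaForcedSource": 3.01,
}
total = sum(sizes_tib.values())
print(f"total on disk: {total:.1f} TiB, per replica: {total / 3:.1f} TiB")
# total on disk: 26.4 TiB, per replica: 8.8 TiB
{code}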

            I did not count the DiaObjectLast records (and cannot reconstruct the count from the logs).

            Stored data sizes seem to be consistent with the compression ratios on the plot above.

            One concern that I have is the ever-increasing number of DiaObjects that we read from the database. This is probably expected in the current model, as there is no cutoff on the history of the DiaObjects. If this continues to grow linearly, the timing of that read will become important after several years.

            Another random thought: the DiaObject and DiaForcedSource tables are write-only, but we are going to update DiaSource records (detaching them from a DiaObject and attaching them to an SSObject). Cassandra's append-only data model could potentially make this more expensive when reading Sources back (which is already the slowest operation). It may be worth researching an option where the relation is not part of the table itself but is kept in a separate, very narrow table.
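            A minimal sketch of that idea, assuming a hypothetical narrow mapping table (the keyspace, table, column names, and partitioning column are illustrative, not the actual APDB schema): re-attaching a DiaSource to an SSObject then rewrites only a tiny row instead of a wide DiaSource record.

{code:python}
from cassandra.cluster import Cluster

# Hypothetical narrow table keeping the DiaSource -> object association outside
# the wide DiaSource table, so re-association writes only a few bytes per source.
DDL = """
CREATE TABLE IF NOT EXISTS apdb.DiaSourceToObject (
    apdb_part int,        -- same spatial partitioning as DiaSource
    diaSourceId bigint,
    diaObjectId bigint,
    ssObjectId bigint,
    PRIMARY KEY (apdb_part, diaSourceId)
)
"""

def reattach_to_ssobject(session, apdb_part, dia_source_id, ss_object_id):
    # In Cassandra an INSERT is an upsert; columns not listed (here diaObjectId)
    # are left untouched, so only the new association is written.
    session.execute(
        "INSERT INTO apdb.DiaSourceToObject (apdb_part, diaSourceId, ssObjectId) "
        "VALUES (%s, %s, %s)",
        (apdb_part, dia_source_id, ss_object_id),
    )

if __name__ == "__main__":
    session = Cluster(["cassandra-node-1"]).connect()
    session.execute(DDL)
    reattach_to_ssobject(session, apdb_part=1234, dia_source_id=42, ss_object_id=7)
{code}

            Reading the association back would then be a query against this narrow table only, leaving the wide DiaSource rows immutable.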

            salnikov Andy Salnikov added a comment -

            To summarize my findings before I switch to other tests:

            • Performance seems to scale reasonably: doubling the cluster size makes things approximately twice as fast (client-side overhead, of course, stays constant).
            • As expected, after 12 months the select time stays constant for the source tables (which are the slowest), but it continues to grow for the object table (because we store more and more objects per region).
            • Cassandra looks very stable; no errors or timeouts were observed during the whole test, but I had to tweak settings to allow larger limits for timeouts and bulk insert sizes (see the sketch below).
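            The exact settings are not recorded here. As an illustration only, on the client side the per-request timeout can be raised with the DataStax Python driver as below; the server-side analogues would be the request-timeout and batch-size thresholds in cassandra.yaml. The contact point and the 120 s value are placeholders, not the values used in this test.

{code:python}
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Illustration only: raise the per-request timeout on the client side.
# The contact point and the 120 s value are placeholders.
profile = ExecutionProfile(request_timeout=120.0)  # seconds; the driver default is 10 s
cluster = Cluster(
    contact_points=["cassandra-node-1"],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()
{code}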

            Closing this ticket.


              People

              Assignee: salnikov Andy Salnikov
              Reporter: salnikov Andy Salnikov
              Watchers: Andy Salnikov, Colin Slater, Fritz Mueller
              Votes: 0

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.