  1. Data Management
  2. DM-23924

Latency characterization for the Summit EFD with M1M3 + M2 data streams

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Report on the latency characterization work done during the week of Feb 26 to Mar 4, when we first had real M1M3 + M2 data streams at the Summit.

      The first piece of good news is that the EFD, running on a single server with Kubernetes (k3s), was able to keep up with the M1M3 (50 Hz) + M2 (20 Hz) data streams.

      However, we noticed "latency spikes" of ~20 s every few minutes and decided to find the root cause and reduce them.

      We deployed the Kafka Control Center to help us monitor Kafka. From that tool it became clear that the consumer group for the Kafka InfluxDB connector was lagging: we observed a repeating pattern in which the connector fell ~1000 messages behind (connector lag) and then caught up. At 50 Hz, 1000 messages correspond to ~20 s, which matches the latency spikes we observed.
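
      The arithmetic linking connector lag to latency can be written out as a small sketch. The function name is ours, chosen for illustration; the only inputs are the figures quoted above (~1000 messages of lag on the 50 Hz M1M3 stream).

```python
def lag_to_latency_seconds(lag_messages: int, rate_hz: float) -> float:
    """Time a stream at rate_hz needs to produce lag_messages messages.

    If the consumer is that many messages behind, the oldest unread
    message is roughly this many seconds old.
    """
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return lag_messages / rate_hz


# Figures from the report: ~1000 messages behind on a 50 Hz stream.
print(lag_to_latency_seconds(1000, 50.0))  # 20.0
```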

      To reduce the connector lag we added more partitions to the M1M3 and M2 Kafka topics, which increased the connector throughput and significantly reduced those latency spikes. As shown in the Kafka Control Center at the Summit (https://control-center-summit-efd.lsst.codes/), we are now using 18 partitions for the M1M3 and M2 topics, so messages can be consumed in parallel. To increase the throughput when writing to InfluxDB we could also increase the number of connector tasks. Currently we are running only one connector task, but we plan to increase that soon.
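
      To illustrate how partitions and connector tasks interact, here is a minimal sketch of spreading a topic's partitions across sink tasks. It relies on one general Kafka rule: within a consumer group, each partition is read by at most one consumer, so tasks beyond the partition count stay idle. The round-robin layout and the function name are assumptions for illustration only; the assignor actually used by the deployment may distribute partitions differently.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, tasks_max: int) -> Dict[int, List[int]]:
    """Spread partitions over tasks round-robin (illustrative, not Kafka's code)."""
    if num_partitions < 1 or tasks_max < 1:
        raise ValueError("num_partitions and tasks_max must be >= 1")
    # A partition has at most one reader in the group, so extra tasks are idle.
    active_tasks = min(tasks_max, num_partitions)
    assignment: Dict[int, List[int]] = {task: [] for task in range(active_tasks)}
    for partition in range(num_partitions):
        assignment[partition % active_tasks].append(partition)
    return assignment


# Current setup in the report: 18 partitions, one connector task.
print(assign_partitions(18, 1))
# A hypothetical setup with three tasks: six partitions per task.
print(assign_partitions(18, 3))
```

      With a single task, all 18 partitions are read by that one task; with more tasks, the writes to InfluxDB can be spread across them.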

      That already improved the latency by removing the ~20 s spikes. The latency is now dominated by the SAL Kafka producer, which also shows spikes of ~2 s from time to time.

      To further improve latency we can add even more partitions, increase the number of connector tasks, run multiple connectors for different subsystems if needed, perhaps reproduce in Kafka Connect what we do with the SAL Kafka producers, and, if needed, increase the size of the Kafka Connect cluster.
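
      As a sketch of the "increase the number of connector tasks" option, the snippet below builds (but does not send) a Kafka Connect REST request that updates `tasks.max` through `PUT /connectors/{name}/config`. The base URL, the connector name, and the example configuration are hypothetical placeholders, and the example configuration is deliberately incomplete. The Summit EFD is configured through the Argo CD repository, so this only illustrates the setting involved, not how the change would be rolled out there.

```python
import json
import urllib.request
from typing import Dict


def build_tasks_max_request(
    base_url: str,
    connector_name: str,
    current_config: Dict[str, str],
    tasks_max: int,
) -> urllib.request.Request:
    """Return a PUT request that sets tasks.max, leaving other keys unchanged."""
    if tasks_max < 1:
        raise ValueError("tasks_max must be >= 1")
    new_config = dict(current_config)  # copy so the caller's dict is not mutated
    new_config["tasks.max"] = str(tasks_max)
    url = f"{base_url.rstrip('/')}/connectors/{connector_name}/config"
    return urllib.request.Request(
        url,
        data=json.dumps(new_config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical values; a real config would first be read back from
# GET /connectors/{name}/config and contains more keys than shown here.
example_config = {"topics": "example.topic", "tasks.max": "1"}
request = build_tasks_max_request(
    "http://localhost:8083", "example-influxdb-sink", example_config, 4
)
print(request.get_method(), request.full_url)
```

      Raising `tasks.max` above the number of partitions brings no extra parallelism, since each partition is read by at most one task.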

            Activity

            afausti Angelo Fausti added a comment - The notebook used to characterize the EFD latency is here https://github.com/lsst-sqre/notebook-demo/blob/master/experiments/efd/EFD_latency_characterization.ipynb

            afausti Angelo Fausti added a comment - The different configurations we tried are in the Argo CD EFD deployment configuration repository https://github.com/lsst-sqre/argocd-efd/commits/master

            afausti Angelo Fausti added a comment - Two other things I learned during these tests. First, we have lots of dashboard queries running against the database at the Summit; to have a better-controlled environment and to be able to reproduce the results, I should pause the dashboard queries. Second, connecting to InfluxDB over HTTPS introduces some latency because of the SSL handshake.

              People

              • Assignee:
                afausti Angelo Fausti
                Reporter:
                afausti Angelo Fausti
                Watchers:
                Angelo Fausti