EFD deployment and operation for AuxTel at the summit (Sep 7-14)


Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels: None
• Templates:
• Story Points: 4.2
• Team: SQuaRE

Description

EFD was deployed to 139.229.162.114 on Kubes, partially following https://sqr-031.lsst.io, which needs to be updated.

The EFD Chronograf UI can be reached at https://summit-chronograf-efd.lsst.codes and the EFD InfluxDB HTTP API at https://summit-influxdb-efd.lsst.codes.

Below are lessons learned from deploying and operating the EFD during the week of Sep 7-14 for the AuxTel activities.

1) Single-broker deployment of Kafka
The reason for trying this was a unified deployment approach with one load balancer in front of each Kafka broker: one load balancer (broker) for the single-machine deployment on Kubes, and 3 or more load balancers (brokers) for the GKE deployment. After changing the obvious Kafka configuration, internal Kafka topics were still created with replication factor 3, and I could not make that work. So for this deployment we are back to DM-20443, using 3 brokers and the NodePort approach.
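For reference, the settings involved are the replication factors of Kafka's (and Kafka Connect's) internal topics. These are standard Kafka/Connect property names with illustrative values; something along these lines is what a single-broker deployment would be expected to need, though as noted above we could not get it fully working:

```properties
# Kafka broker: internal topics must fit on a single broker
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
default.replication.factor=1

# Kafka Connect worker: its internal topics as well
config.storage.replication.factor=1
offset.storage.replication.factor=1
status.storage.replication.factor=1
```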

2) kafka-connect-manager
This was the first time we tried kafka-connect-manager, which is responsible for reconfiguring the InfluxDB Sink connector as new topics appear in Kafka. It worked pretty well: in the first days of operation, as different CSCs were added, the connector was auto-configured about 18 times, which is great (see attached screenshot from the Kafka Connect monitoring).
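The essence of what kafka-connect-manager automates can be sketched as follows. This is an illustrative Python sketch, not the actual kafka-connect-manager code: diff the topics present in Kafka against the connector's configured topic list and, when new topics appear, update the connector through the standard Kafka Connect REST API (`PUT /connectors/{name}/config`):

```python
import json
import urllib.request


def merge_topics(configured, available, prefix="lsst.sal."):
    """Return the updated topic list, or None if nothing changed.

    Topics are selected by prefix so Kafka internal topics
    (e.g. __consumer_offsets) are ignored.
    """
    selected = {t for t in available if t.startswith(prefix)}
    updated = sorted(set(configured) | selected)
    return updated if updated != sorted(configured) else None


def reconfigure(connect_url, name, config, topics):
    """PUT the updated config to the Kafka Connect REST API."""
    config = dict(config, topics=",".join(topics))
    req = urllib.request.Request(
        f"{connect_url}/connectors/{name}/config",
        data=json.dumps(config).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req)
```

Polling this diff in a loop is enough to keep the InfluxDB Sink connector in sync as CSCs come online.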

3) Storing missing values in InfluxDB
This was the first time we tested the EFD with some CSCs, in particular ATDome.

The topic lsst.sal.ATDome.logevent_azimuthCommandedState was producing NaN values, which get serialized to \ufffd by Java/Scala and are not handled by the InfluxDB Sink connector:

 Caused by: org.influxdb.InfluxDBException\$UnableToParseException: partial write: unable to parse 'lsst.sal.ATDome.logevent_azimuthCommandedState ATDomeID=1i,azimuth=�,commandedState=1i,priority=0i,private_host=1994757124i,private_kafkaStamp=1567888128.2887785,private_origin=32i,private_rcvStamp=1567888127.9708674,private_revCode="5544b90a",private_seqNum=1i,private_sndStamp=1567887306.2843742 1567888091465571857': invalid boolean dropped=0 

There was a discussion on where this should be fixed and how to store missing values in InfluxDB. The conclusion is to drop the field with the missing value at the InfluxDB Sink connector; the reason this works in InfluxDB is explained in this notebook: https://github.com/lsst-sqre/influx-demo/blob/u/afausti/missings/On_storing_missing_values_in_InfluxDB.ipynb
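The agreed fix can be sketched like this (illustrative Python, not the actual connector code, which is Java/Scala): drop any NaN field before building the line protocol, so the remaining fields of the point are still written. Fields in InfluxDB are independent per point, so the dropped field simply reads back as null for that timestamp:

```python
import math


def to_line_protocol(measurement, fields, timestamp_ns):
    """Serialize a record to InfluxDB line protocol, dropping NaN fields."""
    kept = []
    for key, value in fields.items():
        if isinstance(value, float) and math.isnan(value):
            continue  # drop the missing value instead of writing NaN
        if isinstance(value, bool):
            kept.append(f"{key}={str(value).lower()}")
        elif isinstance(value, int):
            kept.append(f"{key}={value}i")  # integer fields carry an 'i' suffix
        elif isinstance(value, float):
            kept.append(f"{key}={value}")
        else:
            kept.append(f'{key}="{value}"')  # strings are double-quoted
    return f"{measurement} {','.join(kept)} {timestamp_ns}"
```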

4) High memory usage by Prometheus
Due to limited resources at the Summit for this activity, the machine was not dedicated to the EFD; in particular, other containers for analysis were also running there. Our EFD monitoring system uses Prometheus Operator + Grafana to collect metrics from the Kubes cluster, and it seems to use a lot of memory (see attached screenshot from Grafana): we were running at 96% memory usage. At some point we reached the memory limit, which caused an outage of the EFD during the night of Sep 10.
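One possible mitigation is to bound the monitoring stack's memory so Prometheus is killed before it starves the EFD. A hedged sketch against the Prometheus Operator's Prometheus custom resource (field names follow the upstream CRD; the values are illustrative, not what we ran):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: efd-monitoring
spec:
  retention: 24h          # keep less metric history
  resources:
    requests:
      memory: 1Gi
    limits:
      memory: 2Gi         # cap Prometheus before it exhausts the node
```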

5) Kafka Connect auto.offset.reset configuration
The current auto.offset.reset setting is clearly wrong: we are not recovering data from Kafka after restarting the connector. There is also the issue of which timestamp we should use in InfluxDB. Right now the InfluxDB timestamp is the time when the data is written to InfluxDB; perhaps {{private_sndStamp}} is a better option? We don't have in SAL the hardware timestamp (when the thing actually happened) for all topics.
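Two hedged sketches of the fix. Offset recovery is a config change: auto.offset.reset=earliest is a standard Kafka consumer property (prefixed with consumer. in the Connect worker configuration) that makes a restarted connector resume from the earliest unread offsets. The timestamp could then come from the record itself; whether private_sndStamp is UTC or TAI would still need to be confirmed:

```python
# Connect worker properties (standard Kafka consumer setting,
# forwarded to the connector's consumers via the "consumer." prefix):
#   consumer.auto.offset.reset=earliest


def influx_timestamp_ns(record):
    """Derive the InfluxDB point timestamp (nanoseconds) from the
    record's private_sndStamp (seconds as a float) instead of using
    the time the point is written to InfluxDB."""
    return int(record["private_sndStamp"] * 1e9)
```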

6) Chronograf time display
A handy feature recently introduced in Chronograf is a button to display time in local time or UTC, which already helped Simon Krughoff and Patrick Ingraham in their analysis.
This version of Chronograf was also deployed to the lab in Tucson.

7) Internet connectivity

Yes, we should not rely on internet connectivity at the Summit for critical services. In particular, we saw this error a couple of times:

 java.net.UnknownHostException: summit-influxdb-efd.lsst.codes: Name or service not known 

A lesson learned is to use internal names for the services whenever possible; in this case, http://influxdb-influxdb.influxdb:8086.

8) Disk usage, retention policy, and backups

We see a data rate of about 33 GB/day, and our forecast is that the 1 TB disk available on that machine will fill up in less than a month. We need to think about a retention policy and plan backups for these data.

Morning of Sep 10

 Filesystem               Size  Used  Avail  Use%  Mounted on
 /dev/mapper/centos-root  914G  222G  692G   25%   /

Morning of Sep 13

 Filesystem               Size  Used  Avail  Use%  Mounted on
 /dev/mapper/centos-root  914G  289G  626G   32%   /
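A quick sanity check of the forecast from the figures above, with an InfluxQL retention policy (standard CREATE RETENTION POLICY syntax, names illustrative) noted as one way to bound growth:

```python
# Days until the disk fills, given the observed growth rate.
avail_gb = 626          # available on Sep 13, from df above
rate_gb_per_day = 33    # observed EFD data rate

days_left = avail_gb / rate_gb_per_day
print(f"~{days_left:.0f} days until full")  # roughly 19 days, under a month

# A retention policy would cap this, e.g. (standard InfluxQL; the
# database and policy names here are illustrative):
#   CREATE RETENTION POLICY "one_month" ON "efd"
#     DURATION 30d REPLICATION 1 DEFAULT
# with backups taken before points expire.
```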

9) InfluxDB performance

No issues related to InfluxDB performance during this activity. We recorded data for 266 topics at various frequencies, about 7M messages in total, and roughly 200k queries were executed from the monitoring dashboards and the notebook analysis in DM-21164 (see screenshot for a summary of the EFD internal monitoring).

People

• Assignee: Angelo Fausti
• Reporter: Angelo Fausti
• Watchers: Angelo Fausti