  Data Management / DM-21436

Issue an automatic restart of the InfluxDB connector if it stops running

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      We are getting two different errors that kill the InfluxDB connector. While we investigate the cause and work on handling these exceptions in the connector, we can improve the current situation by automatically restarting the connector if it stops running.
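
      A small watchdog polling the Kafka Connect REST API is one way to implement this. The Python sketch below restarts the connector, or any of its tasks, whenever a FAILED state is reported; the Connect URL and connector name are illustrative assumptions, not the actual deployment values:

      import time
      import requests

      CONNECT_URL = "http://localhost:8083"  # assumption: Connect REST endpoint
      CONNECTOR = "influxdb-sink"            # assumption: connector name

      def restart_if_failed():
          """Restart the connector, and any of its tasks, that report FAILED."""
          status = requests.get(f"{CONNECT_URL}/connectors/{CONNECTOR}/status").json()
          if status["connector"]["state"] == "FAILED":
              requests.post(f"{CONNECT_URL}/connectors/{CONNECTOR}/restart")
          for task in status["tasks"]:
              if task["state"] == "FAILED":
                  requests.post(
                      f"{CONNECT_URL}/connectors/{CONNECTOR}/tasks/{task['id']}/restart"
                  )

      if __name__ == "__main__":
          while True:
              restart_if_failed()
              time.sleep(60)  # poll once per minute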

      This happened on 2019-09-24 06:32:36 UTC:

      java.lang.RuntimeException: org.influxdb.InfluxDBIOException: java.net.ConnectException: Failed to connect to influxdb-influxdb.influxdb/10.43.78.18:8086
      

      and this one happened on 2019-09-25 05:35:10 UTC:

      Caused by: org.apache.kafka.connect.errors.DataException: lsst.sal.ATPneumatics.ackcmd
          at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:103)
          at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:514)
          at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
          at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
          ... 13 more
      Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 247
      

      The last one seems to be related to the rebalance happening on the broker:

      [2019-09-25 05:32:23,887] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 65 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
      [2019-09-25 05:32:25,933] INFO [GroupCoordinator 0]: Member connect-1-069a8f18-bac1-4588-a940-ace48b747bbd in group cp-helm-charts has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
      [2019-09-25 05:32:25,936] INFO [GroupCoordinator 0]: Preparing to rebalance group cp-helm-charts in state PreparingRebalance with old generation 3 (__consumer_offsets-42) (reason: removing member connect-1-069a8f18-bac1-4588-a940-ace48b747bbd on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
      [2019-09-25 05:32:25,937] INFO [GroupCoordinator 0]: Group cp-helm-charts with generation 4 is now empty (__consumer_offsets-42) (kafka.coordinator.group.GroupCoordinator)
      [2019-09-25 05:35:03,698] INFO [GroupCoordinator 0]: Preparing to rebalance group cp-helm-charts in state PreparingRebalance with old generation 4 (__consumer_offsets-42) (reason: Adding new member connect-1-6ceaf43a-310a-4be4-9ea5-e1527bdb1ccd) (kafka.coordinator.group.GroupCoordinator)
      [2019-09-25 05:35:06,699] INFO [GroupCoordinator 0]: Stabilized group cp-helm-charts generation 5 (__consumer_offsets-42) (kafka.coordinator.group.GroupCoordinator)
      [2019-09-25 05:35:06,702] INFO [GroupCoordinator 0]: Assignment received from leader for group cp-helm-charts for generation 5 (kafka.coordinator.group.GroupCoordinator)
      


          Activity

          Angelo Fausti added a comment - edited

          The solution seems to be as simple as setting these two parameters in the Kafka Connect configuration:

          CONNECT_ERRORS_TOLERANCE=all
          CONNECT_ERRORS_LOG_ENABLE=true
          

          I have set those and am waiting for the next failure to see how it behaves.

          We can still add a dead letter queue for the failing messages if we want.

          See this post

          https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues
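
          For reference, these are connector-level settings; a minimal sketch of what the error-handling block of the sink connector config could look like with a dead letter queue enabled (the topic name is an illustrative assumption):

          errors.tolerance=all
          errors.log.enable=true
          errors.log.include.messages=true
          errors.deadletterqueue.topic.name=dlq-influxdb-sink
          errors.deadletterqueue.context.headers.enable=true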

          I've tried to configure that using environment variables in the kafka_connect pod, but the config was not picked up; the logs still report:

          errors.tolerance=none
          errors.log.enable=false
          

          Angelo Fausti added a comment - edited

          In the meantime, I've found that there's also a connector-specific configuration for error handling:

          https://docs.lenses.io/connectors/sink/influx.html#optional-configurations

          which I'm going to add to the kafka_connect_manager.
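
          For illustration, the key option on that page is connect.influx.error.policy (NOOP, THROW, or RETRY); with RETRY, connect.influx.max.retries and connect.influx.retry.interval control how writes are retried. A sketch with example values (not the deployed configuration):

          connect.influx.error.policy=RETRY
          connect.influx.max.retries=20
          connect.influx.retry.interval=60000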

          I've asked the developers which of the two approaches is better.

          Angelo Fausti added a comment - edited

          After discussion with the InfluxDB Sink developers on https://lensesio.slack.com/archives/C90QBFWP4/p1569453972094400, error handling must be configured on both Kafka Connect and the InfluxDB Sink connector, as they handle different exceptions.

          At this point, it is not clear where the errors reported in this ticket are handled. This PR

          https://github.com/lsst-sqre/kafka-connect-manager/pull/5

          adds error handling configuration options for the InfluxDB Sink connector on kafka-connect-manager.

          kafka-connect-manager version 0.4.0 was deployed using the --error-policy NOOP option.

          After we learn how the system behaves with that option, I will go back and try different Kafka Connect error handling configurations.

          Angelo Fausti added a comment

          The NOOP error policy seems to be enough and is working fine in the EFD deployment at the Summit; the connector continues after an error is found, e.g.:

          [2019-10-08 23:07:51,818] INFO Recovered from error unable to parse 'lsst.sal.Environment.airPressure EnvironmentID=1i,paAvg1M=�,pateValue3H=�,patrValue3H=�,private_host=378765489i,private_kafkaStamp=1570576108.6848106,private_origin=136i,private_rcvStamp=1570576108.6581392,private_revCode="fb565c9a",private_seqNum=1i,private_sndStamp=1570576108.657355,sensorName="" 1570576071752263072': invalid boolean
          


            People

            • Assignee: Angelo Fausti
            • Reporter: Angelo Fausti
            • Watchers (5): Angelo Fausti, Frossie Economou, Patrick Ingraham, Russell Owen, Simon Krughoff
            • Votes: 0

