# Issue an automatic restart of the InfluxDB connector if it stops running


## Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels: None
• Story Points: 2.8
• Team: SQuaRE

## Description

We are getting two different errors that kill the InfluxDB connector. While we work to understand the cause and then handle these exceptions in the connector, we can improve the current situation by automatically restarting the connector if it stops running.

This happened on 2019-09-24 06:32:36 UTC:

```
java.lang.RuntimeException: org.influxdb.InfluxDBIOException: java.net.ConnectException: Failed to connect to influxdb-influxdb.influxdb/10.43.78.18:8086
```

and this happened on 2019-09-25 05:35:10 UTC:

```
Caused by: org.apache.kafka.connect.errors.DataException: lsst.sal.ATPneumatics.ackcmd
    at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:103)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:514)
    at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
    at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
    ... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 247
```

The second error seems to be related to a rebalance happening on the broker:

```
[2019-09-25 05:32:23,887] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 65 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2019-09-25 05:32:25,933] INFO [GroupCoordinator 0]: Member connect-1-069a8f18-bac1-4588-a940-ace48b747bbd in group cp-helm-charts has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-09-25 05:32:25,936] INFO [GroupCoordinator 0]: Preparing to rebalance group cp-helm-charts in state PreparingRebalance with old generation 3 (__consumer_offsets-42) (reason: removing member connect-1-069a8f18-bac1-4588-a940-ace48b747bbd on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
[2019-09-25 05:32:25,937] INFO [GroupCoordinator 0]: Group cp-helm-charts with generation 4 is now empty (__consumer_offsets-42) (kafka.coordinator.group.GroupCoordinator)
[2019-09-25 05:35:03,698] INFO [GroupCoordinator 0]: Preparing to rebalance group cp-helm-charts in state PreparingRebalance with old generation 4 (__consumer_offsets-42) (reason: Adding new member connect-1-6ceaf43a-310a-4be4-9ea5-e1527bdb1ccd) (kafka.coordinator.group.GroupCoordinator)
[2019-09-25 05:35:06,699] INFO [GroupCoordinator 0]: Stabilized group cp-helm-charts generation 5 (__consumer_offsets-42) (kafka.coordinator.group.GroupCoordinator)
[2019-09-25 05:35:06,702] INFO [GroupCoordinator 0]: Assignment received from leader for group cp-helm-charts for generation 5 (kafka.coordinator.group.GroupCoordinator)
```
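The automatic restart the ticket proposes can be implemented as a watchdog that polls the Kafka Connect REST API (`GET /connectors/{name}/status`) and issues a restart when the connector or one of its tasks reports `FAILED`. A minimal sketch, assuming the Connect endpoint runs at `localhost:8083` and the connector is named `influxdb-sink` (both illustrative):

```python
"""Watchdog sketch: restart a Kafka Connect connector if it stops running.
The endpoint URL and connector name below are assumptions for illustration."""
import json
import time
import urllib.request

CONNECT_URL = "http://localhost:8083"  # assumed Kafka Connect REST endpoint
CONNECTOR = "influxdb-sink"            # hypothetical connector name


def failed_tasks(status: dict) -> list:
    """Return the ids of tasks reported as FAILED in a connector status payload."""
    return [t["id"] for t in status.get("tasks", []) if t.get("state") == "FAILED"]


def connector_failed(status: dict) -> bool:
    """True if the connector instance itself is in the FAILED state."""
    return status.get("connector", {}).get("state") == "FAILED"


def watch(poll_seconds: int = 30) -> None:
    """Poll connector status forever, restarting failed connectors and tasks."""
    while True:
        with urllib.request.urlopen(f"{CONNECT_URL}/connectors/{CONNECTOR}/status") as resp:
            status = json.load(resp)
        if connector_failed(status):
            # POST /connectors/{name}/restart restarts the connector instance
            urllib.request.urlopen(urllib.request.Request(
                f"{CONNECT_URL}/connectors/{CONNECTOR}/restart", method="POST"))
        for task_id in failed_tasks(status):
            # restarting the connector does not restart its tasks; do those separately
            urllib.request.urlopen(urllib.request.Request(
                f"{CONNECT_URL}/connectors/{CONNECTOR}/tasks/{task_id}/restart", method="POST"))
        time.sleep(poll_seconds)
```

Such a script could run as a sidecar or cron job next to the Connect deployment; it papers over the crashes but does not address their root cause, which is what the error-handling configuration discussed below is for.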

## Activity

Angelo Fausti added a comment (edited)

The solution seems to be as simple as setting these two parameters in the Kafka Connect configuration:

```
CONNECT_ERRORS_TOLERANCE=all
CONNECT_ERRORS_LOG_ENABLE=true
```

I have set those and am waiting for the next failure to see how it behaves.

We can still add a dead letter queue to capture the failing messages if we want.
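As a sketch, a dead letter queue is enabled per sink connector through the `errors.*` properties in the connector config submitted to the Connect REST API (the topic name here is hypothetical):

```json
{
  "errors.tolerance": "all",
  "errors.log.enable": true,
  "errors.log.include.messages": true,
  "errors.deadletterqueue.topic.name": "dlq-influxdb-sink",
  "errors.deadletterqueue.topic.replication.factor": 1,
  "errors.deadletterqueue.context.headers.enable": true
}
```

With `errors.tolerance=all`, records that fail conversion (like the Avro deserialization error above) are skipped instead of killing the task, and with a DLQ topic configured they are written there for later inspection.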

See this post: https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues

I've tried to configure this using environment variables in the kafka_connect pod, but the config was not picked up; the logs still report:

```
errors.tolerance=none
errors.log.enable=false
```
Angelo Fausti added a comment (edited)

In the meantime, I've found that there is also connector-specific configuration for error handling:

https://docs.lenses.io/connectors/sink/influx.html#optional-configurations

which I'm going to add to kafka_connect_manager.
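For reference, the connector-level options documented at that link cover an error policy plus retry settings. A hedged sketch of what the connector config would carry (property names per the Lenses docs; the values shown are illustrative only):

```
# Error policy for the InfluxDB Sink connector: NOOP, THROW, or RETRY
# (illustrative values; see the Lenses docs linked above)
connect.influx.error.policy=NOOP
# Only used when the policy is RETRY: retry count and interval in ms
connect.influx.max.retries=20
connect.influx.retry.interval=60000
```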

I've asked the developers which of the two approaches is better.

Angelo Fausti added a comment (edited)

After discussion with the InfluxDB Sink developers on https://lensesio.slack.com/archives/C90QBFWP4/p1569453972094400,
error handling configuration must be done on both Kafka Connect and the InfluxDB Sink connector, as they handle different exceptions.

At this point, it is not clear where the errors reported in this ticket are handled. This PR adds error handling configuration options for the InfluxDB Sink connector to kafka-connect-manager:

https://github.com/lsst-sqre/kafka-connect-manager/pull/5

kafka-connect-manager version 0.4.0 was deployed using the --error-policy NOOP option.

After we learn how the system behaves with that option, I will go back to trying different Kafka Connect error handling configurations.
Angelo Fausti added a comment

The NOOP error policy seems to be enough, and it is working fine in the EFD deployment at the Summit; the connector continues after an error is found, e.g.:

```
[2019-10-08 23:07:51,818] INFO Recovered from error unable to parse 'lsst.sal.Environment.airPressure EnvironmentID=1i,paAvg1M=�,pateValue3H=�,patrValue3H=�,private_host=378765489i,private_kafkaStamp=1570576108.6848106,private_origin=136i,private_rcvStamp=1570576108.6581392,private_revCode="fb565c9a",private_seqNum=1i,private_sndStamp=1570576108.657355,sensorName="" 1570576071752263072': invalid boolean
```


## People

• Assignee: Angelo Fausti
• Reporter: Angelo Fausti
• Watchers: Angelo Fausti, Frossie Economou, Patrick Ingraham, Russell Owen, Simon Krughoff