  1. Data Management
  2. DM-25172

Redeploy Kafka on the NCSA test stand and investigations after the cluster intervention


    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:


      On May 28th there was maintenance on the NCSA test stand cluster to replace the Weave CNI plugin with Flannel.

      I restarted the EFD services on May 28th, and they seemed to be working for some CSCs.

      After starting all the CSCs we saw errors like:

      [2020-05-29 18:00:46,813] ERROR WorkerSinkTask{id=influxdb-sink-5} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
      org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
      	at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
      	at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:487)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:464)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:320)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
      	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
      	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.kafka.connect.errors.DataException: Failed to deserialize data for topic lsst.sal.MTM2.logevent_heartbeat to Avro:
      	at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:110)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:487)
      	at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
      	at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
      	... 13 more
      Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id 2091
      Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403

      That means the schema for a given topic was not registered when the producer started, or that it was registered with a different schema ID than the one used in the serialized message for that topic.
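      For context, the ID in the error is read from the message itself. Messages written with the Confluent Avro serializer start with a 5-byte header: a zero magic byte followed by the schema ID as a 4-byte big-endian integer. A minimal sketch for extracting that ID from a raw message value, assuming the producers use this wire format (the function name and the sample bytes are illustrative, not taken from the EFD code):

```python
import struct

MAGIC_BYTE = 0  # first byte of the Confluent wire format


def schema_id_from_message(value: bytes) -> int:
    """Return the schema ID embedded in a Confluent-framed Avro message."""
    if len(value) < 5:
        raise ValueError("message too short to carry a schema ID header")
    # ">bI": big-endian, 1-byte magic followed by a 4-byte unsigned schema ID.
    magic, schema_id = struct.unpack(">bI", value[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id


# Illustrative message: header for schema ID 2091 plus a dummy payload byte.
sample = struct.pack(">bI", 0, 2091) + b"\x02"
print(schema_id_from_message(sample))
```

      Comparing the ID found in a topic's messages with the IDs the registry actually knows about is one way to confirm this kind of mismatch.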

      As a consequence, the connector tasks could not start properly and failed with the error above.

      It looks like Kafka and the Schema Registry ended up in an inconsistent state. I suspect this was caused by network instability, since the producers and the EFD were running during the intervention.

      To fix that, we stopped the producers and manually deleted that topic from Kafka and the corresponding schema from the registry.

      root@cp-helm-charts-cp-kafka-0:/# kafka-topics --bootstrap-server cp-helm-charts-cp-kafka-headless.cp-helm-charts:9092 --delete --topic lsst.sal.MTM2.logevent_heartbeat
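      The command above covers only the Kafka topic. The schema deletion could be sketched as follows, using the Schema Registry REST API. This is a hedged illustration, not the exact procedure used here: the registry URL is a hypothetical placeholder, the subject name assumes the default `<topic>-value` naming strategy, and the `permanent=true` hard delete is only available in newer Schema Registry releases. The sketch only builds the requests; sending them (for example with `urllib.request.urlopen`) is left out.

```python
from typing import List
from urllib.parse import quote
from urllib.request import Request

# Assumption: hypothetical Schema Registry address, replace with the real one.
REGISTRY_URL = "http://schema-registry.example:8081"


def delete_subject_requests(topic: str, registry_url: str = REGISTRY_URL) -> List[Request]:
    """Build soft-delete then hard-delete requests for the topic's value subject."""
    # Assumption: subjects follow the default "<topic>-value" naming strategy.
    subject = quote(f"{topic}-value", safe="")
    base = f"{registry_url}/subjects/{subject}"
    return [
        Request(base, method="DELETE"),                     # soft delete
        Request(f"{base}?permanent=true", method="DELETE"),  # hard delete
    ]


for req in delete_subject_requests("lsst.sal.MTM2.logevent_heartbeat"):
    print(req.get_method(), req.full_url)
```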

      We then saw the same problem on another topic, and then repeatedly on further topics. I only got the EFD working again after a fresh start, i.e. redeploying Kafka and removing all the previous topics and schemas.





            • Assignee:
              afausti Angelo Fausti