  1. Data Management
  2. DM-25172

Redeploy Kafka on the NCSA test stand and investigations after the cluster intervention

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      On May 28th, maintenance on the NCSA test stand cluster replaced the Weave CNI plugin with Flannel.

      I restarted the EFD services on May 28th, and they seemed to be working for some CSCs.

      After starting all the CSCs we saw errors like:

      [2020-05-29 18:00:46,813] ERROR WorkerSinkTask{id=influxdb-sink-5} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
      org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
      	at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
      	at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:487)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:464)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:320)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
      	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
      	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.kafka.connect.errors.DataException: Failed to deserialize data for topic lsst.sal.MTM2.logevent_heartbeat to Avro:
      	at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:110)
      	at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:487)
      	at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
      	at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
      	... 13 more
      Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id 2091
      Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
      

      The Schema Registry has no schema with the ID embedded in the message (2091 here). That means either the schema for the topic was not registered when the producer started, or it was registered under a different schema ID than the one used in the serialized messages for that topic.
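To confirm this kind of mismatch for a given message, one option is to read the schema ID from the message bytes and ask the registry whether it knows that ID. The sketch below assumes the messages use the standard Confluent wire format (one magic byte `0`, then a 4-byte big-endian schema ID, then the Avro payload). The function names and the registry URL passed to `schema_exists` are hypothetical and not taken from the EFD deployment.

```python
import struct
from urllib.error import HTTPError
from urllib.request import urlopen

MAGIC_BYTE = 0  # first byte of a Confluent-framed message


def schema_id_of(message_value: bytes) -> int:
    """Return the schema ID embedded in a Confluent-framed Avro message."""
    if len(message_value) < 5:
        raise ValueError("message too short for the Confluent wire format")
    # ">bI": big-endian, 1 signed byte (magic) + 4-byte unsigned int (schema ID)
    magic, schema_id = struct.unpack(">bI", message_value[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id


def schema_exists(registry_url: str, schema_id: int) -> bool:
    """Return True if the registry has this schema ID, False on HTTP 404.

    registry_url is a hypothetical base URL such as
    "http://schema-registry:8081".
    """
    try:
        with urlopen(f"{registry_url}/schemas/ids/{schema_id}") as resp:
            return resp.status == 200
    except HTTPError as err:
        if err.code == 404:  # matches "Schema not found; error code: 40403"
            return False
        raise
```

A message whose value starts with the bytes `00 00 00 08 2b` carries schema ID 2091, the ID reported in the error above.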

      As a consequence, the connector tasks could not start properly and failed with the error above.
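Failed tasks like these can be listed through the Kafka Connect REST API (`GET /connectors/{name}/status`) and restarted with `POST /connectors/{name}/tasks/{id}/restart`. The sketch below is illustrative only: the connector name `influxdb-sink` is inferred from the task ID `influxdb-sink-5` in the log, and the Connect URL is a hypothetical placeholder. Restarting a task would presumably not help until the missing schema is resolved, since the task would hit the same message again.

```python
from urllib.request import Request


def failed_task_ids(status: dict) -> list:
    """Return IDs of tasks in the FAILED state.

    `status` is the parsed JSON body of GET /connectors/{name}/status.
    """
    return [
        task["id"]
        for task in status.get("tasks", [])
        if task.get("state") == "FAILED"
    ]


def restart_request(connect_url: str, connector: str, task_id: int) -> Request:
    """Build (but do not send) the request that restarts one task.

    connect_url is a hypothetical base URL such as "http://connect:8083".
    Send it with urllib.request.urlopen(request).
    """
    url = f"{connect_url}/connectors/{connector}/tasks/{task_id}/restart"
    return Request(url, method="POST")


# Example status document in the shape Kafka Connect returns.
example_status = {
    "name": "influxdb-sink",
    "connector": {"state": "RUNNING"},
    "tasks": [
        {"id": 4, "state": "RUNNING"},
        {"id": 5, "state": "FAILED", "trace": "Tolerance exceeded in error handler"},
    ],
}
```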

      It looks like Kafka and the Schema Registry got into an inconsistent state. I suspect this happened because of network instability, since the producers and the EFD were running during the intervention.

      To fix that, we stopped the producers and manually deleted that topic from Kafka and the corresponding schema from the registry.

      root@cp-helm-charts-cp-kafka-0:/# kafka-topics --bootstrap-server cp-helm-charts-cp-kafka-headless.cp-helm-charts:9092 --delete --topic lsst.sal.MTM2.logevent_heartbeat
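The command above removes only the Kafka topic. The corresponding schema can be removed through the Schema Registry REST API with `DELETE /subjects/{subject}`. The sketch below assumes the registry uses the default subject naming, where the value schema for a topic is registered under `<topic>-value`. The registry URL is a hypothetical placeholder, and the exact call used during this intervention was not recorded in the ticket.

```python
from urllib.parse import quote
from urllib.request import Request


def subject_for(topic: str, is_key: bool = False) -> str:
    """Subject name under the default topic-based naming (an assumption here)."""
    return f"{topic}-{'key' if is_key else 'value'}"


def delete_subject_request(registry_url: str, topic: str) -> Request:
    """Build (but do not send) the request that deletes a topic's value subject.

    registry_url is a hypothetical base URL such as
    "http://schema-registry:8081". Send with urllib.request.urlopen(request).
    """
    subject = quote(subject_for(topic), safe="")
    return Request(f"{registry_url}/subjects/{subject}", method="DELETE")
```

For the topic in the log, `delete_subject_request("http://schema-registry:8081", "lsst.sal.MTM2.logevent_heartbeat")` targets the subject `lsst.sal.MTM2.logevent_heartbeat-value`.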
      

      Then we saw the same problem for another topic, and then repeatedly for other topics. I only got the EFD working again after a fresh start, i.e. redeploying Kafka and removing all the previous topics and schemas.


            People

            • Assignee:
              afausti Angelo Fausti
              Reporter:
              afausti Angelo Fausti
              Watchers:
              Angelo Fausti
