Details
- Type: Story
- Status: To Do
- Resolution: Unresolved
- Fix Version/s: None
- Component/s: None
- Labels: None
- Story Points: 2.8
- Epic Link:
- Team: SQuaRE
- Urgent?: No
Description
This ticket describes a failure mode in the EFD that requires more investigation to understand what the proper fix should be.
We lost the kube04 node in the NTS cluster (it had to be cordoned). One of the brokers (broker-1) was running there. However, the broker-1 pod could not be terminated and I wasn't able to reschedule it to another node. The only apparent solution was to redeploy Kafka (perhaps that was not strictly necessary, but I didn't have any other path forward at that moment).
[In retrospect, I could perhaps have used {{kubectl delete pods <pod> --grace-period=0 --force}} to force-delete the stuck pod.]
The problem was that redeploying Kafka removed all schemas from the Schema Registry, but there were still messages in the persisted volumes of broker-0 and broker-2. These messages had the old schema IDs recorded in their first bytes.
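For context, messages serialized with the Confluent Schema Registry clients use a wire format whose header is a magic byte (0) followed by a 4-byte big-endian schema ID; a minimal sketch of reading that header (the function name is ours, not from any library):

```python
import struct

def parse_wire_header(message: bytes) -> int:
    """Return the schema ID recorded in a Confluent wire-format message.

    Wire format: 1 magic byte (0x00) + 4-byte big-endian schema ID + payload.
    """
    if len(message) < 5:
        raise ValueError("message too short for a wire-format header")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != 0:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id
```

This is why old messages keep pointing at the old IDs: the ID is baked into each message, not resolved at read time.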
[I didn't expect the schemas to be removed; they should have been preserved in the {{_schemas}} internal topic, since that topic is replicated across the three brokers. Not sure what happened here.]
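For reference, that replication expectation comes from the Schema Registry configuration: the registry stores all schemas in the Kafka topic named by {{kafkastore.topic}}, with a replication factor set by {{kafkastore.topic.replication.factor}}. A config fragment along these lines (the factor of 3 is an assumption matching our three brokers):

```properties
kafkastore.topic=_schemas
kafkastore.topic.replication.factor=3
```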
After Kafka was redeployed, the producers registered the topic schemas in the Schema Registry again. However, there is no guarantee that the newly assigned schema IDs match those recorded in the old messages. That explains the schema ID mismatch when trying to deserialize the old messages, and it leaves the cluster in an inconsistent state.
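To illustrate the failure mode, a toy simulation (an in-memory stand-in for the registry's sequential ID allocation, not the real implementation): re-registering the same schemas after a wipe does not necessarily reproduce the original IDs, because assignment depends on registration order.

```python
import itertools

class ToyRegistry:
    """Hypothetical in-memory stand-in for Schema Registry ID allocation."""

    def __init__(self, start: int = 1):
        self._ids = itertools.count(start)
        self._by_schema: dict[str, int] = {}

    def register(self, schema: str) -> int:
        # Assign the next sequential ID on first registration.
        if schema not in self._by_schema:
            self._by_schema[schema] = next(self._ids)
        return self._by_schema[schema]

# Before the incident: schemas registered in some order.
old = ToyRegistry()
old_id_a = old.register("topic-A-value")
old_id_b = old.register("topic-B-value")

# After the redeploy the registry starts empty; producers re-register,
# possibly in a different order, so the IDs baked into old messages
# no longer match the new assignments.
new = ToyRegistry()
new.register("topic-B-value")
new_id_a = new.register("topic-A-value")
```

Here {{new_id_a != old_id_a}}, so a consumer reading an old message (which carries {{old_id_a}}) would resolve it against the wrong schema, or fail to resolve it at all.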