I like this a lot! This is the path we're taking for the DM-EFD (https://sqr-029.lsst.io for the story so far) and for SQuaRE's microservice message bus. For the EFD and SQuaRE stuff we definitely needed the Schema Registry because there are so many schemas (>1000) and we do anticipate frequent schema migrations. For Alerts, even though it might be overkill, it does seem better than inventing a new system.
For users, adopting the Confluent Wire Format does add an extra wrinkle, but the good news is that there's pretty ubiquitous driver support across many languages (https://docs.confluent.io/current/clients/index.html). In Python, there's Confluent's confluent-kafka package or DM's own https://kafkit.lsst.io for asyncio apps. Consumers would need a way to identify the schema regardless (maybe put some bytes in the Kafka message's key?), so adopting the Confluent Wire Format makes sense.
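For reference, the framing is just a five-byte header in front of the Avro body, so it's easy to handle even without a Confluent client library. A minimal sketch of unpacking it (the function name is mine):

```python
import struct

def unpack_wire_format(message: bytes):
    """Split a Confluent Wire Format Kafka message into (schema_id, avro_body).

    Byte 0 is the magic byte (0x00), bytes 1-4 are the Schema Registry ID
    as a big-endian unsigned int, and the remainder is the Avro-encoded body.
    """
    if message[0] != 0:
        raise ValueError("Not a Confluent Wire Format message")
    schema_id = struct.unpack(">I", message[1:5])[0]
    return schema_id, message[5:]
```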
I like the proposal for a major-minor versioning pattern and for associating a Schema Registry subject with each major version. (We should do that for the DM-EFD too.) As a convention, consider matching the root of the subject name with the fully-qualified name (the namespace plus name fields) of the schema; we're doing that for the DM-EFD. Another thing you might want to talk about is having staging subjects for new schemas. This could be a naming convention on the subject names, like schemaname-N-dev. That way you can do end-to-end integration testing of new schemas without committing anything to the "production" subjects.
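To make that concrete, here's a minimal sketch of deriving subject names from a schema's fully-qualified name; the helper name and the exact -N / -N-dev suffixes are my assumptions, not settled convention:

```python
import json

def subject_name(schema_json: str, major: int, staging: bool = False) -> str:
    """Derive a Schema Registry subject from an Avro schema's fully-qualified
    name plus its major version, with an optional -dev staging suffix.

    For example, a schema with namespace "lsst.alerts" and name "alert" at
    major version 2 maps to "lsst.alerts.alert-2" (or "lsst.alerts.alert-2-dev").
    """
    schema = json.loads(schema_json)
    fqn = schema["name"]
    if "namespace" in schema:
        fqn = f"{schema['namespace']}.{schema['name']}"
    subject = f"{fqn}-{major}"
    return f"{subject}-dev" if staging else subject
```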
One thing you might want to plan for is creating a proxy server for the Schema Registry. The proxy would be publicly accessible (have its own public DNS and ingress) and would match the Schema Registry HTTP API except for maybe two differences. The proxy would be read-only, or alternatively, it would integrate with LSST Auth to allow administrative access (like adding new schemas). The proxy could also add its own schema caching layer and rate-limiting behavior to help prevent a DDoS attack, given the public exposure. Confluent also makes a REST Proxy product, but I think it'd be easier to implement a custom proxy to get the LSST Auth integration; the Schema Registry doesn't have a terribly complex API.
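A minimal sketch of what the read-only flavor could look like, using aiohttp (the internal registry URL is an assumption, and the LSST Auth integration and rate limiting are omitted):

```python
from aiohttp import ClientSession, web

# Assumed internal address of the Schema Registry.
REGISTRY_URL = "http://schema-registry.internal:8081"
CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"
_cache = {}  # schema-by-ID responses are immutable, so caching them is safe

async def proxy_get(request: web.Request) -> web.Response:
    """Forward GET requests to the internal Schema Registry, caching
    the immutable /schemas/ids/{id} responses."""
    path = request.rel_url.path_qs
    if path in _cache:
        return web.Response(text=_cache[path], content_type=CONTENT_TYPE)
    async with ClientSession() as session:
        async with session.get(REGISTRY_URL + path) as upstream:
            body = await upstream.text()
    if request.rel_url.path.startswith("/schemas/ids/") and upstream.status == 200:
        _cache[path] = body
    return web.Response(text=body, status=upstream.status, content_type=CONTENT_TYPE)

app = web.Application()
# Registering only GET handlers is what makes the proxy read-only.
app.router.add_get("/{tail:.*}", proxy_get)

if __name__ == "__main__":
    web.run_app(app, port=8080)
```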
Lastly, a note about schema compatibility from a user's perspective. For most Confluent Wire Format-aware clients, the default behavior is to simply deserialize the message with the schema it was written with. To get the extra behavior of dropping newly-added fields and filling in defaults for deleted optional fields, the consumer needs to deserialize with the writer's schema and project that data onto the schema the consumer application was built for. That is, you'd use an API like https://fastavro.readthedocs.io/en/latest/reader.html#fastavro._read_py.schemaless_reader:
record = schemaless_reader(fh, writer_schema, reader_schema=preferred_schema)
writer_schema is the schema identified in the message. preferred_schema is the schema the application is expecting.
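To make that concrete, here's a self-contained sketch (the schemas are made up for illustration) showing how the projection drops a field the writer added and fills in the default for an optional field the writer no longer sends:

```python
import io

import fastavro

# Hypothetical schemas: the writer added newField; the consumer's preferred
# schema instead has an optional "note" field that the writer dropped.
writer_schema = fastavro.parse_schema({
    "type": "record", "name": "alert", "namespace": "example",
    "fields": [
        {"name": "alertId", "type": "long"},
        {"name": "newField", "type": "string"},
    ],
})
preferred_schema = fastavro.parse_schema({
    "type": "record", "name": "alert", "namespace": "example",
    "fields": [
        {"name": "alertId", "type": "long"},
        {"name": "note", "type": "string", "default": ""},
    ],
})

# Simulate a message written with the newer schema.
buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"alertId": 1, "newField": "x"})
buf.seek(0)

# Deserialize and project onto the consumer's preferred schema in one step.
record = fastavro.schemaless_reader(buf, writer_schema, reader_schema=preferred_schema)
print(record)  # {'alertId': 1, 'note': ''}: newField dropped, note defaulted
```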
So taking advantage of the schema migration capability would require some documentation for our users. When a user's application is deployed, they'd have to know the ID of the schema they built the app for (their preferred schema). I'm planning on building this behavior into Kafkit's deserializer, but I haven't seen it elsewhere in the Confluent-based libraries.
Eric Bellm, Jonathan Sick, could I trouble you to take a look at this, please?
Eric, this is what I think I found out after our discussion this morning. I'd be interested in your take on whether this would be acceptable to alert consumers (brokers, science users, etc.).
Jonathan, by pure coincidence, while I was writing this I happened to see review comments on Kafkit whizzing by. I have literally only thought about Avro schema versioning for the last few hours, whereas it looks like you have gone deep into it. Would you mind checking whether what I've written here is technically plausible?