  Data Management / DM-17550

Update DMTN-093 to describe alert schema versioning


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Alert Production
    • Labels: None

      Description

      As developed in DM-17549.

            Activity

            swinbank John Swinbank added a comment -

            Eric Bellm, Jonathan Sick, could I trouble you to take a look at this, please?

            Eric, this is what I think I found out after our discussion this morning. I'd be interested in your take on whether this would be acceptable to alert consumers (brokers, science users, etc.).

            Jonathan, by pure coincidence, while I was writing this I happened to see review comments on kafkit whizzing by. I have literally only been thinking about Avro schema versioning for the last few hours, whereas it looks like you have gone deep into it — would you mind checking whether what I've written here is technically plausible?

            PR here: https://github.com/lsst-dm/dmtn-093/pull/4

            Rendered version here: https://dmtn-093.lsst.io/v/DM-17550/index.html

            ebellm Eric Bellm added a comment -

            John Swinbank I made a couple of comments on GitHub.

            The major.minor versioning approach seems fine to me.

            I will admit to a little concern about the proposed heavy reliance on the Confluent Schema Registry: it's not part of the Apache open-source portion of Kafka, but falls under Confluent's "Community License" (see https://www.confluent.io/download/ and https://www.confluent.io/confluent-community-license-faq). Probably that's fine! But it gives me some pause relative to the Apache-licensed portions.

            Moreover, we have to make a schema registry of some sort accessible to everyone (not just brokers, but potentially any users accessing Avro-serialized alerts). Confluent's Schema Registry is itself Kafka-based, and I don't yet understand the implications of exposing it to a large number of public users. Given that I do not expect more than ~tens of schema changes over the LSST lifetime, it may be overkill to deploy it this way as opposed to putting the schemas behind a simpler service (REST or some kind of VO service). The Schema Registry makes a lot of sense if all users are accessing data from Kafka and within the Confluent framework; that won't be the case for us.

            ebellm Eric Bellm added a comment -

            Ah, looks like the Confluent Registry provides a REST API generically, so maybe that's not an issue.

            jsick Jonathan Sick added a comment -

            Neat, I'll take a look today.

            swinbank John Swinbank added a comment -

            Eric Bellm — thanks for your comments!

            I'm marginal on the schema registry thing myself. As we discussed last week, I think some fairly simple logic in the code could handle the majority of the use cases (try parsing with latest schema; if it fails, back off and try the previous version; repeat); we could extend that a bit by using the Confluent wire format to pack a schema version into the packet with the Avro data to make lookups easier.
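
            As a concrete illustration of that fallback, here's a minimal sketch (assuming fastavro, and a hypothetical known_schemas list ordered newest-first; not a committed design):

            from io import BytesIO

            from fastavro import schemaless_reader

            def deserialize_with_fallback(payload, known_schemas):
                """Try each known schema, newest first, until one parses the alert payload."""
                for schema in known_schemas:
                    try:
                        return schemaless_reader(BytesIO(payload), schema)
                    except Exception:
                        continue  # back off and try the previous schema version
                raise ValueError("alert payload did not match any known schema")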

            However, I was encouraged to use the schema registry in part because (as you point out) it provides an appropriate REST API, and in part because I understand that SQuaRE are deploying it, so there will be ready-made on-project expertise.

            I'd be curious to hear if Jonathan Sick has any views on this.

            jsick Jonathan Sick added a comment -

            I like this a lot! This is the path we're taking for the DM-EFD (https://sqr-029.lsst.io for the story so far) and for SQuaRE's microservice message bus. For the EFD and SQuaRE stuff we definitely needed the Schema Registry because there are so many schemas (>1000) and we do anticipate frequent schema migrations. For Alerts, even though it might be overkill, it does seem better than inventing a new system.

            For users, adopting the Confluent Wire Format does add an extra wrinkle, but the good news is that there's pretty ubiquitous driver support across many languages (https://docs.confluent.io/current/clients/index.html). In Python, there's the confluent-kafka-python package or DM's own https://kafkit.lsst.io for asyncio apps. Consumers would need a way to identify the schema regardless (maybe put some bytes in the Kafka message's key?), so adopting the Confluent Wire Format makes sense.
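
            To make that concrete: the Confluent Wire Format prefixes each message with a magic byte (0) and a 4-byte big-endian schema ID, followed by the Avro-encoded body, so any consumer can recover the schema ID without a Kafka-specific client. A minimal, library-agnostic sketch using only the standard library:

            import struct

            def split_confluent_message(message_value):
                """Split a Confluent wire format message into (schema_id, avro_payload)."""
                magic, schema_id = struct.unpack(">bI", message_value[:5])
                if magic != 0:
                    raise ValueError("not a Confluent wire format message")
                return schema_id, message_value[5:]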

            I like the proposal for a major-minor versioning pattern and for associating a Schema Registry subject with the major version. (We should do that for the DM-EFD too.) As a convention, consider matching the root of the subject name to the fully-qualified name (namespace + name fields) of the schema; we're doing that for the DM-EFD. Another thing you might want to talk about is having staging subjects for new schemas. This could be a naming convention on the subject names, like schemaname-N-dev. That way you can do end-to-end integration testing of new schemas without committing anything to the "production" subjects.
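
            For illustration, registering a candidate schema under a -dev staging subject only needs the Schema Registry's REST API; here's a sketch using requests (the registry URL and subject root are placeholders):

            import json

            import requests

            REGISTRY_URL = "http://registry.example.org:8081"  # placeholder address

            def register_dev_schema(subject_root, major_version, schema):
                """Register a schema under a staging subject like 'lsst.alert-2-dev'."""
                subject = f"{subject_root}-{major_version}-dev"
                response = requests.post(
                    f"{REGISTRY_URL}/subjects/{subject}/versions",
                    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
                    data=json.dumps({"schema": json.dumps(schema)}),
                )
                response.raise_for_status()
                return response.json()["id"]  # registry-assigned global schema ID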

            One thing you might want to plan for is creating a proxy server for the Schema Registry. The proxy would be publicly accessible (have its own public DNS and ingress) and would match the Schema Registry HTTP API except for maybe two differences: it would be read-only, or alternatively it would integrate with LSST Auth to allow administrative access (like adding new schemas); and it could add its own schema caching layer and rate-limiting behavior to help prevent a DDoS attack, given the public exposure. Confluent also makes a REST Proxy product, but I think it'd be easier to implement a custom proxy to get the LSST Auth integration. The Schema Registry doesn't have a terribly complex API.
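
            As a rough sketch of what such a proxy could look like (assuming an asyncio stack with aiohttp; the registry address, route pattern, and caching policy are all placeholders, and auth and rate limiting are omitted):

            from aiohttp import ClientSession, web

            REGISTRY_URL = "http://schema-registry:8081"  # placeholder internal address
            _cache = {}

            async def proxy_get(request):
                """Forward read-only GET requests to the internal registry, caching responses."""
                path = request.match_info["path"]
                if path not in _cache:
                    async with ClientSession() as session:
                        async with session.get(f"{REGISTRY_URL}/{path}") as upstream:
                            upstream.raise_for_status()
                            # the registry uses a vendor JSON content type, so skip the strict check
                            _cache[path] = await upstream.json(content_type=None)
                return web.json_response(_cache[path])

            app = web.Application()
            app.add_routes([web.get("/{path:.+}", proxy_get)])

            if __name__ == "__main__":
                web.run_app(app, port=8080)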

            Lastly, a note about schema compatibility from a user's perspective. For most Confluent Wire Format-aware clients, the default behavior is just to deserialize the message with the schema associated with the message. To get the extra behavior of dropping new fields and adding defaults for deleted optional fields, the consumer needs to deserialize with the new schema and then project that data onto the schema the consumer application is built for. That is, you'd use an API like https://fastavro.readthedocs.io/en/latest/reader.html#fastavro._read_py.schemaless_reader:

            # schemaless_reader returns a single deserialized record, projected onto reader_schema
            record = schemaless_reader(fh, writer_schema, reader_schema=preferred_schema)
            

            writer_schema is the schema identified in the message. preferred_schema is the schema the application is expecting.
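
            Putting the pieces together, a consumer's read path might look like this sketch (reusing the split_confluent_message sketch above; the registry URL is a placeholder, and preferred_schema is a parsed fastavro schema the application was built against):

            import json
            from io import BytesIO

            import requests
            from fastavro import parse_schema, schemaless_reader

            REGISTRY_URL = "http://registry.example.org:8081"  # placeholder address

            def read_alert(message_value, preferred_schema):
                """Deserialize a wire-format alert and project it onto the consumer's schema."""
                schema_id, payload = split_confluent_message(message_value)
                response = requests.get(f"{REGISTRY_URL}/schemas/ids/{schema_id}")
                response.raise_for_status()
                writer_schema = parse_schema(json.loads(response.json()["schema"]))
                return schemaless_reader(BytesIO(payload), writer_schema,
                                         reader_schema=preferred_schema)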

            So taking advantage of the schema migration capability would require some documentation for our users. When a user's application is deployed, they'd have to know the ID of the schema they built the app for (their preferred schema). I'm planning on building this behavior into Kafkit's deserializer, but I haven't seen it elsewhere in the Confluent-based libraries.

            swinbank John Swinbank added a comment -

            Jonathan Sick — thank you very much for the useful comments! I'm reassured that you don't see any issues with the basic technological choices, and you provide plenty of food for thought for future development. Much appreciated!

            swinbank John Swinbank added a comment -

            Eric Bellm — I've pushed some changes to https://dmtn-093.lsst.io/v/DM-17550/index.html, primarily softening the wording a bit in the hope of addressing your concerns. To what extent have I succeeded?

            Per comments on GitHub, we should probably chat face-to-face about the scope of this document and where we can most effectively record plans and designs without them being regarded as normative.

            ebellm Eric Bellm added a comment -

            Thanks John Swinbank, I'm happy with these tweaks, and I agree that we should think about where to put interface descriptions for external users of the alert stream.

            swinbank John Swinbank added a comment -

            Thanks both; merged & done.


              People

              Assignee:
              swinbank John Swinbank
              Reporter:
              swinbank John Swinbank
              Reviewers:
              Eric Bellm
              Watchers:
              Eric Bellm, John Swinbank, Jonathan Sick
