Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-23606

lsst.alert.packet.Schema improperly duplicates nested schemas

    XMLWordPrintable

    Details

      Description

      Our current alert schemas are nested: the top level alert schema (lsst.alert)relies on secondary schemas (e.g., lsst.diaObject) defined in secondary .avsc files.
      The current lsst.alert.packet.Schema code uses fastavro to parse the schema definitions and custom code (resolve_schema_definition) to join them into a single schema. However, the custom code results in a duplicate copy of each named schema when it is referenced, rather than an initial definition and then later references to the named type.

      This is the cause of the breakage with current versions of fastavro described at https://github.com/lsst-dm/alert_stream/issues/24.

      To fix this, we could debug resolve_schema_definition, fall back to using the avro python 3 library as we previously did, or assist with the relevant fastavro issue.

        Attachments

          Activity

          Hide
          swinbank John Swinbank added a comment - - edited

          PR #8 updates resolve_schema_definition so that it doesn't issue duplicate names. That's a quick & easy workaround for the immediate issue here. However, there are a couple of things to unpack:

          First, this doesn't actually resolve https://github.com/lsst-dm/alert_stream/issues/24 because “bad” schemas (including duplicate names) are serialized with that data. It'll need to be re-written to fix this issue, and I've not done that on this ticket.

          Second, what I've done here is not actually a great solution. What I'm doing is modifying the names of the second (and later) occurrences of some records/enums/fixeds. That means the resulting “expanded” schema ends up with things like lsst.alert.cutout and lsst.alert.cutout1 in it, which is valid from a schema definition point of view, but isn't really compliant with the input schema (which doesn't define lsst.alert.cutout1 at all).

          Unfortunately, I can't see a convenient way around this. The issue is that fastavro keeps a global cache of all the schema items it has seen defined. That means if you define lsst.alert.cutout and refer to it later, you can just look it up in that cache. That's fine... until you want to load more than one version of the schema at once. At that point, the cache will get confused about whether your lsst.alert.cutout refers to the version N schema or the version M schema, and nothing works properly.

          Where does that leave us?

          • I do not believe it is possible to fix resolve_schema_defintion to handle this “properly”; it just doesn't seem to be something that fastavro is designed for. (We might ask the fastavro author about this, as he seems to be pretty responsive.)
          • Similarly, I'm not sure if the upstream fastavro issue actually helps — because I can't see how it can, without a more fundamental redesign — but I've not been able to fully understand the code yet. It might be worth cloning their branch and trying it.
          • If we can show that the “standard” Avro library is not a performance bottleneck for us, then I'd be in favour of using it. Maria did have performance issues in the past, but it might be worth revisiting those. However, I think having both fastavro and standard-Avro in our codebase would be a pretty unfortunate outcome unless we show that's really essential.
          • Of course, if we decided that it was never necessary to load multiple versions of the schema in the same process, a lot of these problems vanish...!
          Show
          swinbank John Swinbank added a comment - - edited PR #8 updates resolve_schema_definition so that it doesn't issue duplicate names. That's a quick & easy workaround for the immediate issue here. However, there are a couple of things to unpack: First, this doesn't actually resolve https://github.com/lsst-dm/alert_stream/issues/24 because “bad” schemas (including duplicate names) are serialized with that data. It'll need to be re-written to fix this issue, and I've not done that on this ticket. Second, what I've done here is not actually a great solution. What I'm doing is modifying the names of the second (and later) occurrences of some records/enums/fixeds. That means the resulting “expanded” schema ends up with things like lsst.alert.cutout and lsst.alert.cutout1 in it, which is valid from a schema definition point of view, but isn't really compliant with the input schema (which doesn't define lsst.alert.cutout1 at all). Unfortunately, I can't see a convenient way around this. The issue is that fastavro keeps a global cache of all the schema items it has seen defined. That means if you define lsst.alert.cutout and refer to it later, you can just look it up in that cache. That's fine... until you want to load more than one version of the schema at once. At that point, the cache will get confused about whether your lsst.alert.cutout refers to the version N schema or the version M schema, and nothing works properly. Where does that leave us? I do not believe it is possible to fix resolve_schema_defintion to handle this “properly”; it just doesn't seem to be something that fastavro is designed for. (We might ask the fastavro author about this, as he seems to be pretty responsive.) Similarly, I'm not sure if the upstream fastavro issue actually helps — because I can't see how it can, without a more fundamental redesign — but I've not been able to fully understand the code yet. It might be worth cloning their branch and trying it. If we can show that the “standard” Avro library is not a performance bottleneck for us, then I'd be in favour of using it. Maria did have performance issues in the past, but it might be worth revisiting those. However, I think having both fastavro and standard-Avro in our codebase would be a pretty unfortunate outcome unless we show that's really essential. Of course, if we decided that it was never necessary to load multiple versions of the schema in the same process, a lot of these problems vanish...!
          Hide
          swinbank John Swinbank added a comment -

          Similarly, I'm not sure if the upstream fastavro issue actually helps — because I can't see how it can, without a more fundamental redesign — but I've not been able to fully understand the code yet. It might be worth cloning their branch and trying it.

          I cloned and tried, and as far as I can see this upstream branch does effectively the same thing as the (old implementation of) resolve_schema_definition — ie, it generates multiple record definitions with the same name, and hence invalid schemas.

          Show
          swinbank John Swinbank added a comment - Similarly, I'm not sure if the upstream fastavro issue actually helps — because I can't see how it can, without a more fundamental redesign — but I've not been able to fully understand the code yet. It might be worth cloning their branch and trying it. I cloned and tried, and as far as I can see this upstream branch does effectively the same thing as the (old implementation of) resolve_schema_definition — ie, it generates multiple record definitions with the same name, and hence invalid schemas.
          Hide
          swinbank John Swinbank added a comment -

          I had a brainwave while listening to Jim talk about galaxy photometry, and realised I was overcomplicating this.

          Pretty sure the PR sample-avro-alert does the Right Thing here, and I've updated the stored data on alert_stream on that basis. Eric Bellm, could you see if you agree, please?

          Show
          swinbank John Swinbank added a comment - I had a brainwave while listening to Jim talk about galaxy photometry, and realised I was overcomplicating this. Pretty sure the PR sample-avro-alert does the Right Thing here, and I've updated the stored data on alert_stream on that basis. Eric Bellm , could you see if you agree, please?
          Hide
          ebellm Eric Bellm added a comment -

          Tested and confirmed that this resolves the issue.

          Show
          ebellm Eric Bellm added a comment - Tested and confirmed that this resolves the issue.

            People

            Assignee:
            swinbank John Swinbank
            Reporter:
            ebellm Eric Bellm
            Reviewers:
            Eric Bellm
            Watchers:
            Eric Bellm, John Swinbank
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.