Data Management / DM-33070

Research Kafka schemas and schema evolution


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: ts_middleware
    • Labels:
      None

      Description

      Look into schema representations for T&S SAL XML. Confluent Schema Registry supports three options: Avro, Protobuf, and JSON Schema (the latter two are relatively recent additions).

      We would strongly prefer Avro if we can make it work because that is what the EFD ingester uses, and it would not be easy to change.

            Activity

            Russell Owen added a comment (edited)

            I have attached the results of my research. My conclusion is that we should use Avro because the EFD data ingest code uses it, it has been supported for a long time, and it supports the scalar data types we use (except unsigned ints, which I think we can live without – certainly we have been doing so in the EFD).
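To make the unsigned-integer gap concrete: Avro's integer types are `int` (32-bit signed) and `long` (64-bit signed), with no unsigned variants. The following is a minimal sketch of one possible way to pick an Avro type for a given integer width. The helper name `avro_int_type` is hypothetical, and the ticket does not say how unsigned fields are actually mapped, so treat this as an illustration only.

```python
def avro_int_type(bits: int, signed: bool = True) -> str:
    """Return the narrowest Avro integer type that holds every value of the given width.

    Hypothetical helper: Avro "int" is 32-bit signed and "long" is 64-bit signed.
    """
    # An unsigned value of n bits needs n + 1 bits in a signed type.
    needed = bits if signed else bits + 1
    if needed <= 32:
        return "int"
    if needed <= 64:
        return "long"
    # For example, a 64-bit unsigned value has no lossless Avro integer type.
    raise ValueError(f"no Avro integer type holds every {bits}-bit unsigned value")


print(avro_int_type(16))                # a 16-bit signed field
print(avro_int_type(32, signed=False))  # a 32-bit unsigned field needs the wider type
```

Under this approach only 64-bit unsigned values lack a lossless representation; narrower unsigned fields can be widened to the next signed type.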

            The main difficulties we have to overcome are:

            • Avro has no built-in support for array length constraints. I propose to append _N to our array-valued field names, where N is the fixed array length. It is a very simple solution and the best one I came up with. See my notes for some other options. If the EFD ingest code is changed to put a _ before the array index, we end up with field names like Test.arrays.boolean0_5_0 (a 5-element array, index 0), which I think is reasonable.
            • There is no code generator for Python, but it is easy to write one, and I think we would be better off doing so in any case. Our needs are modest because our XML is very simple (flat, no fancy datatypes, and no optional fields).
            • We will also have to write new code generators for C++ and Java, or supplement the ones that are available, in order to handle array length constraints. Again, our XML is simple so I think this will be easy.
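The _N naming convention described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names (`make_record_schema`, `check_array_lengths`, `flatten`), the record name `arrays` in namespace `Test`, and the flattening rule, which mimics the proposed EFD behavior rather than reproducing the real ingest code. Schemas are built as plain dicts in Avro's JSON form, so no Avro library is needed.

```python
import re

# Matches names that follow the proposed convention: <base>_<N>, N a positive integer.
ARRAY_NAME = re.compile(r"^(?P<base>.+)_(?P<length>[1-9][0-9]*)$")


def make_record_schema(name, namespace, fields):
    """Build an Avro record schema as a dict.

    fields is a list of (field name, Avro item type, array length or None).
    """
    avro_fields = []
    for field_name, item_type, length in fields:
        if length is None:
            avro_fields.append({"name": field_name, "type": item_type})
        else:
            # Encode the fixed length in the field name, since Avro cannot express it.
            avro_fields.append(
                {
                    "name": f"{field_name}_{length}",
                    "type": {"type": "array", "items": item_type},
                }
            )
    return {"type": "record", "name": name, "namespace": namespace, "fields": avro_fields}


def check_array_lengths(schema, record):
    """Raise ValueError if any array field's length differs from the N in its name."""
    for field in schema["fields"]:
        field_type = field["type"]
        if not (isinstance(field_type, dict) and field_type.get("type") == "array"):
            continue
        match = ARRAY_NAME.match(field["name"])
        if match is None:
            raise ValueError(f"array field {field['name']!r} has no _N suffix")
        expected = int(match["length"])
        actual = len(record[field["name"]])
        if actual != expected:
            raise ValueError(f"{field['name']}: expected {expected} items, got {actual}")


def flatten(prefix, record):
    """Flatten a record into scalar fields, putting _ before each array index."""
    flat = {}
    for name, value in record.items():
        if isinstance(value, list):
            for index, item in enumerate(value):
                flat[f"{prefix}.{name}_{index}"] = item
        else:
            flat[f"{prefix}.{name}"] = value
    return flat


schema = make_record_schema("arrays", "Test", [("boolean0", "boolean", 5), ("int0", "int", None)])
record = {"boolean0_5": [True, False, True, False, True], "int0": 7}
check_array_lengths(schema, record)  # passes: the array has exactly 5 items
flat = flatten("Test.arrays", record)
print(sorted(flat))  # includes Test.arrays.boolean0_5_0
```

The length check has to live in generated or hand-written code like this because an Avro validator accepts an array of any length. That is the reason the C++ and Java generators would also need supplementing.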

            JSON Schema is another reasonable choice. Unlike Avro, it supports array length constraints. But unlike Avro and Protobuf, its support for numeric types is very weak: it has just one kind of int and one kind of float. Between that limitation and the EFD ingest code, I am strongly in favor of using Avro.
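The trade-off can be seen by writing the same 5-element array field in both formats. This is only a sketch: the variable names and the helper `json_schema_length_ok` are hypothetical, and the helper checks just the two length keywords rather than performing full JSON Schema validation. The keywords `minItems`, `maxItems`, `integer`, and `number` come from JSON Schema; `float` and `double` are Avro primitive types.

```python
# JSON Schema: the length is part of the schema, but the only numeric types
# are "integer" and "number", so float versus double cannot be expressed.
json_schema_field = {
    "type": "array",
    "items": {"type": "number"},
    "minItems": 5,
    "maxItems": 5,
}

# Avro: the item type distinguishes float from double (and int from long),
# but there is no place to state the array length.
avro_field_type = {"type": "array", "items": "double"}


def json_schema_length_ok(field_schema, value):
    """Check only the minItems/maxItems keywords of a JSON Schema array field."""
    low = field_schema.get("minItems", 0)
    high = field_schema.get("maxItems", len(value))
    return low <= len(value) <= high


print(json_schema_length_ok(json_schema_field, [0.0] * 5))
print(json_schema_length_ok(json_schema_field, [0.0] * 4))
```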


              People

              Assignee:
              rowen Russell Owen
              Reporter:
              rowen Russell Owen
              Watchers:
              Angelo Fausti, Russell Owen, Tiago Ribeiro