Data Management / DM-29984

Schema evolution in the T&S XML and the EFD


    Details

    • Type: Story
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Some notes for future discussion (TBD) on handling schema changes in the T&S XML and the EFD.

      Summary:

      Schema changes are inevitable, and we want to make sure we evolve the schema in a way that doesn't break the connectors we use in the EFD or the queries that we run against InfluxDB and Postgres.

      We have a mechanism to check if a schema change is compatible or not, but that's currently disabled.

      We propose to use FORWARD schema compatibility and enable the schema compatibility checks as part of T&S CI to detect incompatible changes early on.

      Background:

      T&S XML schemas are translated to Avro schemas by the SAL Kafka producer. Every time there's a schema change (a release of T&S XML), we update the SAL Kafka producer first, and a new version of the Avro schema for a given topic is uploaded to the Kafka Schema Registry.

      In Kafka, when a schema is first created for a subject (topic), it gets a unique id, and it gets a version number. When the schema is updated (if it passes compatibility checks), it gets a new id, and it gets an incremented version number.
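      The id/version bookkeeping described above can be sketched with a minimal in-memory model. This is illustrative only (the class and subject names are made up, not the Schema Registry API): each distinct schema gets a globally unique id, and each subject keeps its own incrementing version counter.

```python
# Minimal in-memory sketch of Schema Registry versioning semantics (illustrative,
# not the real API): every distinct schema gets a globally unique id; each
# subject (topic) keeps its own monotonically increasing version counter.
class ToyRegistry:
    def __init__(self):
        self._next_id = 1
        self._subjects = {}  # subject -> list of (schema_id, schema)

    def register(self, subject, schema):
        versions = self._subjects.setdefault(subject, [])
        # Re-registering an identical schema returns the existing id/version.
        for version, (schema_id, existing) in enumerate(versions, start=1):
            if existing == schema:
                return schema_id, version
        schema_id = self._next_id
        self._next_id += 1
        versions.append((schema_id, schema))
        return schema_id, len(versions)

registry = ToyRegistry()
first = registry.register("lsst.sal.MTMount.azimuth", "schema-v1")   # (1, 1)
second = registry.register("lsst.sal.MTMount.azimuth", "schema-v2")  # (2, 2)
```

Registering the same schema string a second time returns the original id and version rather than creating a new entry, which mirrors the registry's deduplication behaviour.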

      Currently, in the EFD, the Avro schema compatibility checks are disabled.

      For the EFD we want to make sure that after changes in the T&S XML schema:

      • The connectors continue working, i.e., writing to InfluxDB, Postgres, and Parquet files.
      • We don't break any queries that used to run against InfluxDB or Postgres.

      So far, we have been using mostly InfluxDB. InfluxDB handles schema changes because fields are optional:

      • If you delete a field from the schema, InfluxDB automatically fills in None for all subsequent points, so an existing query won't break.
      • If you add a field, InfluxDB automatically fills in None for the previous points; old queries don't use the new field, so that's fine, and new queries can use it.
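      The two bullets above can be simulated with a small sketch. This is not the InfluxDB client API, just a model of why optional fields make added and removed fields harmless to queries: missing fields simply come back as None.

```python
# Hedged sketch (not the InfluxDB client API): model how field-optional storage
# tolerates added/removed fields. A query over requested fields pads any field
# missing from a stored point with None instead of failing.
def query(points, fields):
    """Return one tuple per point for the requested fields, None where absent."""
    return [tuple(p.get(f) for f in fields) for p in points]

# Points written before and after a schema change that added "temperature".
points = [
    {"azimuth": 120.5},                       # written under the old schema
    {"azimuth": 121.0, "temperature": 3.2},   # new schema adds a field
]

old_query = query(points, ["azimuth"])                  # unaffected by the change
new_query = query(points, ["azimuth", "temperature"])   # old points padded with None
```

An old query never sees the new field, and a new query sees None for points written before the field existed, so neither direction of change breaks reads.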

      However, we have noticed some incompatible changes in InfluxDB:

      • If you change the data type of a field, from integer to float for example, this will break the InfluxDB connector (it won't be able to write to the existing schema).

      For the Postgres connector, there is limited support for schema evolution. The connector can only add columns to a table; it cannot remove columns, and I believe the same is true for the Parquet connector.

      The "policy" we agreed verbally with T&S folks is never to change data types of existing fields, delete or rename existing fields but add a new field instead.

      The above means we should be enforcing FORWARD schema compatibility.

      For this to work, we should enable the schema compatibility checks as part of T&S CI to detect incompatible changes early on, i.e., before a T&S XML release.
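      The verbal policy above can be expressed as a simple check. This is a simplified sketch, not the full Avro FORWARD-compatibility algorithm the Schema Registry implements; it only encodes the agreed rules (fields may be added; existing fields must keep their names and types), with made-up field names for illustration.

```python
# Simplified sketch of the agreed policy (NOT the full Avro forward-compatibility
# check the Schema Registry runs): existing fields must keep their names and
# types; only additions are allowed.
def check_policy(old_fields, new_fields):
    """Map field name -> Avro type for each schema; return a list of violations."""
    violations = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            violations.append(f"field '{name}' was deleted or renamed")
        elif new_fields[name] != old_type:
            violations.append(
                f"field '{name}' changed type: {old_type} -> {new_fields[name]}")
    return violations

old = {"azimuth": "int", "elevation": "double"}
new = {"azimuth": "float", "elevation": "double", "temperature": "double"}
problems = check_policy(old, new)
# problems flags the int -> float change; the added "temperature" field is allowed.
```

A check like this, run in T&S CI against the previously released schemas, is what would surface an incompatible change before a T&S XML release rather than in production.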


          Activity

          afausti Angelo Fausti added a comment -

          Russell Owen that's a very interesting case. I agree, some breaking changes will have to go through - that seems inevitable. The importance of the schema compatibility checks is that we'll know that in advance. The situation now is that we only know when it breaks in production and that’s what we want to avoid. My proposal above is to enable these checks in the T&S CI so we can plan ahead and make any manual changes in Kafka or in the InfluxDB schema in preparation for that.

          afausti Angelo Fausti added a comment - - edited

          Tiago Ribeiro yes, if you rename a component you'll get new topics in Kafka and InfluxDB; however, the old topics won't be automatically renamed. It would require manual intervention in Kafka to delete the old topics and in the InfluxDB schema to rename the topics - that's why we say it "breaks the schema evolution" (the old topics still exist in the EFD but they don't exist anymore in ts_xml). Now, if we want to rename a component that's OK, we just need to make these manual changes in advance in preparation for the deployment of the new ts_xml version.

          The schema compatibility checks will add more control to these changes and detect incompatible changes early on.
          The same is true for type changes - it's not that they are not allowed, it's just that the change won't be applied automatically to the existing schema in Kafka and in InfluxDB.

          frossie Frossie Economou added a comment - - edited

          Alright, so we (me, Angelo Fausti, Simon Krughoff) met with Tony Johnson, Russell Owen, Michael Reuter, and Tiago Ribeiro to clarify a few things and discuss with more bandwidth.

          Summary:

          • We all agree on the problem statement.
          • From the Camera point of view, it sends telemetry and configuration to the EFD. Telemetry is not special for the purposes of this ticket and is included in the discussion below. Configuration has a lot of active schema churn but is also currently of mostly transient value, so we think that we don't have a huge issue with schema non-continuity now; at some point it will stabilise and this problem will go away. The main issue here is that variable names are auto-generated, and hence we cannot do the "just rename the variable if you are going to change the type on us" trick.
          • In the general case, the challenge here is to detect schema changes that are not forward compatible ahead of time, so they can be resolved (if accidental) or addressed in a planned way (if intentional).
          • SQuaRE (Angelo) will work with T&S (Tiago) to build integration tests into the T&S build chain to detect schema breakages before deployment.
          • Michael will describe (separately below) the process by which these integration tests will become part of the T&S software release lifecycle.
          • I will schedule some time for this work in the later part of the summer cycle, as responding to this problem is going to be an even bigger headache once we have data going into the consolidated EFD (Postgres) and yet another schema to manage.
          • All of the above will do a lot to mitigate the problems we have seen going forward. Meanwhile, we have accumulated a lot of historical cruft in the EFD and will need to do a spring clean once we have stopped making the problem worse.

          Thanks all, good talk. 

          mareuter Michael Reuter added a comment -

          We will integrate a job to determine the schema changes as part of our cycle (release) build process. The schema job will be launched after the base artifact build that produces all of our SAL libraries completes. The report generated from the job will be delivered to SQuaRE so they can plan appropriately for the changes that will show up in deployment. Once we have established this job within the cycle build context, we will make a similar job for our daily and bleed builds that will allow the SAL/XML collective to decide on actions long before changes arrive at SQuaRE's doorstep.

          afausti Angelo Fausti added a comment -

          Still relevant, but must be planned for another epic. Moving to my backlog epic for now.


            People

            Assignee:
            afausti Angelo Fausti
            Reporter:
            afausti Angelo Fausti
            Watchers:
            Angelo Fausti, Frossie Economou, Michael Reuter, Russell Owen, Simon Krughoff, Tiago Ribeiro, Tony Johnson

