RFC-699

Add backwards compatibility testing and associated repo(s)


    Details

    • Type: RFC
    • Status: Adopted
    • Resolution: Unresolved
    • Component/s: DM
    • Labels:
      None

      Description

      Several times in the past few years, we have introduced problems in which new versions of the stack cannot read data processed with older versions of the stack (e.g., DM-10302, DM-23584). This is partly due to a lack of testing of reading old data with new versions of the stack; I believe it is also because an explicit guarantee that old data remain readable does not appear in the dev guide.

      This RFC proposes to remedy these problems by (a) making it explicit that we will support reading some subset of old data; and (b) adding a new repo (or repos) that will be tested regularly in Jenkins. These tests should be lightweight enough to run with every Jenkins build. This repo will contain a "blessed" set of older data. At a minimum, these data should include samples of the data available in /datasets on lsst-dev, which have already been "blessed" via the RFC process as worthy of being generally publicized to stack developers. This RFC would make the implicit promise that these data are readable with future versions of the stack explicit and tested.

      Specifically, as a straw-man proposal I would propose:

      • Two repos, backcomp and testdata_backcomp (yes, these are horrible names, this is a straw-man.)
      • The testdata_backcomp package would be an optional setup, Git LFS-backed, with samples of files from HSC RC2 processing, ImSim DC2 processing, DECam processing, etc. (if it is in /datasets, it should be included unless there is a reason not to); see the table-file sketch after this list.
      • These sample files will include something like a single example of raw, flat, dark, bias, fringe, calexp, calexp_background, src, skyCorr, deepCoadd_calexp, deepCoadd_meas for each obs set / generation of processing that we support.
      • The test repo will have very basic tests: load a butler, get various datasets, and perhaps do a minimum of manipulation (realize a background; check the coadd input table; load PSFs and WCSs); see the test sketch after this list.
      • We can start by adding HSC RC2, ImSim DC2, and DECam, and expand from there.
      • It will be the responsibility of the developer adding new blessed data to /datasets to add a suitable sample to these repos, ensuring that these data remain readable in the future. This should be easy, and also worthwhile: spending a few minutes up front to add data here will save a lot of pain later.
      • The decision to remove old datasets from this testing suite, explicitly breaking backwards compatibility promises, should have an RFC and an associated community post.
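
      A minimal sketch of the optional-setup wiring, assuming the usual eups table mechanism; the package names are the straw-man's, and the dependency list is illustrative only:

          # ups/backcomp.table (hypothetical): declare the LFS data package as
          # an optional dependency, so stack builds that lack the data still
          # succeed and the tests simply skip themselves.
          setupRequired(daf_butler)
          setupOptional(testdata_backcomp)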

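      A minimal sketch of what the basic tests might look like, assuming a Gen3 butler and the straw-man package name; the collection, data ID, and dataset type names are placeholders, not settled design:

          # test_backcomp.py (hypothetical): "can we still read it?" checks
          # against the blessed sample repo.
          import unittest

          import lsst.utils
          from lsst.daf.butler import Butler

          try:
              TESTDATA_DIR = lsst.utils.getPackageDir("testdata_backcomp")
          except LookupError:
              TESTDATA_DIR = None


          @unittest.skipIf(TESTDATA_DIR is None, "testdata_backcomp is not set up")
          class ReadOldDataTestCase(unittest.TestCase):
              """Check that blessed old outputs stay readable with the current stack."""

              def setUp(self):
                  # Repo layout, collection, and data ID are assumptions.
                  self.butler = Butler(TESTDATA_DIR, collections=["HSC/RC2/sample"])
                  self.dataId = {"instrument": "HSC", "visit": 903334, "detector": 22}

              def testCalexp(self):
                  calexp = self.butler.get("calexp", self.dataId)
                  # The persisted PSF and WCS must reconstruct on read.
                  self.assertIsNotNone(calexp.getPsf())
                  self.assertIsNotNone(calexp.getWcs())

              def testBackground(self):
                  # Gen3 name for the straw-man's calexp_background.
                  bkg = self.butler.get("calexpBackground", self.dataId)
                  # Realizing the background exercises its persistence format.
                  self.assertIsNotNone(bkg.getImage())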
            Activity

            Robert Gruendl [X] (Inactive) added a comment -

            Current Gen3 development includes the concept that changes will be accompanied by a utility to migrate existing repositories when the need arises. Once Gen3 is accepted (and Gen2 has been fully deprecated), updates to Gen3 will fall under the Change Control Board (CCB). Rather than guarantee that DM will not "break" existing repositories, I think it should be left to the CCB to decide whether changes that would "break" existing repos should be allowed (i.e., in some unforeseen case, developers would have to demonstrate that providing migration support is so onerous that a change requiring users to wholesale abandon existing repos should be permitted).

            John Swinbank added a comment -

            We discussed this at the DM-CCB meeting of 2020-07-29. We think the next step on this RFC is to capture a proposed change to the Developer Guide so that we can comment and agree upon the detailed wording resulting from this RFC. Eli Rykoff, as the author of the RFC, would you like to file a ticket/open a PR against dm_dev_guide for this?

            Kian-Tat Lim added a comment -

            We further discussed this at the DM-CCB meeting of 2020-12-02. We think that detailed text for the Dev Guide on datasets for testing, including where they live and how to add to them, depends on RFC-741 for Gen3 (and that Gen2 should not be a major concern), but a paragraph laying out the principles espoused by this RFC can be written. Jim Bosch will do so on this RFC, allowing it to be adopted and implemented; a subsequent RFC (after RFC-741) can then add the details.

            Jim Bosch added a comment -

            I think the main principles to capture here are:

            • We will have a testdata_backcomp or similar Git LFS repository containing files and one or more Gen3 data repository databases, with detailed structure TBD, and a backcomp package containing test scripts that attempt to load the data from the testdata package and inspect it via (Gen3) butler interfaces. I think we will need the test data to simultaneously contain multiple repository versions - at least one "stable" repo that we need to continue to be able to read, and a "main" repo that may differ from it - but whether this is handled by, e.g., multiple directories, multiple branches, or multiple testdata packages is TBD. At least at first, I envision these repositories being backed by PosixDatastore and SQLite databases, as we expect backwards compatibility issues that are specific to other datastores or database backends not to be dataset-specific.
            • When a new public dataset type and/or new file content (e.g., a new Exposure component, or type thereof) is added to the Pipeline used in one or more of our "standard" reprocessing runs (HSC RC2, DM's DC2 subset, and ap_verify, right now), or added to the shared /datasets repository via an ingest procedure, an instance of that dataset type and/or file content must be added to the "main" repo of testdata_backcomp, and a corresponding test must be added to backcomp as well. The backcomp package will contain scripts to make importing that new data from the /datasets repository easier (see the sketch after this list), as well as test utility code to make most new test scripts trivial variants on existing ones. This tooling will take some time to develop and refine, and may not be present in the first version of the backcomp package. Adding new content to the backwards compatibility system does not require an RFC.
            • Non-public dataset types may also exist; these may be configurable diagnostics or an implementation detail shared by closely-related PipelineTasks. I probably need to go add some way to mark those on RFC-741. These do not need to be included in the backwards compatibility system, as long as they do not affect the code used to read public datasets.
            • Dropping support for reading some kind of file content or removing a public dataset type from a standard-reprocessing Pipeline requires an RFC and generally a full-release deprecation period. Our code interface deprecation policy applies here, and CCB approval is needed to skip the deprecation period. After RFC approval, that dataset type or file content should be removed from the "main" backwards compatibility data repository in testdata_backcomp, along with associated test scripts in backcomp.
            • When a major DM software version is released, the "main" data repository becomes the new "stable" data repository (or one of a few stable repositories; TBD) in testdata_backcomp, and a new "main" data repository is created with the same content. We will attempt to make this new repository use some sort of "copy on write" to reduce bloat, but details are TBD.
            • When code changes would change the content of new repositories in a way that would be forwards-incompatible (old code cannot read new repos), even if this change is backwards compatible (new code can continue to read the old repo), an RFC is also required; when approved, "main" is again moved to "stable", and a new "main" data repository is created. The RFC is immediately CCB-flagged if it also involves backwards incompatibility (new code cannot read old repos). The RFC should also include a proposal for what to do with major shared "live" data repositories (e.g., migrate now, migrate after the next major release, migrate whenever some future more-important change forces a migration anyway...), and information on what migration entails. These changes include but are not limited to database schema changes; changes to what raw ingest or instrument registration does for a particular instrument may also qualify. It is expected that, whenever possible, pending changes of this type will be guarded by a feature flag that keeps the new repository format from changing by default, and bundled together (to reduce migrations); this paragraph applies when the flags are flipped to enable those changes by default in new repositories.
            • Code changes that change new-repository content in a completely compatible way that may nevertheless affect queries against repos (e.g. slight breaks in reproducibility) are in a bit of a grey area here. I envision these as being primarily tiny schema changes (e.g. string column lengths), filter/detector-definition changes, and raw-header-metadata patches. At least to start, I think we should treat these as being subject to RFC, but where the argument will often be made that migrations (including raw re-ingestions) should be performed on live repos without involving the backwards-compatibility-system and the extra work it entails.
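
            As one possible shape for the import tooling mentioned above, a minimal sketch using the Gen3 export/import round trip; repo paths, collection names, and the pick-the-first-ref selection are assumptions, not settled design:

                # import_sample.py (hypothetical): copy one instance of a dataset
                # type from a shared repository into the "main" backcomp repo.
                from lsst.daf.butler import Butler

                def import_sample(src_repo, src_collection, dataset_type,
                                  dest_repo, export_dir="backcomp_export"):
                    """Export one example of ``dataset_type`` from ``src_repo``
                    and import it, with its file artifact, into ``dest_repo``."""
                    src = Butler(src_repo, collections=[src_collection])
                    # Take the first matching ref; real tooling would let the
                    # developer choose a specific data ID.
                    ref = next(iter(src.registry.queryDatasets(
                        dataset_type, collections=[src_collection])))
                    with src.export(directory=export_dir, filename="export.yaml",
                                    transfer="copy") as export:
                        export.saveDatasets([ref])
                    dest = Butler(dest_repo, writeable=True)
                    dest.import_(directory=export_dir, filename="export.yaml",
                                 transfer="copy")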

            I think going further than this is really "implementation" work for the RFC, though that obviously includes more design at a lower level. But it's possible that even the above has major logic-holes or omissions, so I'm going to leave this up until the Wednesday (2020-12-09) CCB meeting, intending to adopt and add one or more implementation tickets then if no major issues are raised first.

            Jim Bosch added a comment -

            Finally adopting this, while deferring the remaining design-detail work to implementation tickets as per my previous comment.


              People

              Assignee:
              Eli Rykoff
              Reporter:
              Eli Rykoff
              Watchers:
              Eli Rykoff, Jim Bosch, John Swinbank, Kian-Tat Lim, Krzysztof Findeisen, Leanne Guy, Robert Gruendl [X] (Inactive), Simon Krughoff, Tim Jenness, Yusra AlSayyad
              Votes:
              1

