  1. Data Management
  2. DM-4168

productize "Data repository selection based on version"

    Details

    • Type: Story
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: butler
    • Labels:
      None
    • Story Points:
      6
    • Sprint:
      DB_W16_02
    • Team:
      Data Access and Database

      Description

      Finish and productize the work from DM-5608.

            Activity

            npease Nate Pease added a comment -

            Consult https://jira.lsstcorp.org/browse/RFC-95 (search for “version”) and read down from there to see how they want to configure and specify repository roots. They talk about reruns a lot, and that is captured here. What is not captured:
            There may be multiple versions of a repository (such as data release 1 and data release 2), and users need to be able to select between them easily.
            We also want to be able to select different versions of different reference catalogs using the butler (right now they are selected through EUPS).
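
To make the request concrete, here is a minimal sketch of selecting a repository root by name and version. Everything in it is an assumption for illustration: `RepositoryCatalog`, its methods, and the paths are hypothetical and are not part of the butler.

```python
from dataclasses import dataclass, field


@dataclass
class RepositoryCatalog:
    """Hypothetical registry mapping (name, version) to a repository root."""

    _roots: dict = field(default_factory=dict)

    def register(self, name, version, root):
        """Record the root path for one version of a named repository."""
        self._roots.setdefault(name, {})[version] = root

    def select(self, name, version=None):
        """Return the root for `name` at `version`.

        With no version given, use the most recently registered one.
        """
        versions = self._roots[name]
        if version is None:
            # dicts preserve insertion order, so the last key is the newest.
            version = next(reversed(versions))
        return versions[version]


catalog = RepositoryCatalog()
catalog.register("survey", "DR1", "/repos/survey/dr1")  # hypothetical paths
catalog.register("survey", "DR2", "/repos/survey/dr2")
catalog.register("refcat", "v1", "/repos/refcat/v1")

dr1_root = catalog.select("survey", "DR1")
latest_root = catalog.select("survey")
```

The same lookup serves both cases raised above: data releases and reference catalogs are simply different names in the catalog.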

            rowen Russell Owen added a comment -

            There are other important aspects to this. For example, color term correction data is keyed by two different items, both of which may change with time: the camera used to collect the data being corrected (possibly even the type of CCD in that camera), and the reference catalog.
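
A lookup keyed this way could be sketched as follows. The key type, the fallback from a CCD-type-specific entry to a camera-wide one, and all names and values are assumptions made for illustration, not existing butler behavior.

```python
from typing import NamedTuple, Optional


class ColorTermKey(NamedTuple):
    """Hypothetical composite key for color term correction data."""

    camera: str
    reference_catalog: str
    ccd_type: Optional[str] = None  # None means "any CCD type in this camera"


def find_color_terms(table, camera, reference_catalog, ccd_type=None):
    """Prefer a CCD-type-specific entry, then fall back to the camera-wide one."""
    candidates = (
        ColorTermKey(camera, reference_catalog, ccd_type),
        ColorTermKey(camera, reference_catalog),
    )
    for key in candidates:
        if key in table:
            return table[key]
    raise KeyError((camera, reference_catalog, ccd_type))


# Placeholder entries; the strings stand in for real correction data.
table = {
    ColorTermKey("camA", "refcat-v1"): "camera-wide terms",
    ColorTermKey("camA", "refcat-v1", "typeX"): "typeX terms",
}

specific = find_color_terms(table, "camA", "refcat-v1", "typeX")
fallback = find_color_terms(table, "camA", "refcat-v1", "typeY")
```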

            krughoff Simon Krughoff added a comment -

            I would also really like to be able to do this by time. E.g. "Butler, please give me the color correction terms I should have used if I was reducing this data last week."

            The default would be "now", and I think we will want to tag certain times as special, e.g. Release x.
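
One way to picture a time-based lookup with a default of "now" and named tags is the sketch below. `TimeVersionedStore` and its methods are hypothetical; the dates and values are made up for the example.

```python
import bisect
from datetime import datetime, timezone


class TimeVersionedStore:
    """Hypothetical store that returns the value in effect at a given time."""

    def __init__(self):
        self._times = []   # sorted "effective from" timestamps
        self._values = []  # values parallel to _times
        self._tags = {}    # tag name -> timestamp

    def add(self, effective_from, value):
        """Insert a value that takes effect at `effective_from`."""
        index = bisect.bisect_right(self._times, effective_from)
        self._times.insert(index, effective_from)
        self._values.insert(index, value)

    def tag(self, name, when):
        """Give a point in time a name, such as a release."""
        self._tags[name] = when

    def get(self, when=None):
        """Look up by datetime, by tag name, or (by default) at the current time."""
        if when is None:
            when = datetime.now(timezone.utc)
        elif isinstance(when, str):
            when = self._tags[when]
        # The entry in effect is the last one whose start is <= `when`.
        index = bisect.bisect_right(self._times, when) - 1
        if index < 0:
            raise LookupError(f"no entry in effect at {when}")
        return self._values[index]


store = TimeVersionedStore()
store.add(datetime(2016, 1, 1, tzinfo=timezone.utc), "colorterms-a")
store.add(datetime(2016, 3, 1, tzinfo=timezone.utc), "colorterms-b")
store.tag("Release x", datetime(2016, 2, 1, tzinfo=timezone.utc))

at_release = store.get("Release x")
at_now = store.get()
```

Keeping the timestamps sorted means each lookup is a binary search, and a tag is nothing more than a remembered timestamp.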

            ktl Kian-Tat Lim added a comment -

            The direction the code is going is a good one; in particular, the repository of repository configs is a good primitive for this and similar use cases. But there are still substantial portions of this code (some unrelated to the ticket itself) that do not feel like they yet form a releasable feature set that we could announce in release notes. I've been thinking a bit about how to deal with complex, multi-part, interdependent developments like this with the goal of making sure that users are not disrupted while new, not-quite-ready features are being built, and I think it comes down to two alternatives:

            • Merging to a long-lived integration branch that is not master
            • Merging to master with a "version switch" triggered by (in this case) a Butler construction argument or perhaps an environment variable that only enables the new interfaces and implementation when explicitly requested.

            If interface changes are extensive and could be frequent due to uncertainty and evolution, then the first is probably preferable to minimize disruption to dependent package users until the interface is firmed up. If interface changes are expected to be minimal and infrequent because the interface is well-defined, then the second could be acceptable and would help with the eventual merge since dependent package users could help maintain compatibility with both versions while making unrelated changes.
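
The second alternative, a "version switch", might look roughly like this. It is only a sketch: the factory function, the two stand-in classes, and the environment variable name `EXAMPLE_BUTLER_NEW_API` are all hypothetical.

```python
import os

ENV_SWITCH = "EXAMPLE_BUTLER_NEW_API"  # hypothetical variable name


class LegacyButler:
    """Stand-in for the existing, released implementation."""

    def __init__(self, root):
        self.root = root


class NewButler:
    """Stand-in for the new, not-yet-released implementation."""

    def __init__(self, root):
        self.root = root


def make_butler(root, new_api=None):
    """Return the new implementation only when explicitly requested.

    An explicit `new_api` argument wins; otherwise the environment decides.
    """
    if new_api is None:
        # Anything other than "1" keeps the existing code path.
        new_api = os.environ.get(ENV_SWITCH) == "1"
    return NewButler(root) if new_api else LegacyButler(root)
```

With this shape, existing callers see no change unless they pass the argument or set the variable, which is the property that keeps dependent packages undisturbed.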

            One of the things I worry about here is that the current lack of definition around Access and Storage (and the entire plugin serialization model) means that repository configurations are still unstable. The current code appears to expose this in both construction of Repositories (and hence Butlers) and in the persisted configuration files (which do not appear to have explicit code for dealing with evolution). On the other hand, we may be able to present an external interface that hides all of this complexity by providing a normal use case with pre-existing or internally-generated configurations (unlike the code-based example in LDM-463 for this ticket), in which case a version switch and "don't look behind the curtain" could be acceptable. (Note that modifications to existing configurations via code or manual overrides will become a normal use case in the future, so that interface does need to be fully defined and exposed.)
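
For reference, explicit handling of configuration evolution is often done by stamping each persisted file with a schema version and migrating on load. The sketch below assumes a JSON file and an invented change between versions 1 and 2; neither reflects the actual repository configuration format.

```python
import json

CURRENT_VERSION = 2


def _v1_to_v2(cfg):
    # Hypothetical change: version 2 replaces the single "root" with a "roots" list.
    cfg = dict(cfg)
    cfg["roots"] = [cfg.pop("root")]
    cfg["version"] = 2
    return cfg


# Each entry upgrades a config from the keyed version to the next one.
MIGRATIONS = {1: _v1_to_v2}


def load_config(text):
    """Parse a persisted config and upgrade it to CURRENT_VERSION."""
    cfg = json.loads(text)
    cfg.setdefault("version", 1)  # treat unversioned files as version 1
    while cfg["version"] < CURRENT_VERSION:
        cfg = MIGRATIONS[cfg["version"]](cfg)
    if cfg["version"] != CURRENT_VERSION:
        raise ValueError(f"unsupported config version {cfg['version']}")
    return cfg


old_style = load_config('{"root": "/repos/a"}')
new_style = load_config('{"version": 2, "roots": ["/repos/b"]}')
```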

            So before anything is merged to master, I would like the following to take place:

            • Decide which of the above strategies is to be used and implement it.
            • Work through a complete example of how this primitive can be deployed in a particular use case such as the multi-version, date-range-based master calibration image repository and incorporate that into LDM-463.
            • Deal with any minor code comments that I expect to make in the PR later today.
            npease Nate Pease added a comment -

            I was hoping that we could get away with not having a development branch but at this point I'm inclined to agree that something is necessary.
            What about using the daf_butler package, or a new package? I think the benefits are that it would work in CI without having to set a var, and would allow other ticket branches to be built using cutting edge butler features. I expect we could merge changes back to daf_persistence and daf_butlerUtils if needed/desired?

            npease Nate Pease added a comment -

            This story represents additional work needed beyond the recently completed butler infrastructure work to support lookup for repositories, i.e. gathering specific requirements for mappers (and possibly registries) so that a repository can be looked up given science needs.

            Next step is to write an RFC.


              People

              • Assignee:
                Unassigned
                Reporter:
                npease Nate Pease
                Reviewers:
                Kian-Tat Lim
                Watchers:
                Gregory Dubois-Felsmann, Kian-Tat Lim, Nate Pease, Russell Owen, Simon Krughoff