RFC-184

Proposed changes to Butler (and its API) for multiple repository support

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      We need to make some changes to Butler to better support using multiple repositories. We discussed this with KT & Gregory, and in the RFD. The new design should allow:

      • multiple input and output repositories to be managed by Butler
      • control over which input repositories are used when calling Butler.get (see "tagging" below)
      • saving intermediate outputs for reruns (this is not discussed further in this RFC)
      • Butler.put to write objects to the correct repository or repositories (not to all repositories) according to the dataset type

      We decided:

      • Butler will manage its own input and output repositories, and it may have any number of each (including zero).
      • The API to instantiate a butler and its repositories will be Butler(inputs=..., outputs=...), where inputs and outputs are repository configurations or sequences of repository configurations (see the sketch after this list).
        • Changing the API in existing code would be a bit of work, but we can manage it so that existing code does not have to change (until it needs to take advantage of new butler features): we will use keyword arguments inputs and outputs and will not modify the default signature, except to give root the default value None. This allows the old API to be used by default and the new API to be called explicitly.
      • Butler will only read from input repositories and will treat outputs as read-write (repositories may enforce write-only rules internally).
      • Repositories will keep track of their parents.
      • All the input repositories will become parents of each output repository.
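
      A minimal sketch of the proposed constructor, assuming the existing mapper and **mapperArgs parameters are kept unchanged (the body is illustrative only, not the implementation):

      class Butler(object):
          def __init__(self, root=None, mapper=None, inputs=None, outputs=None, **mapperArgs):
              # Old API: Butler(root) still works, because root keeps its position
              # and merely gains the default value None.
              # New API: Butler(inputs=..., outputs=...), where each argument is a
              # repository configuration or a sequence of repository configurations.
              ...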

      Repository configurations can be stored inside a repository and can also be stored separately from the repository (there are implications to each; more info below).

      Inputs and Outputs

      Butler will only perform read actions on input repositories and will perform read and write actions on output repositories. Repositories may also have an internal mode that can be one of:

      • read
      • write
      • read-write

      Repository r/w mode must be enforced inside the repository. Attempting to pass a read-only repo as a butler output or a write-only repo as a butler input will raise an exception.
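
      A minimal sketch of that enforcement at Butler construction time, assuming a mode attribute with the values 'r', 'w', and 'rw' used later in this RFC (illustrative only):

      def _checkModes(inputs, outputs):
          """Reject cfgs whose internal mode conflicts with their butler role."""
          for cfg in inputs:
              if cfg.mode == 'w':  # a write-only repo cannot be a butler input
                  raise RuntimeError("write-only repository passed as Butler input")
          for cfg in outputs:
              if cfg.mode == 'r':  # a read-only repo cannot be a butler output
                  raise RuntimeError("read-only repository passed as Butler output")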

      Input repository configurations must specify certain parameters. In some cases output configurations may be more sparsely populated and derive parameter values from the input configurations, but this only works when the inputs are uniform. See the next section.

      Output configuration derived from inputs

      Some settings for output configurations can be derived from input configurations. For example, if an output configuration does not specify a mapper, the input mapper may be assumed. This works as long as all the input repositories use the same type of mapper; if the inputs use different mapper types then no single mapper type can be inferred for the output repositories. When possible, the butler will use settings from input configurations to complete output configurations.
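
      A sketch of that inference rule, assuming configurations expose a mapper attribute (the names are illustrative):

      def _inferOutputMapper(inputCfgs):
          """Return the mapper type shared by all inputs, or raise if they disagree."""
          mappers = {cfg.mapper for cfg in inputCfgs}
          if len(mappers) != 1:
              raise RuntimeError("cannot infer an output mapper: inputs use different mappers")
          return mappers.pop()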

      Search Order

      The order of repositories passed to inputs and outputs is meaningful; search is depth-first and in order (left to right). See the attached diagram.
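
      A sketch of that traversal, assuming each repository exposes an ordered parents sequence (illustrative only):

      def _searchOrder(repos):
          """Yield repositories depth-first, left to right: each repository, then
          its parents recursively, before moving on to the next sibling."""
          for repo in repos:
              yield repo
              for parent in _searchOrder(repo.parents):
                  yield parent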

      Tagging

      Input repositories can be “tagged” with a temporary id that gets used when reading from a
      repository. RepositoryCfg will have a method to ‘tag’ a repository with a value or object. A repository
      can be tagged with more than one tag by passing in a container of tags. The tag is not persisted with the
      repository.

      Originally we were planning to extend the Butler API so that Butler functions that perform read operations on repositories would take an optional tag argument, e.g. def get(self, datasetType, dataId={}, immediate=False, tag=None, **rest). After conversations with Jim Bosch it became clear that the dataId and tag should be thought of as a unit. This allows the dataId to encapsulate the fact that it is intended for a particular repository (or repositories), which implicitly means that the dataId keys conform to the semantics of the mapper and policy in that repository.

      A new class will be created that contains the dataId dictionary and the tag.

      import collections

      class DataId(collections.UserDict):
          def __init__(self, id=None, tag=None):  # id=None avoids a shared mutable default
              collections.UserDict.__init__(self, id)
              self.tag = tag
      

      • tag may be a string or other type, including container types. When searching repositories, if the tag argument is not None, then a repository will only be searched if its tag equals the value of tag or, when either side is a container of tags, if any of the tags match (see the sketch after these bullets).
      • When searching, if an input repository is tagged, all of its parents will be searched (even if they do not
        have a tag).
      • The Butler API will remain backwards compatible: if a dict is passed for dataId, it will be wrapped in a DataId object internally as needed. For example, both of the following will work (but only the latter will limit the search according to tagged repositories):

        # dataId-as-dict
        butler.get("calexp", {'visit':1, 'ccd':1})
        # dataId-as-DataId
        butler.get("calexp", DataId(id={'visit':1, 'ccd':1}, tag='CFHT'))
        

      • DataId will be a type of dict; the map functions will access the contained id (the key-value part of the dataId), so existing calls on the dataId by e.g. mapper subclasses will continue to work.
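
      A sketch of the matching rule, treating a non-container tag as a one-element set (the helper names are illustrative):

      def _asSet(tag):
          """Normalize a tag or a container of tags to a set."""
          if isinstance(tag, (set, frozenset, list, tuple)):
              return set(tag)
          return {tag}

      def _tagsMatch(searchTag, repoTags):
          """True if searchTag is None or shares at least one tag with repoTags."""
          return searchTag is None or bool(_asSet(searchTag) & _asSet(repoTags))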

      Repository Configuration

      Repository Configurations can be in different states. The state dictates which parameters must be populated (a sketch of one such check follows the list).

      • Getting ready to butler.put() a configuration to a repo-of-repos
        • Mapper defined.
        • Storage & Root might be defined (they would often be left undefined and then be defined by the metarepo; this is useful when the repos exist ‘in place’, directly within the metarepo).
        • Parents may be defined.
        • Mode must be defined.
      • Ready to use as Butler input:
        • Points to a location of an existing repository but no other info is known.
          • Mapper not defined.
          • Storage & Root must be defined.
          • Parents not defined.
          • Mode not defined.
        • Info is known (the cfg has already been deserialized, or does not need to be deserialized)
          • Mapper might be defined (in Butler it could be inferred from input repository/repositories).
          • Storage & Root must be defined.
          • Parents must be defined (if any).
          • Mode must be defined.
      • Ready to use as Butler output
        • Mapper may be defined. (If it is not and Butler’s inputs all have the same mapper then that mapper will be added to the cfg. If Butler’s inputs have different mapper types then Butler will throw instead of assigning the mapper).
        • Storage & Root must be defined.
        • Parents must not be defined. (The Butler’s inputs will be added as parents to the cfg).
        • Mode must be defined (one of ‘w’ or ‘rw’. ‘r’ (read-only) will throw).
      • Serialized in place (cfg resides in root of repository)
        • Mapper must be defined.
        • Storage & Root may be defined (at least in some cases they can be inferred from the physical location of the repository).
        • Parents must be defined.
        • Mode must be defined.
      • Serialized elsewhere (cfg does NOT reside in the repository)
        • If the repo also has a cfg in place:
          • Same as “points to a location of an existing repository…”.
        • If the repo does not have a cfg in place, or if other settings are desired:
          • Mapper must be defined.
          • Storage & Root must be defined.
          • Parents must be defined.
          • Mode must be defined.
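
      A minimal sketch of the “ready to use as Butler output” check from the list above (the attribute names are assumptions, not a confirmed API):

      def _checkOutputCfg(cfg):
          """Validate a cfg against the 'ready to use as Butler output' state."""
          if cfg.storage is None or cfg.root is None:
              raise RuntimeError("output cfg must define Storage & Root")
          if cfg.parents:
              raise RuntimeError("output cfg must not define parents; Butler adds its inputs")
          if cfg.mode not in ('w', 'rw'):
              raise RuntimeError("output cfg mode must be 'w' or 'rw'")
          # cfg.mapper may be None here; Butler will try to infer it from the inputs.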

      Getting a Configuration

      A configuration can be retrieved simply by specifying root. On a posix system this will be a string (a path); in other storages it may need to be a more complex object. I propose to add a static getter:

      class RepositoryCfg:
          @staticmethod
          def getCfg(root):
              """Return a RepositoryCfg from the cfg stored at root."""
              ...
      
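      For example, on a posix storage (the path is illustrative):

      import lsst.daf.persistence as dp
      cfg = dp.RepositoryCfg.getCfg("/data/repos/myRepo")
      butler = dp.Butler(inputs=cfg)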

      Examples

      Creating, getting, and using repository cfg

      “I got this input repo config from some location (the location in this case does not matter) and I want to instantiate a butler with it as my input repo, and with an output at <here>.”
      For a single output, say <here> is at "foo/bar".
      For multiple outputs, say <here> is at "foo/bar" and "foo/baz".

      # single input & output
      import lsst.daf.persistence as dp
      inputCfg = foo.getCfg()
      outputCfg = dp.RepositoryCfg()
      outputCfg.setRoot("foo/bar")
      butler = dp.Butler(inputs=inputCfg, outputs=outputCfg)
       
      # multiple inputs & outputs
      import lsst.daf.persistence as dp
      inputCfgs = foo.getCfgs()
      outputCfgs = []
      for root in ("foo/bar", "foo/baz"):
          cfg = dp.RepositoryCfg()
          cfg.setRoot(root)
          # I think in actual use there would be some other difference between the
          # repositories that would be the reason to have two output repositories.
          # Maybe they exist in different storages (for example, posix and S3),
          # or have different mappers.
          outputCfgs.append(cfg)
      butler = dp.Butler(inputs=inputCfgs, outputs=outputCfgs)
      

      “I got this input repo config AND an output repo config from some location (that does not matter) and I want to set up a butler with it as my input repo, and with that output repo”

      # single input & output
      import lsst.daf.persistence as dp
      inputCfg = foo.getInputCfg()
      outputCfg = foo.getOutputCfg()
      butler = dp.Butler(inputs=inputCfg, outputs=outputCfg)
       
      # multiple inputs & outputs
      import lsst.daf.persistence as dp
      inputCfgs = foo.getInputCfgs()
      outputCfgs = foo.getOutputCfgs()
      butler = dp.Butler(inputs=inputCfgs, outputs=outputCfgs)
      

      “I’m creating a new output repository (with no parents) and want it to go <here>, with this mapper.”
      For a single output, say <here> is at "foo/bar".
      For multiple outputs, say <here> is at "foo/bar" and "foo/baz".

      # single output
      import lsst.daf.persistence as dp
      outputCfg = dp.RepositoryCfg()
      outputCfg.setRoot("foo/bar")
       
      # any of these should be supported as the argument to setMapper:
      if option1:
          outputCfg.setMapper(dp.CameraMapper)
      elif option2:
          outputCfg.setMapper('lsst.daf.butlerUtils.CameraMapper')
      else: # option3
          mapper = MyMapper()
          outputCfg.setMapper(mapper)
      butler = dp.Butler(outputs=outputCfg)
       
      # multiple outputs
      import lsst.daf.persistence as dp
      outputCfgs = []
      for root in ("foo/bar", "foo/baz"):
          outputCfg = dp.RepositoryCfg()
          outputCfg.setRoot(root)
          # any of these should be supported as the argument to setMapper:
          if option1:
              outputCfg.setMapper(dp.CameraMapper)
          elif option2:
              outputCfg.setMapper('lsst.daf.butlerUtils.CameraMapper')
          else:  # option3
              mapper = MyMapper()
              outputCfg.setMapper(mapper)
          outputCfgs.append(outputCfg)
      butler = dp.Butler(outputs=outputCfgs)
      

      Repository Tagging

      “I have two repositories that contain datasets with identical dataset type and dataId but of different versions, and I want to compare datasets from those repositories.”

      import lsst.daf.persistence as dp
      oldCfg = getCfg(version=1)
      newCfg = getCfg(version=2)

      oldCfg.setTag('old')
      newCfg.setTag('new')

      butler = dp.Butler(inputs=(oldCfg, newCfg))
      oldData = butler.get(datasetType='data', dataId=dp.DataId(id={'someId': 123}, tag='old'))
      newData = butler.get(datasetType='data', dataId=dp.DataId(id={'someId': 123}, tag='new'))

      # and then, for example:
      results = compare(oldData, newData)
      ...
      

      Diagram

      [Search-order diagram attached to the issue; see "Search Order" above.]
            Activity

            Jim Bosch added a comment -

            > In this case you're exposing the semantic differences of the dataIds to the task; isn't it reasonable to expect that the Task will have to know about the different meanings of the dataIds (both the synonyms and the homonyms)?

            It's important to distinguish between the Task and the activator/user here. The Task should just see an already-constructed Butler and a mostly-opaque list of data IDs (it's not entirely opaque because it needs to be able to group them in some mapper-defined way). The activator/user constructs the Butler and the list of data IDs, and is in a position to attach the appropriate tag to each data ID.

            But it sounds like that per-dataID tag is all we need to add to what the Task gets, and I agree that makes sense - but rather than change task interfaces to take tuples of (dataID, tag), it'd be much preferable to just package the tag inside the data ID. Would that be possible?

            > By “aggregated data product”, I think you mean a python object that has been created using objects from 2 or more repositories from as many different cameras (is that right?).

            Right, but not just a Python object - a Python object we'll want to persist in the output repo.

            > For dataId: I think you'll have to choose a mapper and set of dataId keys with the semantics that you want to use. What other aspects of this am I missing?

            What I'm getting at is the pattern for how the mapper for the output repository is defined, and how the set of concrete mappers relates to the set of concrete datasets. There are some datasets that need to be defined differently for each camera (like the "calexp" that's an input in this example), while other datasets probably could be defined the same way for all cameras (such as the aggregate data products), but aren't mandated to be. We need some way to automatically define a mapper for the output from the input mappers, and I'm worried about figuring out how that automatic mapper defines those output data products.

            I think that in the current design, the only way mappers are automatically defined is by just delegating to one of the mappers from the input repositories. But if the input mappers disagree on how to define the aggregate output, this could get very confusing; it greatly increases the visibility of the selection of the "winning" mapper. I think we can probably address this by ensuring somehow that potential aggregate datasets are defined consistently across mappers, but I'm also a bit worried that one of those mappers might have a good reason to define them differently, which we probably wouldn't want to utilize unless we explicitly ask for it.

            On the other hand, maybe this is a moot point once we have dataset prototypes, because it will then be the Task's job to define the output dataset directly in the output repository, and hence none of the camera-specific mappers will be in play?

            Nate Pease added a comment - edited

            > What I'm getting at is the pattern for how the mapper for the output repository is defined

            > I think that in the current design, the only way mappers are automatically defined is by just delegating to one of the mappers from the input repositories. But if the input mappers disagree on how to define the aggregate output, this could get very confusing

            Very confusing indeed! In fact: impossible (IMO).

            I said in "Output configuration derived from inputs" that when the inputs use the same mapper, the output mapper may be assumed to be of the same type. Implicit in this (although not explicitly stated) is that when the inputs use different mappers, the output configuration must specify the mapper. That would be done by the Activator when it was configuring the butler for the task, I think.

            Regarding the dataId and tag: I'm having a hard time figuring out how the Task will use the dataId without having some understanding of the semantic meaning of the dataId keys. I guess the "mostly opaque list of dataIds" is passed to the Task by the Activator, and it passes those through to the butler directly (without inspecting or modifying the list)? So then the Activator owns the role of knowing what the mapper is and therefore which dataIds to use?
            Does the Activator also populate the values in the dataId or is that done by the Task?
            Is the dataset type hard coded in the task, or is it passed to the task by the Activator as well?

            Jim Bosch: Maybe it would help to talk in real time about dataId and the activator & task for 30 minutes or so? I have time this afternoon if you’re available, or Tuesday.

            Jim Bosch added a comment -

            Talking in person sounds like a good idea, but unfortunately this afternoon is pretty full for me. I should be free any time within Tuesday 11am-3pm PDT for a half-hour talk.

            Nate Pease added a comment -

            Still need to talk with Jim Bosch about some issues, and update the RFC as needed. Moving the planned end to this Friday.

            Nate Pease added a comment -

            After conversation with Jim Bosch and Kian-Tat Lim I've made a fairly substantial change to the Tagging section, and added a new class for dataId (a move away from keeping the dataId in a pure dict). I've pushed the end of this RFC to Tuesday which I hope will give stakeholders time to read & process. If you need more time please let me know.


              People

              • Assignee:
                Nate Pease
              • Reporter:
                Nate Pease
              • Watchers:
                Gregory Dubois-Felsmann, Jim Bosch, Kian-Tat Lim, Matias Carrasco Kind, Nate Pease, Robert Lupton, Russell Owen, Tim Jenness, Xiuqin Wu
