RFC-663

Collections in the Gen3 Butler

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:

      Description

      While nothing in the Gen3 Butler should be considered a stable public API yet, I'm planning a number of changes to the collections system that I'd like to get feedback on from both active Gen3 developers and people looking forward to using it in the future. This attempts to address both user feedback (e.g. DM-16876, DM-21246) and some internal edge-case unsoundness in the current Gen3 system.

      Types of Collections

      This section describes the types of collection that will exist in the Butler; Science Pipelines developers are probably more interested in how they are used (the next section), but should understand the definitions here first.

      RUN

      A dataset is always (and only) added to a run-type collection when it is created, and is removed only when it is deleted entirely.
      Runs are associated with provenance information (probably software versions, maybe configs).
      Their relationship with the datasets they contain is simple but rigid; this makes them very useful for debugging but inconvenient most of the rest of the time.

      When a file-based datastore is used, the filename template will be based on the dataset's RUN collection name.

      Attempting to insert a new dataset with the same dataset type and data ID as one that already exists in a run is an error.

      TAGGED

      Datasets can be added to and removed from tagged collections at will.
      They are the most flexible type of collection, but are relatively expensive (there is a database row for every collection-dataset combination), so they shouldn't be used in contexts where a collection should be implicitly created but might never actually be used.

      Attempting to add a dataset to a tagged collection that already has a different dataset with the same dataset type and data ID will cause the original dataset to be removed from the collection.
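
      The difference between RUN and TAGGED conflict handling can be illustrated with a toy in-memory model. This is only a sketch of the semantics described above, not Butler code: the class names, method names, and the use of plain tuples as data IDs are all assumptions made for illustration.

```python
class ConflictError(Exception):
    """Raised when a RUN already holds a dataset with the same key."""


class RunCollection:
    """Hypothetical model of a RUN: inserting a duplicate key is an error."""

    def __init__(self, name):
        self.name = name
        self._datasets = {}  # (dataset_type, data_id) -> dataset id

    def insert(self, dataset_type, data_id, dataset_id):
        key = (dataset_type, data_id)
        if key in self._datasets:
            raise ConflictError(f"{key} already exists in run {self.name!r}")
        self._datasets[key] = dataset_id

    def find(self, dataset_type, data_id):
        return self._datasets.get((dataset_type, data_id))


class TaggedCollection:
    """Hypothetical model of a TAGGED collection: a duplicate key displaces."""

    def __init__(self, name):
        self.name = name
        self._datasets = {}  # (dataset_type, data_id) -> dataset id

    def associate(self, dataset_type, data_id, dataset_id):
        """Add a dataset and return the id it displaced, or None."""
        key = (dataset_type, data_id)
        displaced = self._datasets.get(key)
        self._datasets[key] = dataset_id
        return displaced if displaced != dataset_id else None

    def disassociate(self, dataset_type, data_id):
        """Remove a dataset from this collection only; it still exists elsewhere."""
        self._datasets.pop((dataset_type, data_id), None)

    def find(self, dataset_type, data_id):
        return self._datasets.get((dataset_type, data_id))
```

      In this model the displaced dataset is only dropped from the tagged collection; it would still live in its RUN.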

      VIRTUAL CHAINED

      A virtual chained collection is a stack of other collections; the first dataset found with a particular dataset type and data ID when searching its constituent collections (from "top" to "bottom") is the one "in" the virtual chained collection.
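
      The top-to-bottom search can be sketched in a few lines. This is a minimal illustration, assuming each constituent collection is represented by a plain dictionary keyed on (dataset type, data ID); the function name and the collection names are hypothetical.

```python
def find_in_chain(chain, dataset_type, data_id):
    """Return (collection_name, dataset_id) for the first match, or None.

    `chain` is a list of (name, contents) pairs ordered from top to bottom,
    where `contents` maps (dataset_type, data_id) to a dataset id.
    """
    key = (dataset_type, data_id)
    for name, contents in chain:
        if key in contents:
            return name, contents[key]
    return None


# The newer run sits on top and shadows the older dataset for visit 1.
run_new = {("calexp", ("visit", 1)): 101}
run_old = {("calexp", ("visit", 1)): 42, ("calexp", ("visit", 2)): 43}
chain = [("u/me/DM-1/run2", run_new), ("u/me/DM-1/run1", run_old)]

print(find_in_chain(chain, "calexp", ("visit", 1)))  # ('u/me/DM-1/run2', 101)
print(find_in_chain(chain, "calexp", ("visit", 2)))  # ('u/me/DM-1/run1', 43)
```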

      CALIBRATION

      Like a tagged collection, but each dataset-collection combination is associated with a validity range, and multiple datasets with the same dataset type and data ID may exist in a collection if their validity ranges do not overlap.
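
      As a rough sketch of the non-overlap rule, the toy class below stores a validity range with each association and rejects overlapping ranges for the same dataset type and data ID. The class and method names are hypothetical, and the choice of half-open ranges [begin, end) over plain numbers is an assumption; the RFC does not specify how ranges are represented.

```python
class CalibrationCollection:
    """Hypothetical model: one dataset per key per moment in time."""

    def __init__(self):
        self._entries = {}  # (dataset_type, data_id) -> [(begin, end, dataset_id)]

    def certify(self, dataset_type, data_id, dataset_id, begin, end):
        """Associate a dataset with the half-open validity range [begin, end)."""
        if begin >= end:
            raise ValueError("validity range must have begin < end")
        ranges = self._entries.setdefault((dataset_type, data_id), [])
        for other_begin, other_end, _ in ranges:
            # Two half-open ranges overlap iff each starts before the other ends.
            if begin < other_end and other_begin < end:
                raise ValueError("validity range overlaps an existing one")
        ranges.append((begin, end, dataset_id))

    def find(self, dataset_type, data_id, when):
        """Return the dataset valid at time `when`, or None."""
        for begin, end, dataset_id in self._entries.get((dataset_type, data_id), []):
            if begin <= when < end:
                return dataset_id
        return None
```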

      Initializing a Butler

      Pipeline code won't have to create a butler instance itself, but analysis code often will. After this RFC is implemented, Butler will have several constructor keyword arguments to control the collections it is associated with (replacing the current "collection" and "run" arguments):

      • collections: a dictionary mapping collection name to a bool indicating whether the collection should be written to by put and ingest calls. Writing to a VIRTUAL collection writes to the top collection in it, and exactly one RUN collection may be mapped to True after expanding VIRTUAL collections in the dictionary. The order of elements sets the order in which collections are searched for datasets.
      • rw=X: a shortcut for collections={X: True}
      • ro=X: a shortcut for collections={X: False}

      It will also be possible to pass none of these when constructing a Butler and instead pass collections directly to other butler methods.
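
      The way the rw and ro shortcuts reduce to the collections dictionary can be sketched as a small helper. This is not the Butler constructor: the function name is hypothetical, treating the three arguments as mutually exclusive is an assumption, VIRTUAL expansion is ignored, and the writeable rule is read here as "at most one" so that ro=X remains valid.

```python
def resolve_collections(collections=None, rw=None, ro=None):
    """Return (search_order, write_target) for the given keyword arguments.

    `search_order` is the list of collection names in lookup order and
    `write_target` is the single writeable collection, or None.
    """
    given = sum(arg is not None for arg in (collections, rw, ro))
    if given > 1:
        # Assumption: the shortcuts are alternatives, not additions.
        raise TypeError("pass only one of collections, rw, or ro")
    if rw is not None:
        collections = {rw: True}
    elif ro is not None:
        collections = {ro: False}
    elif collections is None:
        collections = {}  # caller will pass collections to each method instead
    writeable = [name for name, write in collections.items() if write]
    if len(writeable) > 1:
        raise ValueError(f"more than one writeable collection: {writeable}")
    # dict preserves insertion order, which doubles as the search order.
    return list(collections), (writeable[0] if writeable else None)
```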

      I've found a set of constructor arguments that I think are more flexible (in particular, they address Krzysztof Findeisen's point about specifying the type of collection to create) while still being simpler in the common case.  See https://github.com/lsst/daf_butler/blob/tickets/DM-21849/python/lsst/daf/butler/_butler.py#L126

      Collection Usage and Naming

      Shared input collections produced by ingest operations (raws, reference catalogs, etc.) may be RUN (if possible) or TAGGED (if necessary) collections. Collections containing "blessed" master calibrations will of course be CALIBRATION collections. I propose we use names of the form:

      raw/<instrument> # all raws, minus blacklisted ones
      raw/<instrument>/everything # even blacklisted ones
      raw/<instrument>/<name> # useful named subsets of raws, e.g. "raw/HSC/RC2"
      refcats # each reference catalog is a different dataset type in Gen3
      calibs/<instrument> # default, recommended calibrations
      calibs/<instrument>/<date> # other sets of master calibrations (e.g. old ones)

      Note that "named subset of raws" collections are for convenience, not necessity; the registry will include other metadata that can be used to select subsets more dynamically, the way we often do now.

      Outputs from processing will always go into a RUN collection, but the default mode of operation for the pipetask tool will be to also create a VIRTUAL collection that combines all input collections with the output RUN collection.
      In that default mode, I propose names like:

      u/<user>/<ticket>[/<whatever>]

      for the VIRTUAL collection (given via the --output command-line argument), with the RUN collection name generated automatically as

      u/<user>/<ticket>[/<whatever>]/<timestamp>

      Multiple pipetask invocations with the same --output would by default push new RUN collections to the top of the VIRTUAL collection, regardless of whether software versions or configurations change.
      These RUN collections would generally be ignored by users, but would be present (with their rigorous provenance information) when needed.
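
      The default naming and chaining behaviour can be sketched as follows. Both helper names are hypothetical, and the timestamp format is an assumption; the RFC only says the RUN name ends in a timestamp.

```python
from datetime import datetime, timezone


def make_run_name(output, now=None):
    """Build a RUN name of the form <output>/<timestamp>."""
    now = now or datetime.now(timezone.utc)
    # Assumed format: compact UTC timestamp, e.g. 20200102T030405Z.
    return f"{output}/{now.strftime('%Y%m%dT%H%M%SZ')}"


def push_run(chain, run_name):
    """Return a new chain (top to bottom) with run_name on top."""
    return [run_name] + [name for name in chain if name != run_name]


# Two invocations with the same --output stack their runs above the inputs.
chain = ["raw/HSC", "calibs/HSC", "refcats"]
first = make_run_name("u/jbosch/DM-21849", datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
second = make_run_name("u/jbosch/DM-21849", datetime(2020, 1, 3, 3, 4, 5, tzinfo=timezone.utc))
chain = push_run(push_run(chain, first), second)
print(chain[0])  # u/jbosch/DM-21849/20200103T030405Z
```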

      I'd also propose we support the following non-default modes of operation:

      --no-chain: do not create a VIRTUAL collection, and instead use the --output argument as the RUN name.

      --extend-run: do not create any new collections, and interpret the --output argument as the name of a collection to be appended to. If this is a RUN collection, new datasets will be added to it; if it is a VIRTUAL collection, new datasets will be added to its top collection, which must be a RUN. All RUN-level provenance information must be consistent with that of the existing RUN.

      --clobber: attempt to delete the collection given by --output up front, and then write outputs to it. Incompatible with --extend-run. See the next section for the implications of this.

      I think these options are largely orthogonal to the options that specify whether to reprocess data IDs for which outputs already exist in the output collection, so I'd like to consider those out of scope for this RFC.
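
      To make the interaction between these flags concrete, here is a small argparse sketch. It is not the real pipetask parser: only the option names come from the text above, while the parser layout, the describe helper, and its wording are assumptions. The one hard constraint stated in the RFC, that --clobber is incompatible with --extend-run, is expressed as a mutually exclusive group.

```python
import argparse


def build_parser():
    """Hypothetical parser for the output-related options only."""
    parser = argparse.ArgumentParser(prog="pipetask-sketch")
    parser.add_argument("--output", required=True,
                        help="name of the output collection")
    parser.add_argument("--no-chain", action="store_true",
                        help="use --output directly as the RUN name")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--extend-run", action="store_true",
                       help="append to an existing collection")
    group.add_argument("--clobber", action="store_true",
                       help="delete the --output collection first")
    return parser


def describe(args):
    """Summarize what the chosen mode would do, per the descriptions above."""
    prefix = "delete then " if args.clobber else ""
    if args.extend_run:
        return f"append to existing collection {args.output!r}"
    if args.no_chain:
        return f"{prefix}write to RUN {args.output!r}"
    return f"{prefix}write to a new timestamped RUN chained under {args.output!r}"


if __name__ == "__main__":
    print(describe(build_parser().parse_args()))
```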

      Removing Datasets and Collections

      Removing a dataset can mean several different things:

      1. Remove it from a collection.
      2. Remove it from the Datastore.
      3. Remove it from the Registry (implies both (1) and (2)).

      Because removal from a Datastore is what usually corresponds to actually deleting a file and freeing up disk space, and removal from Registry means losing provenance information (potentially even for datasets that are not being deleted), we expect removal from Datastores and particular collections without removal from Registry to be fairly common.

      Because datasets can be referenced by multiple collections, we plan to mostly provide APIs for pruning, rather than explicitly deleting datasets:

      The original proposal below turned out to be too difficult to implement efficiently (I was basically suggesting we implement a garbage collector in a SQL database):

      • Pruning a dataset will remove it from any Datastores (and optionally the Registry) only if the only collection referencing it is its RUN collection, and no VIRTUAL collections reference that.
      • Pruning a RUN collection will prune all of its datasets, and then remove the RUN iff it has no datasets.
      • Pruning a TAGGED or CALIBRATION collection will disassociate all datasets from it and then prune them. The TAGGED collection will then (always) be removed.
      • Pruning a VIRTUAL collection will prune all of the collections it points to. The VIRTUAL collection will then (always) be removed.

      This new proposal will require a bit more care from users to avoid deleting datasets that are in use by others, but its behavior is probably more predictable (in addition to being much more feasible to implement):

      • butler.prune can be given an iterable over datasets, and by default will only remove them from the collections associated with that butler.  It can also be passed "unstore=True" to remove those datasets from the Datastore, relying on the user to be careful not to unstore something someone else may be using (i.e. via some other collection).  It can also be passed "purge=True" (requires "unstore=True", to make sure the consequences are clear), which will fully remove those datasets from the repository.  That requires even more care, because it can result in a loss of provenance information for datasets that were not among those given.
      • butler.prune can instead be given an iterable over collections (mutually exclusive with datasets to avoid confusion).  For TAGGED, CALIBRATION, and CHAINED collections, the datasets in that collection will normally be unaffected.  Once again, you can pass "unstore=True" and "purge=True" to remove from the Datastore or remove entirely, respectively.  RUN collections can only be removed if "purge=True".  Removing a collection that is referenced by a CHAINED collection that was not given is always an error; the CHAINED collection must be removed first.
      • We will provide registry query methods that yield datasets and collections that are (according to various kwarg-defined definitions) "unreferenced" and thus safe to remove, so users can use the safer, default prune modes most of the time and then occasionally clean up the garbage more completely.
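
      The argument rules in the new proposal can be collected into a single validation sketch. This is not the butler.prune implementation: the function name and the types and chains parameters are hypothetical stand-ins for information a real Registry would look up itself; only "unstore", "purge", and the constraints being checked come from the bullets above.

```python
def check_prune(datasets=None, collections=None, *, unstore=False, purge=False,
                types=None, chains=None):
    """Validate a prune request, raising on combinations the text rules out.

    `types` maps collection name -> "RUN", "TAGGED", "CALIBRATION" or "CHAINED".
    `chains` maps each CHAINED collection name -> list of its child collections.
    """
    if (datasets is None) == (collections is None):
        raise TypeError("pass exactly one of datasets or collections")
    if purge and not unstore:
        raise TypeError("purge=True requires unstore=True")
    if collections is None:
        return  # dataset mode has no further structural checks in this sketch
    doomed = set(collections)
    types = types or {}
    chains = chains or {}
    for name in doomed:
        if types.get(name) == "RUN" and not purge:
            raise ValueError(f"RUN {name!r} can only be removed with purge=True")
    for parent, children in chains.items():
        # A child may not be removed while a surviving chain still points at it.
        if parent not in doomed and doomed.intersection(children):
            raise ValueError(f"remove CHAINED collection {parent!r} first")
```

      For example, check_prune(collections=["run1"], unstore=True, purge=True, types={"run1": "RUN"}, chains={"out": ["run1"]}) raises in this sketch because "out" still references "run1"; adding "out" to the collections being removed makes the request pass.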

       

              People

              • Assignee:
                jbosch Jim Bosch
                Reporter:
                jbosch Jim Bosch
                Watchers:
                Christopher Waters, Gregory Dubois-Felsmann, Ian Sullivan, Jim Bosch, John Parejko, Krzysztof Findeisen, Tim Jenness