Uploaded image for project: 'Request For Comments'
  1. Request For Comments
  2. RFC-785

Move data ID -> integer packing state and logic out of butler

    XMLWordPrintable

    Details

    • Type: RFC
    • Status: Adopted
    • Resolution: Unresolved
    • Component/s: DM
    • Labels:
      None

      Description

      The Gen3 butler followed its predecessor in taking responsibility for packing data IDs of certain types deterministically and reversibly into integer IDs (and the number of bits those IDs require), which our code uses both to mangle into integer source IDs and as random number seeds.  The implementation is primarily in the DimensionPacker classes in daf_butler, which also has some derived classes in the skymap package. There is also logic in astro_metadata_translator to compute a detector_exposure_id integer, which is kept consistent with the DimensionPacker implementation for the same combination only via constant vigilance.

      For these kinds of packers to work, they need to know the ranges of values various dimensions can take, and that is often problematic.  For example:

      • We've baked in some pretty arbitrary guesses at maximum values for visit and exposure IDs into the instrument dimension records that are stored in the butler (making them quite hard to change).
      • The packer for (tract, patch, band) data IDs lives in skymap, and that needs to know - in advance, hard-coded - the full list of all bands anyone might ever use.  That's clearly not practical, and while I intend to just keep extending that list as needed for now (DM-29944 is what prompted this RFC), it's just a bad situation to be in.

      Moreover, because the state for these packers (e.g. maximum values) is either in the data repository or fixed in the code, we cannot change it without breaking our ability to unpack IDs in existing on-disk datasets.

      To deal with this, I propose that we:

      1. Create new data ID packer classes whose state is provided exclusively by pex_config Configuration.  Task code that wishes to use these would nest the packer's configuration within their own, and obs_ package configs and pipeline definitions would be used to provide instrument- or dataset-specific defaults for those ranges of allowable values to task-runners similar levels of convenience.
      2. The DimensionPacker hierarchy in daf_butler and the APIs that access it would be deprecated and removed (deprecation only after the new classes are available; removal a release cycle later, as usual).
      3. VisitInfo.getExposureId (or any postRFC-459 accessor for a combined visit+detector or exposure+detector integer ID) will be deprecated and ultimately removed; algorithm code that wants such an ID should instead get it via another parameter that can be met via the new configuration-based packers. VisitInfo.getExposureId will move to Exposure.id, and will be an opaque ID appropriate for just that image, usually generated in a way that is consistent with one of the new config-driven ID packers (hopefully populated by actually using them, at least most of the time). VisitInfo.id will be created as an opaque ID for either the exposure or visit ID recognized by the butler (appropriately for what that image represents).
      4. astro_metadata_translator.ObservationInfo.detector_exposure_id will be deprecated and ultimately removed.

      In addition, the instrument dimension columns exposure_max and visit_max may be removed in a future schema migration (but this would be accompanied by its own RFC, like any other migration).

        Attachments

          Issue Links

            Activity

            Hide
            jbosch Jim Bosch added a comment - - edited

            I have updated the ticket description to reflect what's on the table after my conversation with John Parejko.

            For Meredith Rawls

            How should a user retrieve visit- or detector-level metadata when all one has is an Exposure object? I tend to think of postISRCCDs or calexps (to give two badly-named examples) interchangeably as "exposures" or "visits." When I use butler.get, I specify dimensions like visit and detector in order to get an Exposure object. But let's say I need my Exposure's airmass, which I am pretty sure is a VisitInfo-flavored thing. Today I would use, e.g., Exposure.getInfo.getVisitInfo.getAirmass (or something to that effect). What am I supposed to do instead?

            You would continue to use VisitInfo just as you have so far; we're basically acknowledging that it (like afw.image.Exposure) has a name that's really badly inconsistent with the new butler dimension nomenclature, because sometimes it represents just a single exposure and sometimes it represents a visit, but we are not trying to fix the naming problem right now. If we can find some way to rename it to ObservationInfo (but that clashes with the pure-Python similar thing from astro_metadata_translator) or something similarly generic, that would be ideal, but that's a matter for a future RFC.

            Show
            jbosch Jim Bosch added a comment - - edited I have updated the ticket description to reflect what's on the table after my conversation with John Parejko . For Meredith Rawls How should a user retrieve visit- or detector-level metadata when all one has is an Exposure object? I tend to think of postISRCCDs or calexps (to give two badly-named examples) interchangeably as "exposures" or "visits." When I use butler.get, I specify dimensions like visit and detector in order to get an Exposure object. But let's say I need my Exposure's airmass, which I am pretty sure is a VisitInfo-flavored thing. Today I would use, e.g., Exposure.getInfo.getVisitInfo.getAirmass (or something to that effect). What am I supposed to do instead? You would continue to use VisitInfo just as you have so far; we're basically acknowledging that it (like afw.image.Exposure ) has a name that's really badly inconsistent with the new butler dimension nomenclature, because sometimes it represents just a single exposure and sometimes it represents a visit , but we are not trying to fix the naming problem right now. If we can find some way to rename it to ObservationInfo (but that clashes with the pure-Python similar thing from astro_metadata_translator) or something similarly generic, that would be ideal, but that's a matter for a future RFC.
            Hide
            mrawls Meredith Rawls added a comment -

            Thank you Jim and John for discussing this in depth and iterating to what seems like a good solution.

            I especially appreciate the examples for doing things like backing out the visit and detector that a source in a catalog came from.

            Please help me with another example. How should a user retrieve visit- or detector-level metadata when all one has is an Exposure object? I tend to think of postISRCCDs or calexps (to give two badly-named examples) interchangeably as "exposures" or "visits." When I use butler.get, I specify dimensions like visit and detector in order to get an Exposure object. But let's say I need my Exposure's airmass, which I am pretty sure is a VisitInfo-flavored thing. Today I would use, e.g., Exposure.getInfo.getVisitInfo.getAirmass (or something to that effect). What am I supposed to do instead?

            Show
            mrawls Meredith Rawls added a comment - Thank you Jim and John for discussing this in depth and iterating to what seems like a good solution. I especially appreciate the examples for doing things like backing out the visit and detector that a source in a catalog came from. Please help me with another example. How should a user retrieve visit- or detector-level metadata when all one has is an Exposure object? I tend to think of postISRCCDs or calexps (to give two badly-named examples) interchangeably as "exposures" or "visits." When I use butler.get, I specify dimensions like visit and detector in order to get an Exposure object. But let's say I need my Exposure's airmass, which I am pretty sure is a VisitInfo-flavored thing. Today I would use, e.g., Exposure.getInfo.getVisitInfo.getAirmass (or something to that effect). What am I supposed to do instead?
            Hide
            tjenness Tim Jenness added a comment -

            Escalating this given the extensive API changes and deprecations.

            Show
            tjenness Tim Jenness added a comment - Escalating this given the extensive API changes and deprecations.
            Hide
            Parejkoj John Parejko added a comment -

            After discussion with Jim Bosch, I think we agree that we will have these identifiers attached to an afw.Exposure, and no others:

            • VisitInfo.id : an exposure- (snap) or visit-level identifier of a full focal plane image. This corresponds to the id in either the exposure table or visit table in the butler (but which of those is opaque to the user).
            • Exposure.id : a generic identifier of an afw Exposure object. This could be a "detector exposure id", a "coadd patch id", or a "detector visit id", or something else, but is opaque to users and cannot be unpacked. This does not correspond to any identifier in a butler table.

            We do not believe we can provide guarantees about the maximum number of bits in either of these identifiers, because of their their generic nature.

            Examples of current uses of "exposureId", and their replacements include:

            • grouping all Exposures in a visit: use VisitInfo.id.
            • seeding random number generators repeatably: use either Exposure.id or Visit.id, depending on whether the seed is for a detector image or a full focal plane image.
            • identifying the image that a cutout or source came from (for example, a DiaSource): use VisitInfo.id and Detector.id.
            • constructing SourceIds: use neither of these identifiers, but rather one of the packers described in the RFC above.
            Show
            Parejkoj John Parejko added a comment - After discussion with Jim Bosch , I think we agree that we will have these identifiers attached to an afw.Exposure, and no others: VisitInfo.id : an exposure- (snap) or visit-level identifier of a full focal plane image. This corresponds to the id in either the exposure table or visit table in the butler (but which of those is opaque to the user). Exposure.id : a generic identifier of an afw Exposure object. This could be a "detector exposure id", a "coadd patch id", or a "detector visit id", or something else, but is opaque to users and cannot be unpacked. This does not correspond to any identifier in a butler table. We do not believe we can provide guarantees about the maximum number of bits in either of these identifiers, because of their their generic nature. Examples of current uses of "exposureId", and their replacements include: grouping all Exposures in a visit: use VisitInfo.id . seeding random number generators repeatably: use either Exposure.id or Visit.id , depending on whether the seed is for a detector image or a full focal plane image. identifying the image that a cutout or source came from (for example, a DiaSource): use VisitInfo.id and Detector.id . constructing SourceIds: use neither of these identifiers, but rather one of the packers described in the RFC above.
            Hide
            jbosch Jim Bosch added a comment -

            I always worry when an identifier is decoupled from the thing it identifies.

            This is a good way to frame the problem; one way to put my concern is that I fear we are coupling Exposures with an identifier that is only associated with that object in a particular context (a particular pipeline configuration with a particular ID-packing algorithm) that is implicit, or worse, the identifier is actually associated with some other related object (e.g. the raw this calexp was built from) and isn't appropriate for that Exposure at all.  I'm only slightly bothered by not coupling related things, and much more bothered by coupling unrelated things.

            But I think the only uses of detector_exposure_id are via the badly-named number it stuff in VisitInfo, and I don't actually know what uses that.

            If the answer doesn't include "generate source and object IDs", then I have less of a problem with it; as I mentioned in my last post, it's the "how many bits could this consume" question that brings in the need for extra state/context.

            Show
            jbosch Jim Bosch added a comment - I always worry when an identifier is decoupled from the thing it identifies. This is a good way to frame the problem; one way to put my concern is that I fear we are coupling Exposures with an identifier that is only associated with that object in a particular context (a particular pipeline configuration with a particular ID-packing algorithm) that is implicit, or worse, the identifier is actually associated with some other related object (e.g. the raw this calexp was built from) and isn't appropriate for that Exposure at all.  I'm only slightly bothered by not coupling related things, and much more bothered by coupling unrelated things. But I think the only uses of detector_exposure_id are via the badly-named number it stuff in VisitInfo , and I don't actually know what uses that. If the answer doesn't include "generate source and object IDs", then I have less of a problem with it; as I mentioned in my last post, it's the "how many bits could this consume" question that brings in the need for extra state/context.

              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Clare Saunders, Jim Bosch, John Parejko, Kian-Tat Lim, Meredith Rawls, Tim Jenness, Yusra AlSayyad
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:
                Planned End:

                  Jenkins

                  No builds found.