RFC-785

Move data ID -> integer packing state and logic out of butler


    Details

    • Type: RFC
    • Status: Adopted
    • Resolution: Unresolved
    • Component/s: DM
    • Labels:
      None

      Description

      The Gen3 butler followed its predecessor in taking responsibility for packing data IDs of certain types deterministically and reversibly into integer IDs (and for reporting the number of bits those IDs require), which our code uses both to mangle into integer source IDs and as random number seeds.  The implementation is primarily in the DimensionPacker classes in daf_butler, with some derived classes in the skymap package. There is also logic in astro_metadata_translator to compute a detector_exposure_id integer, which is kept consistent with the DimensionPacker implementation for the same combination only via constant vigilance.

      For these kinds of packers to work, they need to know the ranges of values various dimensions can take, and that is often problematic.  For example:

      • We've baked some pretty arbitrary guesses at maximum values for visit and exposure IDs into the instrument dimension records that are stored in the butler (making them quite hard to change).
      • The packer for (tract, patch, band) data IDs lives in skymap, and that needs to know - in advance, hard-coded - the full list of all bands anyone might ever use.  That's clearly not practical, and while I intend to just keep extending that list as needed for now (DM-29944 is what prompted this RFC), it's just a bad situation to be in.

      Moreover, because the state for these packers (e.g. maximum values) is either in the data repository or fixed in the code, we cannot change it without breaking our ability to unpack IDs in existing on-disk datasets.

      To deal with this, I propose that we:

      1. Create new data ID packer classes whose state is provided exclusively by pex_config Configuration.  Task code that wishes to use these would nest the packer's configuration within its own, and obs_ package configs and pipeline definitions would be used to provide instrument- or dataset-specific defaults for those ranges of allowable values, giving task runners similar levels of convenience (see the sketch after this list).
      2. The DimensionPacker hierarchy in daf_butler and the APIs that access it would be deprecated and removed (deprecation only after the new classes are available; removal a release cycle later, as usual).
      3. VisitInfo.getExposureId (or any post-RFC-459 accessor for a combined visit+detector or exposure+detector integer ID) will be deprecated and ultimately removed; algorithm code that wants such an ID should instead obtain it via a separate parameter, whose value can be provided by the new configuration-based packers. The value formerly held by VisitInfo.getExposureId will move to Exposure.id, and will be an opaque ID appropriate for just that image, usually generated in a way that is consistent with one of the new config-driven ID packers (hopefully populated by actually using them, at least most of the time). VisitInfo.id will be created as an opaque ID holding either the exposure or visit ID recognized by the butler (as appropriate for what that image represents).
      4. astro_metadata_translator.ObservationInfo.detector_exposure_id will be deprecated and ultimately removed.
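
      A minimal sketch of what item 1 could look like, under assumptions: all of the packer's state comes from pex_config fields, and task configs nest it via ConfigField so that obs_ package overrides and pipeline definitions can supply instrument- or dataset-specific ranges. The class and field names below are hypothetical, not the eventual implementation.

          import lsst.pex.config as pexConfig


          class ObservationPackerConfig(pexConfig.Config):
              """Configuration for packing (observation, detector) data IDs into integers."""

              n_observations = pexConfig.Field(
                  dtype=int, default=2**30,
                  doc="Maximum number of exposures/visits; sets the observation bit width.",
              )
              n_detectors = pexConfig.Field(
                  dtype=int, default=256,
                  doc="Number of detectors per observation; sets the detector bit width.",
              )


          class ObservationPacker:
              """Pack and unpack (observation_id, detector) pairs, driven entirely by
              configuration rather than by data-repository state.
              """

              ConfigClass = ObservationPackerConfig

              def __init__(self, config):
                  self.config = config
                  self._detector_bits = (config.n_detectors - 1).bit_length()

              @property
              def maxBits(self):
                  # Total bits a packed ID can occupy; needed by source-ID generators.
                  return self._detector_bits + (self.config.n_observations - 1).bit_length()

              def pack(self, observation_id, detector):
                  return (observation_id << self._detector_bits) | detector

              def unpack(self, packed):
                  detector_mask = (1 << self._detector_bits) - 1
                  return packed >> self._detector_bits, packed & detector_mask


          class ExampleTaskConfig(pexConfig.Config):
              """A task config nesting the packer config, so the usual obs_ config
              overrides can change the ranges without touching the data repository.
              """

              id_packer = pexConfig.ConfigField(
                  dtype=ObservationPackerConfig,
                  doc="Configuration for the data ID -> integer packer.",
              )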

      In addition, the instrument dimension columns exposure_max and visit_max may be removed in a future schema migration (but this would be accompanied by its own RFC, like any other migration).

            Activity

            Parejkoj John Parejko added a comment -

            If I'm following #3 correctly, this means that there will be no numeric identifier attached to an afw Exposure object?

            tjenness Tim Jenness added a comment -

            It might be a good time for an Exposure to store its exposure_id and detector as distinct metadata and let the packer worry about combining it.

            jbosch Jim Bosch added a comment - edited

            If I'm following #3 correctly, this means that there will be no numeric identifier attached to an afw Exposure object?

            Yes, that was my intent.  I am willing to give up on this part of the proposal if necessary to get consensus on the rest, and I'm certainly happy to make this change on a slower timescale than switching the tasks to use something configurable.  I do want to remove the ability to set this value to anything valid in raws, though, because I think configuration needs to be available as an input to the calculation (which would definitely complicate setting it if it was still on VisitInfo).  And I think passing those IDs into the code that needs them separately (rather than through Exposure) is better - it makes the provenance and meaning of the number a lot clearer - but I'm open to the argument that it'd be too disruptive to be worthwhile to remove them from Exposure entirely.

            ktl Kian-Tat Lim added a comment -

            For what it's worth, I think of Gen2 as being agnostic about integer identifiers. They're provided on ingest and users can provide matching ones. The ability of a mapper to provide a function that returns one (as a dataset) for a given dataId and to return the bit width is an external convention, in my book, not part of the Butler API. We never got around to having anything in the Gen2 Butler API or the mapper interface that would reliably unpack them. So moving packing outside the Butler is fine with me, although I think the Butler should still be able to use packed identifiers (see below).

            I wouldn't necessarily characterize the limits on parts of the LSST exposure/visit identifiers as "arbitrary"; I think they were thought through and required to fit things in readily-processed 64 bits. I do agree that storing those in instrument dimension records, as well as in two different places in the code, makes things difficult to change.

            I don't really understand how the Config-based packer/unpacker will avoid the evolution issues that the current code-based packer/unpacker has. Even if we store the Configs somewhere so that they can be retrieved in the future, won't it still be difficult to determine which Config was used to produce a given integer identifier if there is more than one?

            If the goal is to instead not use an integer identifier at all, then I think that is problematic. While we have been pushing multipart dataIds for human use, I still see a variety of uses for an integer identifier (and one that ideally a human can decompose visually, not for example a pure hash) as a database key, both inside the Butler and in other science-oriented tables. And I do think such an identifier should be consistent for every use of or reference to the same dataset, which argues for it being assigned as early as possible.

            One way around this would be to add a level of indirection by using a couple of bits in the integer identifier as a version field. Unfortunately, I don't think we have bits to spare.

            Another way could be to avoid the two problematic interfaces: unpacking and packing-after-initial-generation (possibly with different code/config). As long as different packing algorithm versions never collide, this should be viable, but it may be too much of a restriction.

            If you have some way of reliably determining the Config used for a given identifier, however, then (almost?) all of the above goes away.

            jbosch Jim Bosch added a comment -

            I wouldn't necessarily characterize the limits on parts of the LSST exposure/visit identifiers as "arbitrary"; I think they were thought through and required to fit things in readily-processed 64 bits. I do agree that storing those in instrument dimension records, as well as in two different places in the code, makes things difficult to change.

            Oh, agreed - I was thinking more of the limits for HSC and DECam, which I think received much less thought.  Though I think it's also true that even after all of the careful thought that went into the LSST ones, we're still at risk of running out of bits there (or at least don't have any to spare).

            I don't really understand how the Config-based packer/unpacker will avoid the evolution issues that the current code-based packer/unpacker has. Even if we store the Configs somewhere so that they can be retrieved in the future, won't it still be difficult to determine which Config was used to produce a given integer identifier if there is more than one?

            Being able to determine the packer used to create a particular dataset is a lot of what draws me to config: we already save the configuration used in a processing run, and in the future the provenance chain between a dataset and the configuration used to get it will only get better.  I do want to allow the packing algorithms to evolve (I don't want to encourage them to evolve, but I think we need more room for evolution than we currently have), and to get that I'm willing to say that the safe way to get an unpacker for the IDs in a dataset has to involve a class or function that relies on that provenance.

            That does mostly rule out using packed integer identifiers inside Butler and perhaps many other databases; it's a solution for Task code, not other things.  If we want to have some data ID packing still alive at lower levels (e.g. astro_metadata_translator) for those use cases, I think the key thing would be getting them out of the business of being used in Source/Object IDs, and hence having to know the maximum number of bits they consume or trying to leave room for per-data ID counters.  In all of the cases I can think of, most of the numbers in play (detectors, tracts, patches) are pretty naturally bounded already within the domain where we care about uniqueness (a particular instrument or skymap), and there's only one number in each of the important sets of dimensions that's hard to bound, at least when bits are scarce (mostly those related to filters or observations).
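
            As a rough illustration of the bit-budget arithmetic behind that concern (the limits below are made up for the example, not any instrument's real numbers):

                def bits_needed(max_value):
                    """Bits required to represent values in [0, max_value]."""
                    return max_value.bit_length()

                detector_bits = bits_needed(250)       # detectors: naturally bounded per instrument
                observation_bits = bits_needed(2**26)  # exposures/visits: much harder to bound
                # Whatever is left in a 64-bit integer is room for a per-image source counter.
                counter_bits = 64 - detector_bits - observation_bits
                print(detector_bits, observation_bits, counter_bits)  # 8 27 29 -> ~5e8 sources per image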

            Parejkoj John Parejko added a comment - edited

            I do want to remove the ability to set this value to anything valid in raws, though, because I think configuration needs to be available as an input to the calculation (which would definitely complicate setting it if it was still on VisitInfo). And I think passing those IDs into the code that needs them separately (rather than through Exposure) is better.

            I always worry when an identifier is decoupled from the thing it identifies. In this case, if said identifier is constructable from the contents of ExposureInfo (is it?) and not attached to an Exposure, then we need to make it trivial for a user to pass an ExposureInfo into some object to get the correct identifier, or always pass the id with the Exposure.

            That implies that we really do need to get visitId and/or "full focal plane exposure id" (as yet unnamed?) into VisitInfo, because otherwise we don't have enough information attached to an Exposure to construct the detector_exposure_id.

            I think of the detector_exposure_id as being something attached to a raw on ingest, that is then propagated down into the subsequent data products from that raw. If it no longer has that purpose (and already, `dataId={"detector":X, "visit":Y, "instrument":Z}` mostly accomplishes that goal for the purposes I can think of), what purpose do we have for detector_exposure_id? Do we need it to construct sourceIds, or can we just make those from the dataId contents above (dropping instrument, since I don't see why sourceId should be consistent across instruments)? Can we get a listing of all of the uses of `detector_exposure_id`?

            jbosch Jim Bosch added a comment -

            I always worry when an identifier is decoupled from the thing it identifies.

            This is a good way to frame the problem; one way to put my concern is that I fear we are coupling Exposures with an identifier that is only associated with that object in a particular context (a particular pipeline configuration with a particular ID-packing algorithm) that is implicit, or worse, the identifier is actually associated with some other related object (e.g. the raw this calexp was built from) and isn't appropriate for that Exposure at all.  I'm only slightly bothered by not coupling related things, and much more bothered by coupling unrelated things.

            But I think the only uses of detector_exposure_id are via the badly-named getExposureId stuff in VisitInfo, and I don't actually know what uses that.

            If the answer doesn't include "generate source and object IDs", then I have less of a problem with it; as I mentioned in my last post, it's the "how many bits could this consume" question that brings in the need for extra state/context.

            Parejkoj John Parejko added a comment -

            After discussion with Jim Bosch, I think we agree that we will have these identifiers attached to an afw.Exposure, and no others:

            • VisitInfo.id : an exposure- (snap) or visit-level identifier of a full focal plane image. This corresponds to the id in either the exposure table or visit table in the butler (but which of those is opaque to the user).
            • Exposure.id : a generic identifier of an afw Exposure object. This could be a "detector exposure id", a "coadd patch id", or a "detector visit id", or something else, but is opaque to users and cannot be unpacked. This does not correspond to any identifier in a butler table.

            We do not believe we can provide guarantees about the maximum number of bits in either of these identifiers, because of their generic nature.

            Examples of current uses of "exposureId" and their replacements (with a brief usage sketch after this list) include:

            • grouping all Exposures in a visit: use VisitInfo.id.
            • seeding random number generators repeatably: use either Exposure.id or VisitInfo.id, depending on whether the seed is for a detector image or a full focal plane image.
            • identifying the image that a cutout or source came from (for example, a DiaSource): use VisitInfo.id and Detector.id.
            • constructing SourceIds: use neither of these identifiers, but rather one of the packers described in the RFC above.
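
            A short usage sketch of the replacements above, assuming the proposed Exposure.id and VisitInfo.id attributes exist as described (they are part of this proposal, not necessarily the current afw API):

                import numpy as np

                def process_detector_image(exposure):
                    visit_info = exposure.getInfo().getVisitInfo()
                    # Repeatable per-detector-image RNG seed from the opaque Exposure.id.
                    rng = np.random.default_rng(exposure.id)
                    # Identify the originating observation and detector, e.g. for a DiaSource.
                    observation_id = visit_info.id
                    detector_id = exposure.getDetector().getId()
                    return rng, observation_id, detector_id
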
            tjenness Tim Jenness added a comment -

            Escalating this given the extensive API changes and deprecations.

            mrawls Meredith Rawls added a comment -

            Thank you Jim and John for discussing this in depth and iterating to what seems like a good solution.

            I especially appreciate the examples for doing things like backing out the visit and detector that a source in a catalog came from.

            Please help me with another example. How should a user retrieve visit- or detector-level metadata when all one has is an Exposure object? I tend to think of postISRCCDs or calexps (to give two badly-named examples) interchangeably as "exposures" or "visits." When I use butler.get, I specify dimensions like visit and detector in order to get an Exposure object. But let's say I need my Exposure's airmass, which I am pretty sure is a VisitInfo-flavored thing. Today I would use, e.g., Exposure.getInfo.getVisitInfo.getAirmass (or something to that effect). What am I supposed to do instead?

            jbosch Jim Bosch added a comment - edited

            I have updated the ticket description to reflect what's on the table after my conversation with John Parejko.

            For Meredith Rawls

            How should a user retrieve visit- or detector-level metadata when all one has is an Exposure object? I tend to think of postISRCCDs or calexps (to give two badly-named examples) interchangeably as "exposures" or "visits." When I use butler.get, I specify dimensions like visit and detector in order to get an Exposure object. But let's say I need my Exposure's airmass, which I am pretty sure is a VisitInfo-flavored thing. Today I would use, e.g., Exposure.getInfo.getVisitInfo.getAirmass (or something to that effect). What am I supposed to do instead?

            You would continue to use VisitInfo just as you have so far; we're basically acknowledging that it (like afw.image.Exposure) has a name that's really badly inconsistent with the new butler dimension nomenclature, because sometimes it represents just a single exposure and sometimes it represents a visit, but we are not trying to fix the naming problem right now. If we can find some way to rename it to ObservationInfo (but that clashes with the pure-Python similar thing from astro_metadata_translator) or something similarly generic, that would be ideal, but that's a matter for a future RFC.
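
            For concreteness, a minimal example of that unchanged access path (the dataset type, data ID values, and pre-existing butler are placeholders here):

                calexp = butler.get("calexp", instrument="HSC", visit=12345, detector=42)
                airmass = calexp.getInfo().getVisitInfo().getBoresightAirmass()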


              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Clare Saunders, Jim Bosch, John Parejko, Kian-Tat Lim, Meredith Rawls, Tim Jenness, Yusra AlSayyad