Uploaded image for project: 'Request For Comments'
  1. Request For Comments
  2. RFC-917

Data ID -> integer packing for Rubin instruments and related code removals

    XMLWordPrintable

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      This is a follow-up to RFC-785, which involves a mechanism for making data ID to integer packing schemes more configurable and less dependent on hard-to-change Butler database values.  The implementation of that RFC is nearly complete (on DM-31924), but at the same time I'm building on that to address the fact that our current packing scheme for Rubin instruments does not leave enough room for source IDs.

      In this RFC, I'd like to get feedback my new scheme for Rubin data ID packing and highlight some code changes that RFC-785 implied without spelling out, since things are now much more concrete.

      New packing algorithm for exposure+detector or visit+detector data IDs

      1. Split the given exposure or visit ID into day_obs, seq_num, and controller code.
      2. Convert the controller code to an integer via the fixed sequence "OCHPQ" (nothing new so far)
      3. Convert day_obs into a date ordinal and subtract an arbitrary offset (currently corresponding to 2010-01-01), recovering the space previously occupied by months 13-99 and days ~32-99.
      4. Assign the lowest byte (8 bits) to the detector ID (leaving room for 256). Compare to previous upper bound of 999 (for decimal human friendliness).
      5. Assign the first 14 bits to seq_num (=max of 16384, or one readout every 5.27s). Compare to previous upper bound of 99999.
      6. Assign the next 14 bits to the ordinal day (16384 days is 45 years and change, taking us to 2055 as our maximum date). Compare to previous max of 99999999 (for decimal YYYYMMDD)
      7. Assign the next 3 bits to the controller code (leaving space for 8, i.e. 3 more than the 5 we have now).
      8. Reserve one final bit for visit IDs in which the visit is the one-to-one reinterpretation of the first exposure in a multi-snap sequence as its own visit (in all other cases the visit and exposure IDs are the same).

      This sums to 40 bits reserved for the packed data ID, leaving 23 bits for a per-data ID source counter (for historical FITS reasons we are limited to signed int64), which means 16 million detections on a CCD. That's more then enough, which is good because I think we might in the future also want to reserve another 4 bits for a data release index, leaving 19 bits = 524k for sources.

      This packing scheme is designed to be at least somewhat interpretable in hex, but mostly I expect users to unpack it via code rather than their eyes.

      I currently envision having most Rubin instruments (all LSSTCam variants and simulations, LATISS, ComCam) use the same numbers, because I think having consistency across those instruments is more important than freeing up a few detector bits with LATISS (or ComCam). I am not planning to adopt this scheme for TS3, TS8, and UCDCam, as these already have specializations to how their exposure IDs are computed from (day_obs, seq_num) due to various idiosyncracies.

      Where to use these, and what to do with astro_metadata_translator's packed IDs

      This packing scheme has been designed to solve problems in the common case where we want to pack a visit+detector data ID and a per-data ID source counter together. But we also have cases where a packed visit+detector data ID integer exists in its own right:

      • afw.image.ExposureInfo objects carry such an ID. This is not to be confused with VisitInfo's ID, which no longer corresponds to a particular detector. These are added by the butler raw formatter, delegating to the instrument's translator class, and from there propagated to all Exposure objects downstream of raw.
      • Some of our source-like tables have a separate ccdVisitId column to serve as a foreign key to the CcdVisit table (see this slack thread).

      I propose we replace these IDs with the new scheme, too - or, more precisely, modify the code that produces them to use the RFC-785 mechanisms that will lead to the new scheme for Rubin instruments, but make no change for other instruments - because I think it's more important to have only one kind of packed visit+detector ID for a particular instrument than it is to have some of them be more human-readable.

      If we do switch these to the new RFC-785 mechanism, we could then deprecate and remove the "exposure_detector_id" methods and attributes from astro_metadata_translator and the various translator implementations. I propose we just remove them without deprecation, as any other code using them now probably shouldn't ever have been, and definitely shouldn't be now. But this would be a big disruptive refactor for relatively gain: while this would leave some duplication of important logic, the behavior of that logic is already quite transparent and carefully change-controlled, and this strongly mitigates the biggest problems that usually come with duplication.

      I am not planning to try to fix any old packed data IDs in data repositories. None of them should be in the butler registry or raws (since they're added on read to the latter), so we're just talking about IDs present in processing outputs, of which virtually all will eventually get superseded.

      RFC-785 specifics

      RFC-785 originally proposed removing the lsst.daf.butler.DimensionPacker class and all methods involving them, but I've found it just as effective (and less disruptive) to keep that ABC while deprecating the factory methods (DataCoordinate.pack and DimensionUniverse.makePacker) and implementations of DimensionPacker and aspects that aren't compatible with making the behavior configurable.

      I'm also deprecating lsst.obs.base.ExposureIdInfo, which was closely tied to those interfaces and isn't really compatible with being extended to support unpacking these IDs back into data IDs, which I really wanted to add. Its replacement will be lsst.meas.base.IdGenerator, which does many of the same things more conveniently and flexibly (and also does unpacking).

      Removing "group" metadata visit definitions

      We still have "by-group-metadata" visit-system implementation defined in lsst.obs.base that has been wholly superseded by the "one-to-one-and-by-counter" system after RFC-836, and I believe it's a dangerous trap that, if ever used, would lead to many regrets and data repository admin cleanup work. We should remove it without deprecation.

      The obs_lsst translators also reference this scheme in some methods that should similarly be removed - at least to_visit_id. In fact, I think it's now time to remove any references to visits from astro_metadata_translator, as we want to centralize visit definition in obs_base, instead of having any translator guess in advance what the configured visit system used later might be.

        Attachments

          Issue Links

            Activity

            Hide
            jbkalmbach Bryce Kalmbach added a comment -

            Jim Bosch, the data we are using for AOS testing now with the Commissioning scientists all have dates post-2010 so I don't think there's any need to worry about us.

            Show
            jbkalmbach Bryce Kalmbach added a comment - Jim Bosch , the data we are using for AOS testing now with the Commissioning scientists all have dates post-2010 so I don't think there's any need to worry about us.
            Hide
            jbosch Jim Bosch added a comment - - edited

            The packer is getting the controller by unpacking the exposure_id or visit_id (I've added a new method to the translators to do that unpacking, so it's adjacent to the packing code). There are a few important cases where it'd be problematic to demand an expanded full data ID in order to pack it (as opposed to just the dimension record for the instrument dimension, which we already require), so I don't use the day_obs and seq_num from dimension records even when they might be available.

            You may be right that we don't need the controller for real data, but I can't make the number of bits depend on anything other than the Instrument given the interfaces we have, so even if some controllers don't need to free up bits for sources, I can't really do that.

            Show
            jbosch Jim Bosch added a comment - - edited The packer is getting the controller by unpacking the exposure_id or visit_id (I've added a new method to the translators to do that unpacking, so it's adjacent to the packing code). There are a few important cases where it'd be problematic to demand an expanded full data ID in order to pack it (as opposed to just the dimension record for the instrument dimension, which we already require), so I don't use the day_obs and seq_num from dimension records even when they might be available. You may be right that we don't need the controller for real data, but I can't make the number of bits depend on anything other than the Instrument given the interfaces we have, so even if some controllers don't need to free up bits for sources, I can't really do that.
            Hide
            ktl Kian-Tat Lim added a comment -

            The DM-CCB had no objections to this. Jim should still consult with the Camera team to make sure that 16384 images per day is sufficient.

            Show
            ktl Kian-Tat Lim added a comment - The DM-CCB had no objections to this. Jim should still consult with the Camera team to make sure that 16384 images per day is sufficient.
            Hide
            jbosch Jim Bosch added a comment -

            Some discussion on Slack has converged to the camera team wanting one more bit for the sequence number, i.e. an upper bound of 32k.

            That leaves the packed data ID at 41 bits, which with 4 bits allocated to DRs leaves 18 for the counter, i.e. 262k sources/detector.  I think that's fine, and even giving the release counter 5 bits is probably fine (131k sources/detector), but I think it's also prudent to reserve some bits for the unexpected. Right now I'm not going to allocate any bits to the release ID on main anyway, I'm making it configurable, but that's all, so we can make that decision later.

            Show
            jbosch Jim Bosch added a comment - Some discussion on Slack has converged to the camera team wanting one more bit for the sequence number, i.e. an upper bound of 32k. That leaves the packed data ID at 41 bits, which with 4 bits allocated to DRs leaves 18 for the counter, i.e. 262k sources/detector.  I think that's fine, and even giving the release counter 5 bits is probably fine (131k sources/detector), but I think it's also prudent to reserve some bits for the unexpected. Right now I'm not going to allocate any bits to the release ID on main anyway, I'm making it configurable, but that's all, so we can make that decision later.
            Hide
            jbosch Jim Bosch added a comment -

            We've decided not to proceed with removing the group-by-metadata visit definition scheme (which the Rubin instruments are no longer using). The rest of this has already been implemented.

            Show
            jbosch Jim Bosch added a comment - We've decided not to proceed with removing the group-by-metadata visit definition scheme (which the Rubin instruments are no longer using). The rest of this has already been implemented.

              People

              Assignee:
              jbosch Jim Bosch
              Reporter:
              jbosch Jim Bosch
              Watchers:
              Andrew Connolly, Bryce Kalmbach, Eric Bellm, James Chiang, Jim Bosch, John Parejko, Kian-Tat Lim, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:
                Planned End:

                  Jenkins

                  No builds found.