Uploaded image for project: 'Request For Comments'
  1. Request For Comments
  2. RFC-902

Revert to separate "Instrument" for simulated LSSTCam data

    XMLWordPrintable

    Details

    • Type: RFC
    • Status: Proposed
    • Resolution: Unresolved
    • Component/s: DM
    • Labels:
      None

      Description

      I learned today that an otherwise highly desirable effort to standardize image file formats and ingest code paths between the real LSSTCam and PhoSim had also ended up in the separate Instrument key (in the Butler dimension universe) for PhoSim data being eliminated.

      I think this last change was an error and would like to suggest that we reconsider.

      From specific past experience, and on general principles, when a project has a highly detailed simulation of its raw data available, it's desirable for the processing of the simulated data to follow the same code paths as those for the real data as much as possible.

      E.g., in the BaBar HEP experiment we had a code-review rule that essentially banned "if (simulation) { } else { }" sorts of logic except in very narrowly prescribed contexts.

      So the elimination of differences in ingest and instrument-handling code between LSSTCam and PhoSim is in general a very good thing.

      But another lesson from past experience is that there often are some cases where distinguishing the two is necessary - at which point it's advisable to make that as visible and explicit as possible, again to facilitate code review and operational awareness.

      Important exceptions that I've encountered are:

      • It's necessary to prevent simulated data from being inadvertently pulled into analyses together with real data a/k/a "even if the code can't tell the difference, humans (and production control / campaign management systems) must be able to";
      • It's generally undesirable and/or impractical to arrange the simulation so that it uses the actual calibration data from the real system - e.g., the same flats.

      Sometimes in past projects, especially in HEP, I've encountered a trick used of putting the simulated data in the past, or the far distant future, using only time as the axis on which to separate the two and assign different calibrations to them. (E.g., in BaBar we put the simulated data in the 1970s; but even then, we explicitly banned any "if ( date < 1998 ) { treat simulated data differently }" code.)

      But for Rubin I think that's not an option - since the real universe is time-varying, there is a good case for permitting the simulated data to live on the same dates as the real data. This facilitates, for instance, the use of the real observing schedule to drive future simulations.

      Having said all that as preamble:

      I think it makes sense to maintain "Instrument" as a Butler dimension that can distinguish simulated data from real LSSTCam data.

      Without this, it seems like the only other tool for keeping the calibrations disentangled is to mandate some distinctive pattern of use of collection names. This seems awkward and artificial, and difficult to enforce over time.

      I don't think it's realistic to demand another alternative, that real and simulated data never be mixed in the same Butler repository - users may well have reasons to do this in their own personal repositories and collections, even if we don't do it in the centrally-maintained repos/collections.

      It will have to become a design rule that is enforced in reviews, etc., that the Instrument not be used to "cheat" on the real/sim distinction in pipeline code, of course.

      (I suspect this RFC will require CCB validation, if it's not shot down in flames well before then.)

        Attachments

          Issue Links

            Activity

            Hide
            jbosch Jim Bosch added a comment - - edited

            I very much agree with this for all similar future cases, and I regret not pushing back harder when I first heard of the plan to pass off PhoSim data as if it were real LSSTCam data.

            But I don't have a great sense for how hard it would be to fix it now. I might be able to provide some of that picture, but I don't know how much of this data is in the wild and how it's labeled or used at present, and even I did I'd have to think about the details of what a migration for this might look like.

            Show
            jbosch Jim Bosch added a comment - - edited I very much agree with this for all similar future cases, and I regret not pushing back harder when I first heard of the plan to pass off PhoSim data as if it were real LSSTCam data. But I don't have a great sense for how hard it would be to fix it now. I might be able to provide some of that picture, but I don't know how much of this data is in the wild and how it's labeled or used at present, and even I did I'd have to think about the details of what a migration for this might look like.
            Hide
            Parejkoj John Parejko added a comment -

            I also strongly agree: phoSim data should not be labeled as LSSTCam data. Using the butler Instrument dataId to treat different data differently is something we shouldn't be doing anywhere anyway: all differences between instruments should be taken care of by astro_metadata_translator when the raws are read in.

            Show
            Parejkoj John Parejko added a comment - I also strongly agree: phoSim data should not be labeled as LSSTCam data. Using the butler Instrument dataId to treat different data differently is something we shouldn't be doing anywhere anyway: all differences between instruments should be taken care of by astro_metadata_translator when the raws are read in.
            Hide
            tjenness Tim Jenness added a comment -

            Once PhoSim data matched the LSSTCam header specification it seemed natural at the time to make it act like LSSTCam data. The CONTROLLER was different (an H) so it was clear that this data were phosim and not real but at the time we hadn't really thought through the butler ramifications for doing this in terms of unfortunate clashes of day_obs/seq_num tuples potentially in the future. For CCS data we have decided that we can't even afford that ambiguity in real data and are working on making day_obs/seq_num unique regardless of O or C controller.

            I fully support tweaking the translators such that PhoSim data will still use the LSSTCam translator code (the old PhoSim headers were incompatible) but ends up as the LSSTCam-PhoSim instrument. The easiest way to "migrate" would be to reingest the data whereby they would turn up in a completely different place – it's not worth trying to write a migration script to convert the exposure records over to the PhoSim form (I'm assuming we don't have petabytes of PhoSim data to reingest).

            Exposure records do now have a flag to indicate that the data are simulated but you still can't tell the difference between fully simulated (PhoSim) versus partially simulated real data (we lied about the telescope moving).

            Show
            tjenness Tim Jenness added a comment - Once PhoSim data matched the LSSTCam header specification it seemed natural at the time to make it act like LSSTCam data. The CONTROLLER was different (an H) so it was clear that this data were phosim and not real but at the time we hadn't really thought through the butler ramifications for doing this in terms of unfortunate clashes of day_obs/seq_num tuples potentially in the future. For CCS data we have decided that we can't even afford that ambiguity in real data and are working on making day_obs/seq_num unique regardless of O or C controller. I fully support tweaking the translators such that PhoSim data will still use the LSSTCam translator code (the old PhoSim headers were incompatible) but ends up as the LSSTCam-PhoSim instrument. The easiest way to "migrate" would be to reingest the data whereby they would turn up in a completely different place – it's not worth trying to write a migration script to convert the exposure records over to the PhoSim form (I'm assuming we don't have petabytes of PhoSim data to reingest). Exposure records do now have a flag to indicate that the data are simulated but you still can't tell the difference between fully simulated (PhoSim) versus partially simulated real data (we lied about the telescope moving).
            Hide
            ksuberlak Krzysztof Suberlak added a comment -

            For one, the simulated data usually lived in their own repo (I do not recall ever ingesting phosim data to eg. `/repo/main`), so I have not encountered this issue of data "clashing" . What is the specific use case that was troublesome? 

            Show
            ksuberlak Krzysztof Suberlak added a comment - For one, the simulated data usually lived in their own repo (I do not recall ever ingesting phosim data to eg. `/repo/main`), so I have not encountered this issue of data "clashing" . What is the specific use case that was troublesome? 
            Hide
            tjenness Tim Jenness added a comment -

            I think that requiring distinct repositories for phosim and imsim data is a good solution but the design can't require this – it has to be an agreed upon convention. In the general case if someone did, by mistake, ingest some phosim data into a butler then it becomes immediately apparent that confusion is possible unless care is taken to ensure that the "has_simulated" flag is always used. We also have to be very careful about calibration datasets being placed into their own phosim collection because the instrument name would not be an automatic disambiguation and if there is any chance of a phosim observation looking like it came from the same night as other data then that could also be problematic.

            I think Gregory Dubois-Felsmann is treating PhoSim as an example with an eye on this becoming a formal recommendation for butler datasets that are aiming to closely simulate real instruments. Requiring that a distinct component of the dataId be unique for simulated data removes any possibility of confusion.

            Show
            tjenness Tim Jenness added a comment - I think that requiring distinct repositories for phosim and imsim data is a good solution but the design can't require this – it has to be an agreed upon convention. In the general case if someone did, by mistake, ingest some phosim data into a butler then it becomes immediately apparent that confusion is possible unless care is taken to ensure that the "has_simulated" flag is always used. We also have to be very careful about calibration datasets being placed into their own phosim collection because the instrument name would not be an automatic disambiguation and if there is any chance of a phosim observation looking like it came from the same night as other data then that could also be problematic. I think Gregory Dubois-Felsmann is treating PhoSim as an example with an eye on this becoming a formal recommendation for butler datasets that are aiming to closely simulate real instruments. Requiring that a distinct component of the dataId be unique for simulated data removes any possibility of confusion.
            Hide
            gpdf Gregory Dubois-Felsmann added a comment -

            Tim is correct in the immediately preceding interpretation of my position.

            Krzysztof Suberlak, there was no actual collision yet - I'm trying to head this off in advance, based on a lot of closely related past experience, before the current situation has been around for too long to change.

            Show
            gpdf Gregory Dubois-Felsmann added a comment - Tim is correct in the immediately preceding interpretation of my position. Krzysztof Suberlak , there was no actual collision yet - I'm trying to head this off in advance, based on a lot of closely related past experience, before the current situation has been around for too long to change.
            Hide
            tjenness Tim Jenness added a comment -

            Gregory Dubois-Felsmann this sounds like a good idea to me so I would be happy for the RFC to be adopted. It will require some work in obs_lsst to deal with phosim looking like LSSTCam and I believe is in agreement with James Chiang's plans for imsim.

            Show
            tjenness Tim Jenness added a comment - Gregory Dubois-Felsmann this sounds like a good idea to me so I would be happy for the RFC to be adopted. It will require some work in obs_lsst to deal with phosim looking like LSSTCam and I believe is in agreement with James Chiang 's plans for imsim.

              People

              Assignee:
              gpdf Gregory Dubois-Felsmann
              Reporter:
              gpdf Gregory Dubois-Felsmann
              Watchers:
              Andrew Connolly, Gregory Dubois-Felsmann, James Chiang, Jim Bosch, John Parejko, Krzysztof Suberlak, Tim Jenness
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Dates

                Created:
                Updated:
                Planned End:

                  Jenkins

                  No builds found.