Details
- Type: RFC
- Status: Proposed
- Resolution: Unresolved
- Component/s: DM
- Labels: None
Description
I learned today that an otherwise highly desirable effort to standardize image file formats and ingest code paths between the real LSSTCam and PhoSim has also resulted in the elimination of the separate Instrument key (in the Butler dimension universe) for PhoSim data.
I think this last change was an error and would like to suggest that we reconsider.
From specific past experience, and on general principles, when a project has a highly detailed simulation of its raw data available, it's desirable for the processing of the simulated data to follow the same code paths as those for the real data as much as possible.
E.g., in the BaBar HEP experiment we had a code-review rule that essentially banned "if (simulation) { } else { }" sorts of logic except in very narrowly prescribed contexts.
So the elimination of differences in ingest and instrument-handling code between LSSTCam and PhoSim is in general a very good thing.
But another lesson from past experience is that there are often cases where distinguishing the two is necessary - at which point it's advisable to make the distinction as visible and explicit as possible, again to facilitate code review and operational awareness.
Important exceptions that I've encountered are:
- It's necessary to prevent simulated data from being inadvertently pulled into analyses together with real data, a.k.a. "even if the code can't tell the difference, humans (and production control / campaign management systems) must be able to";
- It's generally undesirable and/or impractical to arrange the simulation so that it uses the actual calibration data from the real system - e.g., the same flats.
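To make the first exception concrete, here is a minimal sketch in plain Python of the kind of explicit guard I have in mind. It is not the Butler API: the `DataId` class, the `require_single_instrument` helper, and the instrument name "LSSTCam-PhoSim" are all assumptions made up for illustration. The point is only that an explicit instrument label lets tooling refuse a mixed input set without the pipeline algorithms themselves ever branching on real versus simulated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataId:
    """Hypothetical, simplified stand-in for a data ID (not the Butler class)."""
    instrument: str
    exposure: int
    detector: int


class MixedInstrumentError(ValueError):
    """Raised when an input set spans more than one instrument."""


def require_single_instrument(data_ids):
    """Return the one instrument shared by all data IDs, or raise.

    Returns None for an empty input set.
    """
    instruments = {d.instrument for d in data_ids}
    if len(instruments) > 1:
        raise MixedInstrumentError(
            f"inputs span multiple instruments: {sorted(instruments)}"
        )
    return next(iter(instruments), None)


if __name__ == "__main__":
    real = [DataId("LSSTCam", 1, 0), DataId("LSSTCam", 2, 0)]
    sim = [DataId("LSSTCam-PhoSim", 3, 0)]  # assumed name for the simulated instrument
    print(require_single_instrument(real))
    try:
        require_single_instrument(real + sim)
    except MixedInstrumentError as err:
        print("refused:", err)
```

A check like this would live in campaign-management or submission tooling rather than in the science code, which keeps it visible to humans and out of the algorithms.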
In past projects, especially in HEP, I've sometimes encountered the trick of putting the simulated data in the past, or the far distant future, so that time is the only axis on which the two are separated and assigned different calibrations. (E.g., in BaBar we put the simulated data in the 1970s; but even then, we explicitly banned any "if ( date < 1998 ) { treat simulated data differently }" code.)
But for Rubin I think that's not an option - since the real universe is time-varying, there is a good case for permitting the simulated data to live on the same dates as the real data. This facilitates, for instance, the use of the real observing schedule to drive future simulations.
Having said all that as preamble:
I think it makes sense to maintain "Instrument" as a Butler dimension that can distinguish simulated data from real LSSTCam data.
Without this, it seems like the only other tool for keeping the calibrations disentangled is to mandate some distinctive pattern of use of collection names. This seems awkward and artificial, and difficult to enforce over time.
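As a rough illustration of why the instrument key helps here, the following toy lookup (plain Python, not the Butler calibration machinery; the table contents, the `find_flat` helper, and the instrument name "LSSTCam-PhoSim" are hypothetical) keys calibrations on instrument as well as on validity range. Real and simulated data can then share the same observation date and still resolve to different flats, with no date-based tricks and no naming conventions to police.

```python
from datetime import date

# Hypothetical calibration table: (instrument, detector) -> validity ranges.
# Each entry is (valid_from, valid_to, label); all values are made up.
CALIBS = {
    ("LSSTCam", 0): [
        (date(2025, 1, 1), date(2025, 12, 31), "flat-real-2025"),
    ],
    ("LSSTCam-PhoSim", 0): [
        (date(2025, 1, 1), date(2025, 12, 31), "flat-sim-2025"),
    ],
}


def find_flat(instrument, detector, obs_date):
    """Return the flat label valid for this instrument, detector and date."""
    for valid_from, valid_to, label in CALIBS.get((instrument, detector), []):
        if valid_from <= obs_date <= valid_to:
            return label
    raise LookupError(
        f"no flat for {instrument} detector {detector} on {obs_date}"
    )


if __name__ == "__main__":
    night = date(2025, 6, 15)  # same date for real and simulated data
    print(find_flat("LSSTCam", 0, night))
    print(find_flat("LSSTCam-PhoSim", 0, night))
```

Note that the lookup code is identical for both instruments; only the data ID differs, which is the property we want to preserve in pipeline code.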
I don't think the other alternative, demanding that real and simulated data never be mixed in the same Butler repository, is realistic - users may well have reasons to do this in their own personal repositories and collections, even if we don't do it in the centrally-maintained repos/collections.
Of course, it will have to become a design rule, enforced in reviews etc., that the Instrument not be used to "cheat" on the real/sim distinction in pipeline code.
(I suspect this RFC will require CCB validation, if it's not shot down in flames well before then.)
Attachments
Issue Links
I very much agree with this for all similar future cases, and I regret not pushing back harder when I first heard of the plan to pass off PhoSim data as if it were real LSSTCam data.
But I don't have a great sense of how hard it would be to fix it now. I might be able to provide some of that picture, but I don't know how much of this data is in the wild or how it's labeled or used at present, and even if I did, I'd have to think about the details of what a migration for this might look like.