Details
- Type: RFC
- Status: Implemented
- Resolution: Done
- Component/s: DM
- Labels: None
Description
In RFC-777 we discuss the proposal to change the main butler registries from using incrementing integer IDs to UUIDs. As part of that we need to decide how we ingest raw data. There is a desire for a raw file ingested into one repository to have the same UUID as it would have when ingested into a separate repository. This can significantly simplify provenance tracking when data processed at one location needs to be ingested into another independent repository where the raw files were ingested independently.
We will make the IDs predictable by using UUID5 (name-based UUIDs derived from a namespace and a name string). Three options are currently being discussed, and opinions on them are welcome:
1. Use just the DatasetType + DataID (instrument/detector/exposure).
2. Use the DatasetType + DataID and the run collection.
3. Have the raw ingest task itself generate the UUID5 using some other scheme. This scheme would have to be determined.
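As a rough illustration of how options 1 and 2 differ, here is a minimal sketch using Python's standard `uuid` module. The namespace `NS_EXAMPLE`, the helper `raw_uuid`, the key/value string layout, and the run name are all assumptions made for this example; they are not the scheme Butler actually uses.

```python
import uuid

# Hypothetical namespace for this sketch only; the namespace Butler uses
# is not stated in this RFC.
NS_EXAMPLE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def raw_uuid(dataset_type, data_id, run=None):
    """Return a deterministic UUID5 for a raw dataset.

    With run=None the name is built from DatasetType + DataID (option 1).
    Passing a run collection name adds it to the name (option 2).
    """
    parts = [f"dataset_type={dataset_type}"]
    if run is not None:
        parts.append(f"run={run}")
    # Sort the data ID keys so that dict ordering cannot change the result.
    parts.extend(f"{key}={data_id[key]}" for key in sorted(data_id))
    return uuid.uuid5(NS_EXAMPLE, ",".join(parts))


data_id = {"instrument": "DECam", "detector": 10, "exposure": 123456}

# Option 1: two independent ingests of the same raw agree on the ID,
# even if the data ID keys arrive in a different order.
opt1_a = raw_uuid("raw", data_id)
opt1_b = raw_uuid("raw", dict(reversed(list(data_id.items()))))

# Option 2: the run collection becomes part of the identity.
opt2 = raw_uuid("raw", data_id, run="DECam/raw/all")
```

In this sketch `opt1_a` and `opt1_b` are equal, which is the cross-repository property the RFC wants, while `opt2` differs from both because the run name is mixed into the hashed string.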
The first two options are currently supported directly by Butler. With UUID-based registries, any ingest can in theory come up with its own UUID5 scheme (this is how import works).
If option 1 is used, a registry can only ever hold one copy of a given raw, regardless of collection. This could be problematic if a user ingests some raw DECam data they have into a private collection and the observatory then attempts to ingest that same data into the shared raw collection.
Option 2 may be the only option for imsim if data IDs are reused for new simulation runs.
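The trade-off described above can be sketched with a toy registry keyed on the dataset UUID. Everything here is hypothetical: `ToyRegistry`, `make_id`, the namespace, and the collection names stand in for the real Butler machinery, and the dataset type is left out of the hashed name for brevity.

```python
import uuid

# Hypothetical namespace for this sketch only.
NS_EXAMPLE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def make_id(data_id, run=None):
    """Build a UUID5 from the data ID, optionally including the run."""
    name = ",".join(f"{key}={data_id[key]}" for key in sorted(data_id))
    if run is not None:
        name = f"run={run},{name}"
    return uuid.uuid5(NS_EXAMPLE, name)


class ToyRegistry:
    """Stand-in for a registry whose primary key is the dataset UUID."""

    def __init__(self):
        self.datasets = {}  # maps dataset UUID -> run collection name

    def ingest(self, data_id, run, include_run):
        dataset_id = make_id(data_id, run if include_run else None)
        if dataset_id in self.datasets:
            raise ValueError(
                f"{dataset_id} already ingested in {self.datasets[dataset_id]!r}"
            )
        self.datasets[dataset_id] = run
        return dataset_id


data_id = {"instrument": "DECam", "detector": 10, "exposure": 123456}

# Option 1: the private ingest succeeds, the later shared ingest collides.
reg1 = ToyRegistry()
reg1.ingest(data_id, "u/someone/decam-raw", include_run=False)
try:
    reg1.ingest(data_id, "DECam/raw/all", include_run=False)
    option1_conflict = False
except ValueError:
    option1_conflict = True

# Option 2: the run is part of the ID, so both ingests can coexist.
reg2 = ToyRegistry()
reg2.ingest(data_id, "u/someone/decam-raw", include_run=True)
reg2.ingest(data_id, "DECam/raw/all", include_run=True)
```

Under these assumptions the second ingest into `reg1` is rejected, while `reg2` ends up holding two entries for the same data ID. The same mechanism would let a new imsim run reuse data IDs under option 2, at the cost of losing the "same raw, same UUID everywhere" property unless run names are held fixed by convention.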
I'd vote for a combination of (1) and (2), where different kinds of raws are handled differently. Anything we could do with (3) is probably better expressed as a super-strong policy convention for naming certain collections and then using (2).
A lot of data is pretty clear-cut on which it should use:
That leaves Rubin data, where the existence of multiple controllers and hence multiple raws for the same observation complicates things. I'm hoping that in every case we can declare one version of each observation the canonical one and use (1) on those raws, but I don't know the landscape well enough to know if that's safe, or to have a good proposal for the data that isn't or might not be canonical. I'm hoping someone who does know that landscape well can write down an explicit proposal for all of the kinds of Rubin raws we have, so we can see if that maps well to any Butler organization concepts we could dispatch on.