In RFC-777 we discuss the proposal to change the main butler registries from using incrementing integer IDs to UUIDs. As part of that we need to decide how we ingest raw data. There is a desire for a raw file ingested into one repository to have the same UUID as it would have when ingested into a separate repository. This can significantly simplify provenance tracking when data processed at one location need to be ingested into another repository where the raw files were ingested independently.
We will make the UUIDs predictable by using UUID5, which is name-based: the same input always yields the same UUID. There are three options currently being discussed and opinions on them would be welcome:
- Use just the DatasetType + DataID (instrument/detector/exposure).
- Use the DatasetType + DataID and the run collection.
- Have the ingest raw task itself generate the UUID5 using some other scheme. This scheme would have to be determined.
The first two options are currently supported directly by Butler. With UUID-based registries, any ingest code can in theory come up with its own UUID5 scheme (this is how import works).
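To make the difference between the first two options concrete, here is a minimal sketch using Python's standard `uuid` module. It is not Butler's actual implementation: the `NAMESPACE` value, the `raw_uuid` helper, the string layout and the run names are all assumptions made up for illustration.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, provided every site uses the same one.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def raw_uuid(dataset_type, data_id, run=None):
    """Return a deterministic UUID5 for a dataset (illustrative helper only)."""
    parts = [f"dataset_type={dataset_type}"]
    # Sort the data ID keys so the result does not depend on dict ordering.
    parts += [f"{key}={data_id[key]}" for key in sorted(data_id)]
    if run is not None:
        # Option 2: the run collection becomes part of the hashed name.
        parts.append(f"run={run}")
    return uuid.uuid5(NAMESPACE, ",".join(parts))


data_id = {"instrument": "DECam", "detector": 10, "exposure": 123456}

# Option 1: the same raw gets the same UUID at every site, in any key order.
site_a = raw_uuid("raw", data_id)
site_b = raw_uuid("raw", dict(reversed(list(data_id.items()))))

# Option 2: the UUID is only reproducible if the run name also matches.
shared = raw_uuid("raw", data_id, run="DECam/raw/all")
private = raw_uuid("raw", data_id, run="u/someone/raw")

print(site_a == site_b)
print(shared == private)
```

Under option 2, two repositories only agree on a raw's UUID if they also agree on the name of the run collection it is ingested into.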
If option 1 is used, a registry can only ever hold one copy of a given raw, regardless of which collection it is in. This could be problematic if a user ingests some raw DECam data they have into a private collection and then the observatory attempts to ingest that same data into the shared raw collection.
Option 2 may be the only option for imsim if data IDs are reused for new simulation runs, since the run collection would then be the only thing distinguishing the datasets.
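The trade-off can be illustrated with a toy model. This is a sketch under stated assumptions, not real Butler code: `ToyRegistry`, its `ingest` method, the `NAMESPACE` value and the run names are hypothetical, and the only behaviour modelled is a uniqueness constraint on the dataset UUID.

```python
import uuid

# Hypothetical namespace shared by all sites.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


class ToyRegistry:
    """Stand-in for a registry that requires dataset UUIDs to be unique."""

    def __init__(self, include_run):
        self.include_run = include_run  # False models option 1, True option 2
        self.datasets = {}

    def ingest(self, data_id, run):
        # Sorting the items makes the hashed name independent of dict ordering.
        name = f"raw:{sorted(data_id.items())}"
        if self.include_run:
            name += f":{run}"
        dataset_id = uuid.uuid5(NAMESPACE, name)
        if dataset_id in self.datasets:
            raise ValueError(f"{dataset_id} already ingested into {self.datasets[dataset_id]}")
        self.datasets[dataset_id] = run
        return dataset_id


data_id = {"instrument": "DECam", "detector": 10, "exposure": 123456}

# Option 1: the private ingest blocks the later shared ingest of the same raw.
option1 = ToyRegistry(include_run=False)
option1.ingest(data_id, "u/someone/raw")
try:
    option1.ingest(data_id, "DECam/raw/all")
    conflict = False
except ValueError:
    conflict = True

# Option 2: both ingests succeed because the run is part of the hashed name.
option2 = ToyRegistry(include_run=True)
option2.ingest(data_id, "u/someone/raw")
option2.ingest(data_id, "DECam/raw/all")

print(conflict, len(option2.datasets))
```

The same mechanism is what would let imsim reuse data IDs across simulation runs under option 2: each run produces a distinct UUID for an otherwise identical data ID.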