  1. Request For Comments
  2. RFC-778

Determine UUID5 scheme to use for raw data ingest

    Details

    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM
    • Labels:
      None

      Description

      In RFC-777 we discuss the proposal to change the main butler registries from using incrementing integer IDs to UUIDs. As part of that we need to decide how we ingest raw data. There is a desire for a raw file ingested into one repository to have the same UUID as it would have when ingested into a separate repository. This can significantly simplify provenance tracking when data processed at one location needs to be ingested into another independent repository where the raw files were ingested independently.

      We will implement the predictability using UUID5. There are three possible options currently being discussed and opinions would be welcomed on these:

      • Use just the DatasetType + DataID (instrument/detector/exposure).
      • Use the DatasetType + DataID and the run collection.
      • Have the ingest raw task itself generate the UUID5 using some other scheme. This scheme would have to be determined.

      The first two options are currently supported directly by Butler. With UUID-based registries, any ingest can in theory come up with its own UUID5 scheme (this is how import works).

      If option 1 is used, a registry can only ever ingest one copy of a given raw, regardless of collection. This could be problematic if a user ingests some raw DECam data they have into a private collection and then the observatory attempts to ingest that same data into the shared raw collection.

      Option 2 may be the only option for imsim if dataIds are reused for new simulation runs.
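
      As a rough illustration of how options 1 and 2 differ, here is a minimal sketch using Python's standard `uuid` module. It is not Butler's implementation: the namespace, the key string layout, the function names, and the collection names `DECam/raw/all` and `u/someone/raw` are all assumptions made up for this example.

```python
import uuid

# Hypothetical namespace for illustration only; the real namespace is not stated here.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def make_key(dataset_type, data_id, run=None):
    """Build a canonical string from dataset type, data ID and optional run."""
    parts = [f"dataset_type={dataset_type}"]
    # Sort the data ID keys so that dict ordering cannot change the result.
    parts.extend(f"{key}={data_id[key]}" for key in sorted(data_id))
    if run is not None:
        parts.append(f"run={run}")
    return ",".join(parts)


def option1_uuid(dataset_type, data_id):
    """Option 1: dataset type + data ID only."""
    return uuid.uuid5(NAMESPACE, make_key(dataset_type, data_id))


def option2_uuid(dataset_type, data_id, run):
    """Option 2: dataset type + data ID + run collection."""
    return uuid.uuid5(NAMESPACE, make_key(dataset_type, data_id, run))


data_id = {"instrument": "DECam", "detector": 10, "exposure": 123456}

# Option 1 ignores the collection, so a private and a shared ingest of the
# same raw would map to the same UUID and could not coexist in one registry.
assert option1_uuid("raw", data_id) == option1_uuid("raw", dict(data_id))

# Option 2 yields a different UUID per run collection, so both copies can exist.
shared = option2_uuid("raw", data_id, "DECam/raw/all")
private = option2_uuid("raw", data_id, "u/someone/raw")
assert shared != private
```

      Under these assumptions, option 2 would also keep reused data IDs apart for new simulation runs, provided each run is ingested into a differently named collection.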

            Activity

            jbosch Jim Bosch added a comment - - edited

            I'd vote for a combination of (1) and (2), where different kinds of raws are handled differently. Anything we could do with (3) is probably better expressed as a super-strong policy convention for naming certain collections and then using (2).

            A lot of data is pretty clear-cut on which it should use:

            • Any mature physical instrument that already has a universally unique mapping from data ID to a particular raw can use (1). That list includes at least HSC, DECam and CFHT.
            • Simulated data must only use (2), and embed a unique identifier for the simulation version into the collection (the "2.2i/raw/all" name we use for DC2 ImSim satisfies this, I believe).

            That leaves Rubin data, where the existence of multiple controllers and hence multiple raws for the same observation complicates things. I'm hoping that in every case we can declare one version of each observation the canonical one and use (1) on those raws, but I don't know the landscape well enough to know if that's safe, or to have a good proposal for the data that isn't or might not be canonical. I'm hoping someone who does know that landscape well can write down an explicit proposal for all of the kinds of Rubin raws we have, so we can see if that maps well to any Butler organization concepts we could dispatch on.

            tjenness Tim Jenness added a comment -

            (1) might work for Rubin data where we know we are the source of truth. It might not work for CFHT/DECam data, in the sense that we are not the primary holder and people may bring in their own raws from outside that they want to ingest into their own collections. Effectively, that means we should only use (1) in the global shared raw collections, which really means that we are using (2).

            tjenness Tim Jenness added a comment -

            Also, the multiple controller issue is not a problem. Those files have distinct obsid/exposure_id. The problem is the current situation where camera always writes a file but with a cut-down header (no external information) and archiver writes the file with the full header. If you try to ingest both into the same repo you can't, because their definitions of the exposure dimension differ.

            jbosch Jim Bosch added a comment -

            After thinking about this more from the perspective of repositories with science users, where they will be ingesting HSC, DECam, CFHT data (etc.) rather than us, I think I agree with Tim that we should include the collection in the UUID all the time. It leaves the door open for the same raw to be ingested into a repo multiple times, but it seems plausible that we will want that in many cases, and adhering to collection conventions gives us a way to keep that under control.

            tjenness Tim Jenness added a comment -

            The consensus seems to be that we should play it safe and derive UUID5 for raw data ingest using collection name, dataset type, and dataId.
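
            To make the consequence of that choice concrete, the sketch below simulates two independent repositories with a toy in-memory class. Everything in it is hypothetical and for illustration only: `ToyRegistry`, `raw_uuid`, the namespace, the key layout, and the collection names are assumptions, not Butler API.

```python
import uuid

# Hypothetical namespace; the one used by a real registry is not given here.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def raw_uuid(run, dataset_type, data_id):
    """Derive a UUID5 from collection name, dataset type and data ID."""
    items = ",".join(f"{key}={value}" for key, value in sorted(data_id.items()))
    return uuid.uuid5(NAMESPACE, f"{run}|{dataset_type}|{items}")


class ToyRegistry:
    """Stand-in for one independent repository: maps UUID -> (run, data ID)."""

    def __init__(self):
        self.datasets = {}

    def ingest(self, run, data_id):
        dataset_id = raw_uuid(run, "raw", data_id)
        if dataset_id in self.datasets:
            raise ValueError(f"{dataset_id} already ingested into {run}")
        self.datasets[dataset_id] = (run, dict(data_id))
        return dataset_id


site_a, site_b = ToyRegistry(), ToyRegistry()
data_id = {"instrument": "HSC", "detector": 50, "exposure": 903334}

# Independent ingests into identically named collections agree on the UUID,
# which is what makes provenance portable between repositories.
id_a = site_a.ingest("HSC/raw/all", data_id)
id_b = site_b.ingest("HSC/raw/all", data_id)
assert id_a == id_b

# The same raw can also live in a private collection of the same repository.
id_private = site_a.ingest("u/someone/raw", data_id)
assert id_private != id_a
```

            In this sketch the agreement between sites depends entirely on both using the same collection name, which is why the collection naming conventions mentioned in the comments above matter.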


              People

              Assignee:
              tjenness Tim Jenness
              Reporter:
              tjenness Tim Jenness
              Watchers:
              Andy Salnikov, Eli Rykoff, James Chiang, Jim Bosch, John Parejko, Kian-Tat Lim, Meredith Rawls, Michelle Gower, Robert Lupton, Tim Jenness, Yusra AlSayyad
