I have not thought about this in much depth, and don't plan to until Andy Salnikov has had a chance to look at it, but here are some quick thoughts on Tim Jenness's questions:
- Where are UUIDs calculated? By Registry?
- Should we allow an external source for UUID?
- Is there a case for datastore to allocate IDs itself which can then be used by registry?
I think a big advantage of UUIDs (if they work the way I think they might) is that they could come from anywhere, including external sources, and potentially datastore (someday), so I think we want interfaces that leave room for that (e.g. all ingest/import operations should be able to accept and use an existing UUID, with the caller guaranteeing that those IDs meet whatever specs for UUIDs we come up with). But I don't yet see a need for anything other than a Registry implementation for generating them in the near future.
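To make "leave room for that" concrete, here's a minimal sketch of the kind of signature I have in mind; the ingest function and its parameters are hypothetical, not a proposal for an actual daf_butler API:

```python
import uuid
from typing import Optional

def ingest(dataset_ref, dataset_id: Optional[uuid.UUID] = None) -> uuid.UUID:
    """Ingest a dataset, minting a new UUID only if the caller didn't supply one.

    A caller that passes dataset_id is asserting that it already satisfies
    whatever uniqueness guarantees we settle on.
    """
    if dataset_id is None:
        dataset_id = uuid.uuid4()  # Registry-style random generation
    # ... record dataset_ref under dataset_id ...
    return dataset_id
```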
Should we allow UUIDs to be calculated from a datasetRef (dataset type / dataId and run name)? This could be an alternative implementation of DM-21794, allowing raw data to have predictable IDs.
I think that's a good generalization for meeting the raw-data use case, and there may be a few other cases where we'd want to use that type of ID, because the definition of the dataset really can be globally consistent. But we definitely don't want this to be the only way to generate UUIDs, because I think we really want UUID equality across different repos to mean "these datasets really are identical", and the dataset type and run name (in particular) have to be very special for them to be sufficient for global uniqueness.
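For reference, the standard way to get that kind of deterministic ID is a name-based (version-5) UUID; this is just a sketch, and the namespace constant and the field canonicalization here are assumptions, not a worked-out scheme:

```python
import uuid

# Hypothetical namespace; any fixed UUID agreed upon across repos would do.
DATASET_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "butler.lsst.org")

def deterministic_dataset_id(dataset_type: str, data_id: dict, run: str) -> uuid.UUID:
    """Derive a reproducible (version-5) UUID from the dataset's defining fields."""
    # Canonicalize the data ID so key ordering cannot change the hash.
    data_id_str = ",".join(f"{k}={data_id[k]}" for k in sorted(data_id))
    return uuid.uuid5(DATASET_NAMESPACE, f"{dataset_type}:{data_id_str}:{run}")

# The same inputs always yield the same ID, in any repo that agrees on the scheme:
deterministic_dataset_id("raw", {"instrument": "LSSTCam", "exposure": 42, "detector": 7}, "raw/all")
```

Any two repos that agree on the namespace and canonicalization get the same ID for the same (dataset type, dataId, run), which is exactly why those fields have to be very special for this to be safe.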
Regarding datastore issuing UUIDs. I was more thinking that pipe_base would generate all the dataset UUIDs when it forms the lightweight registry. Then datastore would use whatever ID it finds in the DatasetRef that it's told to store. Do we really have a need for datastore to be used without an associated Butler?
This sounds totally reasonable for now (and maybe forever), though I'm not sure I understood what you meant by the last question.
But I do feel like there's one bit of information that falls through the cracks a bit in our staging-repo processing plans, both with and without UUIDs: the datastore-internal records, like the formatter and read storage class information. That's mostly okay, because we can mostly rely on the workflow system to make sure the configuration that determines those values is the same in the job repo that actually puts the datasets and the shared repo we access them from later. But it means taking maintenance of what should be Datastore's internal invariants out of its hands and giving it to much higher-level workflow code, where many more things can go wrong. I feel like it might be a better long-term plan to write the datastore records to a scalable NoSQL database at the same time that we write the file itself to the object store, and never try to put them into the PostgreSQL database later.
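A minimal sketch of that write path, assuming a generic object-store client and a DynamoDB-style key-value table; every name here is hypothetical, not a real Butler API:

```python
import uuid

def put_with_records(object_store, records_table, dataset_id: uuid.UUID,
                     payload: bytes, formatter: str, storage_class: str) -> None:
    """Write the file and its datastore-internal record in concert."""
    path = f"datasets/{dataset_id}"
    object_store.put(path, payload)  # write the file itself
    # Write the datastore record at the same time, keyed by UUID, so
    # Datastore's invariants never depend on higher-level workflow code.
    records_table.put_item({
        "dataset_id": str(dataset_id),
        "path": path,
        "formatter": formatter,
        "storage_class": storage_class,
    })
```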
That also ties in with my still-vague thoughts about getting rid of dataset_location and reworking our notion of what the source of truth is for dataset existence and deletion. Basically, if datastore records (as well as files) can exist first, with their own UUID-based dataset_id values, totally independent of a Registry (but with datastore records and files always written in concert), then I think we get to the kind of really decoupled system we hoped for originally: Registry associates UUIDs with descriptive metadata and organizational structure, Datastore associates UUIDs with files and the information needed to read them, and Registry and Datastore don't have to talk to each other at all.
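In type terms, the decoupling amounts to the two components sharing nothing but the UUID key; a toy sketch (these Protocols are illustrative, not real daf_butler interfaces):

```python
import uuid
from typing import Protocol

class RegistryLike(Protocol):
    """Maps UUIDs to descriptive metadata and organizational structure."""
    def find_metadata(self, dataset_id: uuid.UUID) -> dict: ...

class DatastoreLike(Protocol):
    """Maps UUIDs to files and the information needed to read them."""
    def find_uri(self, dataset_id: uuid.UUID) -> str: ...

# Nothing above requires Registry and Datastore to talk to each other;
# the UUID is the only shared currency.
```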
But that depends on UUIDs being really magical about avoiding collisions, and I don't claim to understand the constraints they impose in order to pull that off, so I have no idea if that all hangs together (or if it raises new problems that make it not a good tradeoff). Hopefully Andy Salnikov will be able to tell us.
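For what it's worth, the "magic" for random (version-4) UUIDs is just the birthday bound on their 122 random bits; a rough back-of-the-envelope calculation (my arithmetic, not a guarantee about any particular generator):

```python
# Birthday-bound estimate for random (version-4) UUIDs: 122 random bits give
# P(at least one collision) ~= n**2 / 2**123 for n generated IDs.
n = 10**10  # a deliberately huge number of datasets
p = n**2 / 2**123
print(f"{p:.1e}")  # ~9.4e-18: negligible at any plausible scale
```

Of course that math only holds if the generator really is random, which I take to be part of the constraints mentioned above.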