The Butler design requires that there be at most one Dataset in each Collection with a given DatasetType and data ID. This is currently enforced in a fragile, concurrency-unsafe way in the Python Registry classes. To make it robust, we need to enforce it in the Registry database itself, which is tricky for several reasons:
- We don't have a single field that represents "the data ID" of a Dataset (but this is
- This might need to be a multi-table constraint (the alternative is denormalizing dataset_type_name and the DM-14821 packed data ID integer into the DatasetCollections join table).
- We've thus far assumed in various places that we can detect violations of this constraint before actually making any changes to the database, and without triggering a rollback of an ongoing transaction. It's now clear we cannot make the latter assumption. As for the former, we may sometimes have to "orphan" Datasets that are successfully inserted into the Dataset table but cannot be associated with a Collection.
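To make the denormalization alternative concrete, here's a minimal SQLAlchemy sketch (not the actual Registry schema — the table and column names, including data_id_pack, are hypothetical): a DatasetCollections join table that carries dataset_type_name and a packed data ID, so a single multi-column UNIQUE constraint can enforce "at most one Dataset per (Collection, DatasetType, data ID)". It also shows the violation being detected only at insert time via IntegrityError, with the enclosing transaction rolled back, which is exactly the situation that can leave an already-inserted Dataset row orphaned.

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, UniqueConstraint,
    create_engine, func, select,
)
from sqlalchemy.exc import IntegrityError

engine = create_engine("sqlite://")  # in-memory stand-in for the Registry DB
metadata = MetaData()

# Hypothetical denormalized join table; the real Registry schema may differ.
dataset_collections = Table(
    "dataset_collections",
    metadata,
    Column("dataset_id", Integer, primary_key=True),
    Column("collection", String, nullable=False),
    Column("dataset_type_name", String, nullable=False),   # denormalized
    Column("data_id_pack", Integer, nullable=False),       # packed data ID (DM-14821)
    # Single-table constraint enforcing "at most one Dataset per
    # (collection, dataset type, data ID)" in the database itself.
    UniqueConstraint(
        "collection", "dataset_type_name", "data_id_pack",
        name="uq_dataset_collections_dataset",
    ),
)
metadata.create_all(engine)


def associate(dataset_id, collection, dataset_type_name, data_id_pack):
    """Try to associate a Dataset with a Collection; return True on success.

    On a constraint violation the database raises IntegrityError *during*
    the insert, and the transaction is rolled back -- we cannot detect the
    problem beforehand without a race.  A Dataset row already committed to
    the Dataset table would then be left unassociated ("orphaned").
    """
    try:
        with engine.begin() as conn:  # transaction; rolls back on error
            conn.execute(
                dataset_collections.insert().values(
                    dataset_id=dataset_id,
                    collection=collection,
                    dataset_type_name=dataset_type_name,
                    data_id_pack=data_id_pack,
                )
            )
        return True
    except IntegrityError:
        return False


first = associate(1, "run1", "calexp", 42)
second = associate(2, "run1", "calexp", 42)  # same collection/type/data ID
```

In a longer multi-statement transaction one would wrap each insert in a SAVEPOINT (SQLAlchemy's `Connection.begin_nested()`) so the violation only rolls back that one statement, though driver support varies (the default pysqlite settings, for example, need a workaround).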
There's also a very relevant discussion of this on
Andy Salnikov, I've assigned this to you since you'd be much better than I would be at figuring out what kind of in-database constraint we want and how best to implement it in SQLAlchemy. But I've added a blocker on DM-14821, which I'll need to do first, and I'm expecting us to work together (possibly via new blocker tickets assigned to me) on making sure the assumptions we make in the rest of the Butler are consistent with what we can actually implement robustly as a database constraint.