I've discovered various ways in which dataset deletion is unsafe or incomplete on
DM-21849, and as that's already ballooned into a megaticket, I'm punting a lot of that to here.
The main reason this is hard is that deletion is a fight between transactionality and foreign keys; we generally need three steps.
1) Use the Registry to identify a bunch of datasets to delete.
2) Actually delete those from the Datastore
3) Actually delete those from Registry.
We need to do (2) before (3) so we can be sure we've actually removed (e.g.) a file from storage before we delete its Registry representation; this is enforced by the foreign key from dataset_storage (aside: we should really rename that to dataset_location) to dataset, and the fact that has no ON DELETE clause.
But deleting from Datastore first is actually really scary:
- Datastore atomicity and exception safety is totally ad-hoc and implemented by us. It basically doesn't exist right now (e.g. https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/datastores/posixDatastore.py#L286). So if we delete one dataset artifact in Datastore, and then either Datastore fails on a later one or Registry fails due to some integrity constraint, we roll back the transaction, and BOOM: corrupt data repository.
- Datasets have a ton of relationships in Registry, so later failures due to integrity constraints are actually pretty likely. These include relationships to each other via composition and provenance and relationships to collections. Those are mostly well captured by foreign key constraints, and we can somewhat use ON DELETE rules to make sure step (3) doesn't fail as often. But it still can fail, for two reasons:
- If an ON DELETE rule triggers on a dataset we didn't tell Datastore to delete in step (2), we'll violate (for good reason) the dataset_storage foreign key that intentionally lacks an ON DELETE rule.
- While an ON DELETE rule can take care of the row that associates a dataset with a particular quantum, I think we need to delete the entire quantum that references a deleted dataset - otherwise our provenance is a lie, which is much worse than being nonexistent. As that's another table removed, I don't think there's an easy way to do this.
I also think I've heard cs2018 say that he's not a huge fan of ON DELETE rules in the first place.
I can think of a few angles with which to approach this:
- We could add sophistication to step (1) to make it guarantee that it's accounted for all Registry relationships when determining what we need to send to Datastore. It could even be responsible for actually deleting related non-dataset entities in a transaction in advance, passing that transaction onto steps (2) and (3) via a context manager. It certainly means we'd have to do a lot of querying on the datasets we're told to delete, even if the user just did a bunch of similar querying to find them.
- We can try to make at least PosixDatastore deletion more atomic, or at least make failures more recoverable (e.g. move files to temporary locations when "deleting them", then try to actually remove them during commit).
- We could require Datastores to be robust against failure-to-delete instead of requiring that their deletion be atomic and transactional, and hence declare the dataset_storage table the only source of truth on whether a dataset exists in a Datastore (so e.g. orphaned files may exist, but must be ignored/clobbered by Datastore when encountered). We'd then just commit the Registry deletion transaction before trying to delete from Datastore, and live with whatever level of success we can get there.
The last one feels like it might have to be at least a part of the solution if we really want strong guarantees, but there may be others, and if ktl wants to get more involved in Registry development, this may be a good place to start - I imagine this is an area he's got a lot more experience than I have.
More information can be found in the extensive TODO comments in Butler.purge (on the tickets/
DM-21849 branch right now) and I'll probably be adding some to ctrl_mpexec as well, as it looks like I'm going to have to add but not implement at least one command-line option because daf_butler lacks the deletion interfaces it will need (particularly deletion of entire RUN collections).
I'm setting Team to Arch for now because I know I won't get to it soon, but I figure I'll steal it back if it's not done after I finish a few other Registry tickets.