We currently require a dataset to exist in Registry in order for it to exist in a Datastore's internal records or the dataset_location table, enforced via a foreign key. Maintaining that constraint when datasets are deleted is tricky, and has led to a "trash" system that:
- is problematic in multi-user contexts (one user's delete operation can attempt to empty trash left over from a different user's failure, which might be disastrous if the two users have different permissions on the backing storage);
- largely defeats the purpose of the constraint anyway: datasets can be deleted from the Registry without actually being deleted from Datastore's backing storage, because they are merely trashed in Datastore's records.
In addition, all of our schemes for shared-database-free batch execution (see DMTN-177) also violate this constraint in spirit, by allowing Datastore files to exist long before their database records (for either Registry or Datastore side) exist.
Finally, one of the main reasons we had this constraint in the first place (preservation of unique dataset IDs by delegating their creation to Registry) has effectively gone away with the move to UUIDs.
A better solution to this problem needs serious thought, but I think it ought to involve the following principles:
- A dataset may exist in either Registry or Datastore without existing in the other. For example, in no-shared-database batch, output datasets will exist only in Datastore until they are "transferred back", and some intermediate datasets may be deleted from a Datastore (or never moved to permanent storage) while being retained in Registry for provenance.
- If a dataset's file exists in a Datastore, its records must exist in some location managed by the data repository (findable via configuration in the data repository root). That will often be the Registry database (after processing outputs are transferred back), but it could also be a per-quantum file during execution, a per-run file used in Rucio-Butler communications, or some kind of future more scalable database used during execution. The important point here is that Datastore should never have to check underlying storage for file existence to satisfy an existence check call, except perhaps in some limited failure-recovery modes where we have reason to believe that things have gotten out of sync.
- We should only provide atomic-operation guarantees in deletion that we can easily and efficiently delegate to the systems we use under the hood (i.e. filesystems, object stores, and SQL databases); we just need to be clear about the guarantees that we do provide. I think a database transaction rollback does count as "easy and efficient", while moving files in an object store to a temporary location prior to actually deleting them (in order to make that deletion reversible) probably does not, and doing something similar with hard-links on filesystems that support them is somewhere in between.
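To make the second principle above concrete, here is a minimal sketch of answering an existence check purely from managed record locations, consulting (for example) the Registry database first and then any per-quantum or per-run record files, without ever touching the backing storage. All names here are hypothetical illustrations, not actual daf_butler API, and the in-memory `RecordStore` stands in for whatever concrete stores the repository configuration points at:

```python
from typing import Iterable, Mapping, Optional
from uuid import UUID, uuid4


class RecordStore:
    """One managed location where Datastore records may live: the Registry
    database, a per-quantum file, a per-run file, etc.  This in-memory
    stand-in is purely illustrative."""

    def __init__(self, records: Mapping[UUID, dict]):
        self._records = dict(records)

    def lookup(self, dataset_id: UUID) -> Optional[dict]:
        return self._records.get(dataset_id)


def dataset_exists(stores: Iterable[RecordStore], dataset_id: UUID) -> bool:
    """Answer an existence check by consulting record stores in their
    configured order; the backing file storage is never stat-ed."""
    return any(store.lookup(dataset_id) is not None for store in stores)
```

The point of the sketch is only that every store consulted is one the data repository manages, so a "file exists" answer never requires a round trip to the object store or filesystem.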
I think we should also consider dropping the dataset_location table entirely, and instead have:
- a Python interface for vectorized Datastore existence and storage-metadata checks (e.g. URI, checksum, file size) that's accessible to the Registry query system through the bridge interface;
- an interface that provides access to SQLAlchemy tables or subqueries exposing Datastore records with these columns (which would return `None` when the records are not in the Registry database, or indicate in some other way that some records may not be there).
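As a rough illustration of the first bullet, a vectorized existence and storage-metadata check might look like the following. The names (`StorageInfo`, `DatastoreBridge`, `check_many`) are hypothetical and the dict-backed store is a stand-in for the Datastore's real record store, not a proposed implementation:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional
from uuid import UUID, uuid4


@dataclass(frozen=True)
class StorageInfo:
    """Per-file storage metadata a Datastore check would return."""

    uri: str
    file_size: int
    checksum: Optional[str] = None


class DatastoreBridge:
    """In-memory stand-in for the bridge the Registry query system would
    call into for Datastore existence checks."""

    def __init__(self, records: Dict[UUID, StorageInfo]):
        self._records = dict(records)

    def check_many(
        self, dataset_ids: Iterable[UUID]
    ) -> Dict[UUID, Optional[StorageInfo]]:
        """Vectorized existence + storage-metadata check: one pass over the
        Datastore's own records, never a per-file stat of backing storage.
        Datasets with no Datastore record map to None."""
        return {did: self._records.get(did) for did in dataset_ids}
```

The vectorized shape matters because the Registry query system would typically ask about many datasets at once, and a single batched lookup against the record store (or a join against a SQLAlchemy subquery with the same columns) avoids per-dataset round trips.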
This ticket is not necessarily the one where we do the work - I expect its output to be some combination of a technote, a Confluence page, and follow-up tickets.
tjenness, this is something I've been thinking about in a shallow sense for quite a while, and this ticket is my attempt to capture that thinking in case you'd like to take it the rest of the way (or push back on parts you disagree with). I'm also happy to pick it up myself later, but I'm not planning to do that until the DM-31725 (etc.) query system work is complete.