Fix Version/s: None
https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/datastores/s3Datastore.py#L199-L205 checks to see if an object exists before writing it. With AWS and likely other object stores, relaxed consistency guarantees mean that there could be a race condition here if multiple simultaneous jobs are trying to write. Furthermore, the additional API call is pretty wasteful as it is almost always going to return false, and it may help trigger rate limiting.
We should consider whether we can live with overwriting of a dataset.
- is blocked by
DM-24247 butler validation error in ci_hsc_gen3
- is triggering
DM-26278 Investigate situation of deleting and adding to the same butler collection simultaneously
- To Do
Note that lookups in internal records are expensive if done individually in many-dataset contexts like ingest, and would probably be subject to race conditions as well unless we're very careful (which may require new Database-level APIs, not just Datastore changes).
We can get safe overwrites purely just by selecting a versioned bucket at bucket creation time. Then overwrites update the old file, but the old object is still retrievable. Then deleting doesn't - it instead reverts to an older version of the stored object. There are no existing tools, however, that facilitate working with objects in versioned buckets in datastore so human intervention will be required on bad writes/overwrites. Note also that this doesn't resolve the issue of parallel writes - it's only clear that if the upload was successful, eventually both versions of the files will be written correctly, but which version will be the latest version (i.e. front facing one) is not clear.
For my personal interest, what sparked this issue? Has anyone hit throttling?
Ingest uses a slightly different code path to put and I see that for ingest where we are copying from local file to S3 we don't even try to check if the file exists already in S3. It still suffers from the same problem of misconfigured file template generating names that are not unique.
That would be on me. The way I originally considered ingest is that it's "special" in the sense it will only ever happen once at clean start of repo. I understand now that's not necessarily true and other use cases exist. I just never remembered that it might need fixing. I can volunteer to fix both mistakes. How depends, I guess, on what the consensus on overwriting behavior we (you...) reach here.
Kian-Tat Lim Following the discussion with Google where we have problems with pre-emptible jobs leaving files hanging around in S3, I've changed datastore so that it trusts registry when it's told to write a file and always overwrites. For now I issue a log warning but eventually we can remove that when we are comfortable. To protect against bad templates I also now do a mandatory template validation before starting the put.
I haven't thought too deeply about trash/emptyTrash/put since I'm not sure what registry is doing on that side of things. Maybe Jim Bosch can comment. Is it possible for a datastore.trash to happen on this dataset, then having a new put occurring at the same time as some other process is doing the emptyTrash (which could remove a file that was just written). This requires that the new put and the old dataset share the same dataId and collection. The files are written before we update the internal registry table so it's not beyond the realms of possibility for an emptyTrash to remove the same dataset.
Dino Bektesevic ingest has been completely rewritten so don't worry about it.
Is it possible for a datastore.trash to happen on this dataset, then having a new put occurring at the same time as some other process is doing the emptyTrash (which could remove a file that was just written). This requires that the new put and the old dataset share the same dataId and collection.
I believe it is possible; once the dataset has been trashed, the registry will allow a new one with the same dataset_type, collection, and data ID. And the result would be that the datastore internal records and the locations table both claim the dataset exists, even though it really doesn't. That's a pretty mild form of repo corruption for something that can only happen if someone is already doing something pretty inadvisable, and I think it's totally recoverable in the sense that you can just trash and empty trash again on those data IDs. But it's also a fairly hard one to detect (you actually have to try to get the dataset to notice something has gone awry), so if we're concerned that someone is going to do something inadvisable, it may be worth working harder to address it.
The simple, brute-force way to fix this would be to always include the dataset_id in the filename template; that will just avoid the clash entirely.
Another approach would involve defining a UNIQUE constraint in the internal datastore records that would prohibit that concurrent put. The constraint would need to at least include the URI and whatever columns need to go with that to cover entries that should be allowed to have the same URI (e.g. for composition). With that in place, I think a concurrent put that happens just before the trash is emptied will fail at the Registry level (perhaps not before it actually writes to datastore storage, but then that's just one more case of "don't trust datastore", because the database transaction will still roll back both the internal datastore records insertion and the dataset_location insertion).
Looks OK as far as it goes. I take no position on the potential conflicts with "trash" that I think are beyond the scope here.
I'll file a new ticket to ponder the "deleting from a collection in one process whilst replacing that data in the same collection from another" problem. It sounds fraught regardless.
The overwriting risk exists in the context of the file template being incomplete or the dataset being marked for deletion but emptyTrash not being called yet. I think a lot of this could be mitigated by datastore querying its record table to first see if a file of that name already exists but associated with another dataset_id. If it does exist but the file is marked to be trashed then we could empty the trash for that old ref and then continue. If it does exist but that ref is not marked for deletion we complain. Then in the S3 code I think we can safely overwrite if we get that far.