Details
- Type: Bug
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: daf_butler
- Story Points: 3
- Team: Architecture
- Urgent?: No
Description
In DM-31956, deadlocks are reported when several end-of-workflow dataset transfers run simultaneously. These appear to be caused by registering collections inside a long-lived transaction.
The simplest fix is to move the transaction inside the per-run/datasetType loop so that it covers only the dataset registry import and the datastore registration.
Upsides:
- The transaction will be smaller.
- Failures for some run/dataset type combinations could still allow other datasets to be transferred (this can also be a downside). Should it raise an exception after it has done what it could?
Downsides:
- If the dataset import fails, the run collection will remain and be empty. We are already in that situation with dataset types. Should an empty run collection be removed on failure?
- A failed transfer could still have transferred some datasets but not others.
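The per-group transaction and its partial-failure behaviour can be sketched as follows. This is a minimal illustration only: `ToyRegistry`, `register_run`, `import_datasets` and `transfer` are hypothetical names for an in-memory stand-in, not the daf_butler API, and a negative dataset ID is used as an arbitrary way to force an import failure.

```python
from contextlib import contextmanager


class ToyRegistry:
    """Hypothetical in-memory stand-in for a registry (not daf_butler)."""

    def __init__(self):
        self.collections = set()
        self.datasets = {}  # (run, dataset_type) -> list of dataset IDs

    def register_run(self, run):
        # Called outside any transaction, so it is never rolled back.
        self.collections.add(run)

    @contextmanager
    def transaction(self):
        # Snapshot the dataset table and restore it if the block raises.
        snapshot = {key: list(ids) for key, ids in self.datasets.items()}
        try:
            yield
        except Exception:
            self.datasets = snapshot
            raise

    def import_datasets(self, run, dataset_type, ids):
        for dataset_id in ids:
            if dataset_id < 0:  # arbitrary failure trigger for the sketch
                raise ValueError(f"bad dataset ID {dataset_id}")
            self.datasets.setdefault((run, dataset_type), []).append(dataset_id)


def transfer(registry, grouped):
    """Import each run/datasetType group in its own short transaction."""
    failures = {}
    for (run, dataset_type), ids in grouped.items():
        registry.register_run(run)  # outside the transaction
        try:
            with registry.transaction():
                registry.import_datasets(run, dataset_type, ids)
        except ValueError as err:
            # Record the failure and carry on with the remaining groups.
            failures[(run, dataset_type)] = err
    return failures
```

In this sketch a failing group leaves its run collection registered but empty while the other groups are still imported, which is exactly the trade-off listed above. Whether the caller should raise once the loop finishes is left open, as in the description.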
A middle ground would be to create any collections up front and then leave the transaction where it is. If the import fails for any reason, catch the exception, remove the run collections that were created (and also the dataset types?) and reraise it. This would still involve a long transaction but would leave registry and datastore in a consistent state. The cost is that once we are actually transferring data between buckets, a rollback will be very slow, since it would have to delete hundreds of thousands of datasets that it had spent ages copying in.
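A rough sketch of this middle ground, again using a hypothetical in-memory `ToyRegistry` rather than the real daf_butler interfaces (all method names here are assumptions made for illustration):

```python
from contextlib import contextmanager


class ToyRegistry:
    """Hypothetical in-memory stand-in for a registry (not daf_butler)."""

    def __init__(self):
        self.collections = set()
        self.datasets = {}  # (run, dataset_type) -> list of dataset IDs

    def register_run(self, run):
        self.collections.add(run)

    def remove_run(self, run):
        self.collections.discard(run)

    @contextmanager
    def transaction(self):
        # Snapshot the dataset table and restore it if the block raises.
        snapshot = {key: list(ids) for key, ids in self.datasets.items()}
        try:
            yield
        except Exception:
            self.datasets = snapshot
            raise

    def import_datasets(self, run, dataset_type, ids):
        for dataset_id in ids:
            if dataset_id < 0:  # arbitrary failure trigger for the sketch
                raise ValueError(f"bad dataset ID {dataset_id}")
            self.datasets.setdefault((run, dataset_type), []).append(dataset_id)


def transfer_all_or_nothing(registry, grouped):
    """Register runs up front, import everything in one transaction,
    and remove only the newly created runs if the import fails."""
    created = []
    for run, _dataset_type in grouped:
        if run not in registry.collections:
            registry.register_run(run)  # outside the long transaction
            created.append(run)
    try:
        with registry.transaction():
            for (run, dataset_type), ids in grouped.items():
                registry.import_datasets(run, dataset_type, ids)
    except Exception:
        # Clean up only what this call created, then reraise.
        for run in created:
            registry.remove_run(run)
        raise
```

Tracking `created` separately means that run collections that existed before the transfer are never removed by the cleanup.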
Issue Links
- relates to DM-31956 mergeExecutionButler task hits database deadlock intermittently (Done)
I've left some follow-up comments on the daf_butler PR, but I think any remaining changes at this point are quite straightforward.