Details
- Type: Bug
- Status: Done
- Resolution: Done
- Fix Version/s: None
- Component/s: daf_butler
- Story Points: 1
- Epic Link:
- Team: Data Release Production
- Urgent?: No
Description
When running a pipeline (at least through bps) the final mergeExecutionButler step will sometimes fail due to a database deadlock. A typical error looks like this:
lsst.daf.butler._butler INFO: Transferring 6635 datasets into Butler(collections=[], run=None, datastore='file:///repo/main/', registry='PostgreSQL@lsstdb1:main_20210215')
lsst.daf.butler.cli.utils ERROR: Caught an exception, details are in traceback:
Traceback (most recent call last):
  File "/software/lsstsw/stack_20210813/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1771, in _execute_context
    self.dialect.do_execute(
  File "/software/lsstsw/stack_20210813/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL: Process 13613 waits for ExclusiveLock on relation 203222 of database 16384; blocked by process 13779.
Process 13779 waits for RowShareLock on relation 202991 of database 16384; blocked by process 13613.
HINT: See server log for query details.
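The relation OIDs in the DETAIL lines can be mapped back to table names on the server side, which would show which registry tables are involved in the lock cycle. A minimal sketch, assuming direct psycopg2 access to the same repository database (the host, database name, and user below are placeholders loosely based on the registry string in the log, not real credentials):

import psycopg2

# Placeholder connection parameters; substitute the real repo database credentials.
conn = psycopg2.connect(host="lsstdb1", dbname="main_20210215", user="butler_reader")

with conn, conn.cursor() as cur:
    # pg_class maps the relation OIDs reported in the deadlock DETAIL to table names.
    cur.execute(
        "SELECT oid, relnamespace::regnamespace AS schema, relname "
        "FROM pg_class WHERE oid IN (203222, 202991)"
    )
    for oid, schema, relname in cur.fetchall():
        print(oid, schema, relname)

conn.close()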
I tried again a couple of times and got similar errors before it finally succeeded on the fourth try, without my changing anything; I guess whatever was causing the deadlock had cleared up by then. I do note, however, that in each of the three failures the relation numbers in the error messages were the same, though of course the process numbers were different. That is, the errors all said
Process XXXXX waits for ExclusiveLock on relation 203222 of database 16384; blocked by process YYYYY.
...
Process YYYYY waits for RowShareLock on relation 202991 of database 16384; blocked by process XXXXX.
So XXXXX and YYYYY were naturally different in each job, but 203222 and 202991 were always the same, in case that helps.
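Since an unmodified retry eventually succeeded, one interim mitigation would be to retry the whole transaction when PostgreSQL aborts it with a deadlock. A minimal sketch of such a wrapper; the helper name, attempt count, and delay are made up for illustration and are not part of daf_butler or bps:

import time

import psycopg2.errors
import sqlalchemy.exc


def run_with_deadlock_retry(transfer, max_attempts=4, delay=30.0):
    """Run ``transfer`` (a callable performing the whole transaction),
    retrying if PostgreSQL aborts it with a deadlock.

    Hypothetical helper for illustration only; the real merge step is
    driven by bps/pipetask, not by a wrapper like this.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transfer()
        except sqlalchemy.exc.OperationalError as err:
            # SQLAlchemy wraps the driver exception; only retry genuine deadlocks.
            if not isinstance(err.orig, psycopg2.errors.DeadlockDetected):
                raise
            if attempt == max_attempts:
                raise
            time.sleep(delay)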
Andy Salnikov, this is a tiny change, but because it's a subtle concurrency-related change, I'm hoping you can take a look at Database.sync in its entirety and see if you agree with my conclusion that the outcome of a race condition that occurs before, after, or during it, but within the same (possibly outer) transaction, is either an error (with rollback) or a logically correct write, i.e. not any kind of silently bad write.
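For reference, the property being claimed is the usual one for a select-then-insert performed inside a single transaction: a concurrent writer can only make it fail (and roll back) or leave it seeing a consistent row, never silently corrupt anything. A simplified sketch of that pattern, written against a hypothetical SQLAlchemy table; this is not the actual Database.sync implementation:

import sqlalchemy


def sync_row(engine, table, keys, defaults):
    """Get-or-insert a row matching ``keys``.

    Hypothetical helper, not daf_butler's Database.sync.  Under concurrency
    the outcome is an existing consistent row, a newly inserted row, or an
    IntegrityError that rolls the transaction back -- never a silent bad write.
    """
    with engine.begin() as conn:
        # Look for an existing row first.
        sel = sqlalchemy.select(table).where(
            *(table.c[k] == v for k, v in keys.items())
        )
        row = conn.execute(sel).first()
        if row is not None:
            return row, False
        # No row yet: try to insert.  If another transaction inserted the same
        # keys in the meantime, the unique constraint fires and the whole
        # (possibly outer) transaction rolls back with IntegrityError.
        conn.execute(sqlalchemy.insert(table).values(**keys, **defaults))
        return conn.execute(sel).first(), True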
PR is here: https://github.com/lsst/daf_butler/pull/579
I'm going to repurpose DM-27224 for fixing the other rarer deadlocks noted in my last post; I'll comment there with my ideas.