Nate Lust has a workaround for this that involves creating new connections for at least every DB-mutating registry operation, which I've put on branch u/jbosch/
DM-17495/nate-closes-connections. The most obvious downside of this is that it makes the last few operations in gen2convert (either updating VisitDetectorRegion, adding curated calibrations, or both) extremely slow.
We think we've learned a few things from the success of that approach in resolving the deadlocks (and the failure of some of our other ideas):
- This is not just blocking timeouts - there are genuine deadlocks.
- Deadlock can occur when two concurrent connections first obtain a SHARED lock and then try to upgrade that to a PENDING/EXCLUSIVE lock, as we frequently do when running a SELECT followed by an INSERT within a single transaction.
- This is exacerbated by the fact that SQLite doesn't actually proceed to acquiring the PENDING/EXCLUSIVE lock when you try to commit - instead it waits until its memory cache is full, and only then really does the commit. That makes it very difficult to reason about when the code will actually try to obtain locks of different types. It also seems to be what Nate's fix most directly addresses - closing a connection also causes the commit to be executed immediately.
Nate also tried write-ahead locking (without the multiple connections changes, I believe), and these had no effect in isolation.
The first thing I'm going to try is to replace our usage of SELECT-then-INSERT with a direct INSERT inside a try block to catch conflicts, as Andy Salnikov has long advised. I think that's worth doing regardless of whether it's enough to fix the problem on its own. If it isn't, I'd like to try it with write-ahead locking and starting transactions with BEGIN IMMEDIATE. If I'm understanding how SQLite is doing this correctly (a huge "if"), that may do it.