Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-17495

Running PipelineTasks in parallel can lead to aborting on locks instead of blocking

    XMLWordPrintable

    Details

      Description

      Running PipelineTasks with ctrl_mpexec with -j can lead to SQLite errors about the database being locked.  It's unclear whether it's just not blocking long enough or not blocking at all, and also unclear whether we need to fix this by adding our own blocking or somehow informing SQLAlchemy or SQLite to do it.

      (It's also possible we do have a genuine deadlock bug, but I doubt it.)

        Attachments

          Activity

          Hide
          jbosch Jim Bosch added a comment -

          Nate Lust has a workaround for this that involves creating new connections for at least every DB-mutating registry operation, which I've put on branch u/jbosch/DM-17495/nate-closes-connections.  The most obvious downside of this is that it makes the last few operations in gen2convert (either updating VisitDetectorRegion, adding curated calibrations, or both) extremely slow.

          We think we've learned a few things from the success of that approach in resolving the deadlocks (and the failure of some of our other ideas):

          • This is not just blocking timeouts - there are genuine deadlocks.
          • Deadlock can occur when two concurrent connections first obtain a SHARED lock and then try to upgrade that to a PENDING/EXCLUSIVE lock, as we frequently do when running a SELECT followed by an INSERT within a single transaction.
          • This is exacerbated by the fact that SQLite doesn't actually proceed to acquiring the PENDING/EXCLUSIVE lock when you try to commit - instead it waits until its memory cache is full, and only then really does the commit.  That makes it very difficult to reason about when the code will actually try to obtain locks of different types.  It also seems to be what Nate's fix most directly addresses - closing a connection also causes the commit to be executed immediately.

          Nate also tried write-ahead locking (without the multiple connections changes, I believe), and these had no effect in isolation.

           

          The first thing I'm going to try is to replace our usage of SELECT-then-INSERT with a direct INSERT inside a try block to catch conflicts, as Andy Salnikov has long advised.  I think that's worth doing regardless of whether it's enough to fix the problem on its own.  If it isn't, I'd like to try it with write-ahead locking and starting transactions with BEGIN IMMEDIATE.  If I'm understanding how SQLite is doing this correctly (a huge "if"), that may do it.

          Show
          jbosch Jim Bosch added a comment - Nate Lust has a workaround for this that involves creating new connections for at least every DB-mutating registry operation, which I've put on branch u/jbosch/ DM-17495 /nate-closes-connections .  The most obvious downside of this is that it makes the last few operations in gen2convert (either updating VisitDetectorRegion, adding curated calibrations, or both) extremely slow. We think we've learned a few things from the success of that approach in resolving the deadlocks (and the failure of some of our other ideas): This is not just blocking timeouts - there are genuine deadlocks. Deadlock can occur when two concurrent connections first obtain a SHARED lock and then try to upgrade that to a PENDING/EXCLUSIVE lock, as we frequently do when running a SELECT followed by an INSERT within a single transaction. This is exacerbated by the fact that SQLite doesn't actually proceed to acquiring the PENDING/EXCLUSIVE lock when you try to commit - instead it waits until its memory cache is full, and only then really does the commit.  That makes it very difficult to reason about when the code will actually try to obtain locks of different types.  It also seems to be what Nate's fix most directly addresses - closing a connection also causes the commit to be executed immediately. Nate also tried write-ahead locking (without the multiple connections changes, I believe), and these had no effect in isolation.   The first thing I'm going to try is to replace our usage of SELECT-then-INSERT with a direct INSERT inside a try block to catch conflicts, as Andy Salnikov has long advised.  I think that's worth doing regardless of whether it's enough to fix the problem on its own.  If it isn't, I'd like to try it with write-ahead locking and starting transactions with BEGIN IMMEDIATE.  If I'm understanding how SQLite is doing this correctly (a huge "if"), that may do it.
          Hide
          jbosch Jim Bosch added a comment -

          Ready for review.

          Nate Lust, this is what you tested, just rebased on master (after the BrightObjectMask merge).

          Summary for posterity:

          • BEGIN IMMEDIATE for transactions in SQLite is definitely necessary; without that it's too easy to have multiple simultaneous "SHARED" locks that need to upgrade to "RESERVED/PENDING/EXCLUSIVE" before they release, and that leads to deadlocks.  With BEGIN IMMEDIATE, a "RESERVED" lock is acquired immediately, meaning that other write operations are blocked (but not deadlocked).  In the future we may want to work harder to minimize the amount of time spent in transactions to reduce that blocking (instead of using the transactional decorator liberally).
          • As promised above, I also switched (in several places) how we avoid uniqueness conflicts.  Prior to this ticket, we typically performed a SELECT first to look for an existing conflict, then followed that with an INSERT only if a duplicate was not found.  Now we typically try to INSERT first, catch DB-level constraint violations as Python exceptions, and then (if needed) follow that up with a SELECT to identify which constraint was violated.  It's not clear from our testing whether this was necessary to make it all work, or if BEGIN IMMEDIATE would have been enough on its own, but it should at least minimize the time spent in transactions in most cases, and it seems safer as well.

           

          Show
          jbosch Jim Bosch added a comment - Ready for review. Nate Lust , this is what you tested, just rebased on master (after the BrightObjectMask merge). Summary for posterity: BEGIN IMMEDIATE for transactions in SQLite is definitely necessary; without that it's too easy to have multiple simultaneous "SHARED" locks that need to upgrade to "RESERVED/PENDING/EXCLUSIVE" before they release, and that leads to deadlocks.  With BEGIN IMMEDIATE , a "RESERVED" lock is acquired immediately, meaning that other write operations are blocked (but not deadlocked).  In the future we may want to work harder to minimize the amount of time spent in transactions to reduce that blocking (instead of using the transactional decorator liberally). As promised above, I also switched (in several places) how we avoid uniqueness conflicts.  Prior to this ticket, we typically performed a SELECT first to look for an existing conflict, then followed that with an INSERT only if a duplicate was not found.  Now we typically try to INSERT first, catch DB-level constraint violations as Python exceptions, and then (if needed) follow that up with a SELECT to identify which constraint was violated.  It's not clear from our testing whether this was necessary to make it all work, or if BEGIN IMMEDIATE would have been enough on its own, but it should at least minimize the time spent in transactions in most cases, and it seems safer as well.  
          Hide
          jbosch Jim Bosch added a comment -

          Just added a branch for obs_base to deal with fallout from the exception type changes: https://github.com/lsst/obs_base/pull/130

          Show
          jbosch Jim Bosch added a comment - Just added a branch for obs_base to deal with fallout from the exception type changes: https://github.com/lsst/obs_base/pull/130

            People

            Assignee:
            jbosch Jim Bosch
            Reporter:
            jbosch Jim Bosch
            Reviewers:
            Nate Lust
            Watchers:
            Jim Bosch, Nate Lust
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Jenkins

                No builds found.