Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-15114

assoc_db_sqlite sometimes encounters "database is locked"

    XMLWordPrintable

    Details

      Description

      When running ap_pipe on hits2015 with slurm, several of the tasks ran into a problem with accessing the association database with sqlite3. It says "sqlite3.OperationalError: database is locked."

      After brief discussion with Chris Morrison [X], the problem likely stems from how the code creates temporary tables when querying for DIASources or DIAObjects. These temporary tables are joined into the main DIASource and DIAObject tables to enable queries. When run in parallel, however, it's possible for these temporary tables to be created and dropped in the wrong order.

      This ticket is to implement a fix in ap_association.

        Attachments

          Issue Links

            Activity

            Hide
            cmorrison Chris Morrison [X] (Inactive) added a comment - - edited

            Started work. Turns out that the temporary tables are not exactly the problem as "database locked" failures when Meredith Rawls runs 6 nodes with 84 jobs on the lsst-dev verify cluster. I have increased the time out of the database connection from the default 5 seconds to 60 and additionally simplified the commits around temporary tables for the joins in querying DIASources/Objects with diaObjectId and pixelId respectively.

            Show
            cmorrison Chris Morrison [X] (Inactive) added a comment - - edited Started work. Turns out that the temporary tables are not exactly the problem as "database locked" failures when Meredith Rawls runs 6 nodes with 84 jobs on the lsst-dev verify cluster. I have increased the time out of the database connection from the default 5 seconds to 60 and additionally simplified the commits around temporary tables for the joins in querying DIASources/Objects with diaObjectId and pixelId respectively.
            Hide
            cmorrison Chris Morrison [X] (Inactive) added a comment -

            Meredith Rawls was able to rerun the complete HiTS, ap_verification dataset through the verification cluster without any failures. This was through a combination of increasing the allowed time before a time out on queries and switching from directly using the db cursor to using the sqlite3 connection as a context manager.

            Show
            cmorrison Chris Morrison [X] (Inactive) added a comment - Meredith Rawls was able to rerun the complete HiTS, ap_verification dataset through the verification cluster without any failures. This was through a combination of increasing the allowed time before a time out on queries and switching from directly using the db cursor to using the sqlite3 connection as a context manager.
            Hide
            Parejkoj John Parejko added a comment -

            I'm no sqlite expert, but the use of the connection context manager here looks like a definite improvement.

            One question is your use of CREATE inside the context: CREATE is a Data Defitinion Language statement, which means it will not be rolled back in the event of an exception (assuming I'm understanding this stackoverflow post correctly). I think you can make that happen by adding conn.execute('begin') at the start of all your database interactions, so that there is an explicit start point to the transaction.

            Show
            Parejkoj John Parejko added a comment - I'm no sqlite expert, but the use of the connection context manager here looks like a definite improvement. One question is your use of CREATE inside the context: CREATE is a Data Defitinion Language statement, which means it will not be rolled back in the event of an exception (assuming I'm understanding this stackoverflow post correctly). I think you can make that happen by adding conn.execute('begin') at the start of all your database interactions, so that there is an explicit start point to the transaction.
            Hide
            cmorrison Chris Morrison [X] (Inactive) added a comment -

            Thanks John Parejko, I've added the BEGIN in front of all places where a table is created. I also changed some of the earlier CREATE TABLE statements to CREATE TABLE IF NOT EXISTS to get rid of the python if statement that checked if the tables were created previously.

            Everything unittests and seems to work. I'm having Meredith Rawls rerun the HiTS analysis again to make sure that things run to their conclusion. After a successful run I will merge the ticket.

            Show
            cmorrison Chris Morrison [X] (Inactive) added a comment - Thanks John Parejko , I've added the BEGIN in front of all places where a table is created. I also changed some of the earlier CREATE TABLE statements to CREATE TABLE IF NOT EXISTS to get rid of the python if statement that checked if the tables were created previously. Everything unittests and seems to work. I'm having Meredith Rawls rerun the HiTS analysis again to make sure that things run to their conclusion. After a successful run I will merge the ticket.

              People

              Assignee:
              cmorrison Chris Morrison [X] (Inactive)
              Reporter:
              mrawls Meredith Rawls
              Reviewers:
              John Parejko
              Watchers:
              Chris Morrison [X] (Inactive), John Parejko, Meredith Rawls
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.