Fix Version/s: None
When running ap_pipe on hits2015 with slurm, several of the tasks ran into a problem with accessing the association database with sqlite3. It says "sqlite3.OperationalError: database is locked."
After brief discussion with Chris Morrison [X], the problem likely stems from how the code creates temporary tables when querying for DIASources or DIAObjects. These temporary tables are joined into the main DIASource and DIAObject tables to enable queries. When run in parallel, however, it's possible for these temporary tables to be created and dropped in the wrong order.
This ticket is to implement a fix in ap_association.
- relates to
DM-14259 Try and document running ap_pipe on the Verification Cluster with SLURM
Meredith Rawls was able to rerun the complete HiTS, ap_verification dataset through the verification cluster without any failures. This was through a combination of increasing the allowed time before a time out on queries and switching from directly using the db cursor to using the sqlite3 connection as a context manager.
I'm no sqlite expert, but the use of the connection context manager here looks like a definite improvement.
One question is your use of CREATE inside the context: CREATE is a Data Defitinion Language statement, which means it will not be rolled back in the event of an exception (assuming I'm understanding this stackoverflow post correctly). I think you can make that happen by adding conn.execute('begin') at the start of all your database interactions, so that there is an explicit start point to the transaction.
Thanks John Parejko, I've added the BEGIN in front of all places where a table is created. I also changed some of the earlier CREATE TABLE statements to CREATE TABLE IF NOT EXISTS to get rid of the python if statement that checked if the tables were created previously.
Everything unittests and seems to work. I'm having Meredith Rawls rerun the HiTS analysis again to make sure that things run to their conclusion. After a successful run I will merge the ticket.
Started work. Turns out that the temporary tables are not exactly the problem as "database locked" failures when Meredith Rawls runs 6 nodes with 84 jobs on the lsst-dev verify cluster. I have increased the time out of the database connection from the default 5 seconds to 60 and additionally simplified the commits around temporary tables for the joins in querying DIASources/Objects with diaObjectId and pixelId respectively.