# Master Replication Controller crashes when updating replica info in the database

XMLWordPrintable

#### Details

• Type: Bug
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
None
• Story Points:
3
• Epic Link:
• Sprint:
DB_S21_12
• Team:
Data Access and Database
• Urgent?:
No

# The problem

Master Replication Controller may crash whene updating the persistent state of replicas in the database. This is reported in the controller's log file as:

 2020-10-28T07:13:12.431Z LWP 516 DEBUG JOB c42569e7-21d3-40ae-9aee-5e7487d0e309 FIND_ALL IN_PROGRESS::NONE _onRequestFinish database=wise_2band_00 worker=db24 _numLaunched=390 _numFinished=71 _numSuccess=71 terminate called after throwing an instance of 'lsst::qserv::replica::database::mysql::Error'  what(): Connection[127]::execute(_inTransaction=1) mysql_real_query failed, query: 'INSERT INTO replica_file VALUES (LAST_INSERT_ID(),'gaia_sourceFullOverlap_10252.MYD',6569632,1603867865,'',0,0)', error: Lock wait timeout exceeded; try restarting transaction, errno: 1205 /home/qserv/lsst: line 7: 438 Aborted (core dumped) \$@ 

It turns out the crash may happen if two or more parallel instances of job FindReplicasJob are updating the persistent state of the replicas. Chances of running into this situation are high when the Controller is processing requests of the catalog ingest workflows while the Replication system's replica scanner job is updating replica info in the database.

## Reproducing the problem

Run at least two instances of the following job (insignificant details are ommited):

  

## The proposed solution

Reinforce an implementation of the DatabaseServices class to recognize this exception and repeat failed operations.

#### Activity

Hide
Igor Gaponenko added a comment -
Show
Igor Gaponenko added a comment - PR: https://github.com/lsst/qserv/pull/592
Hide
Andy Salnikov added a comment -

Looks OK, though I'm not sure that this is most reliable way to fix deadlocks. Retrying the same thing again has a chance that you will run into the same deadlock again sometime in the future. It may be worth to try to understand why deadlocks happen and fix that at the source.

Show
Andy Salnikov added a comment - Looks OK, though I'm not sure that this is most reliable way to fix deadlocks. Retrying the same thing again has a chance that you will run into the same deadlock again sometime in the future. It may be worth to try to understand why deadlocks happen and fix that at the source.
Hide
Igor Gaponenko added a comment - - edited

Andy Salnikov thank you for the review and for the comments! There are two InnoDB tables that are queried and/or updated from two (or many) transactions. The tables are linked by the PK/FK relationship. Each thread would start a transaction and do the same sequence of steps on different subsets of entries from the tables:

1. SELECT from the main table (the one that has PK) and the dependent rows from the dependant table
2. analyze the payload from the tables and cross-check it with what's reported from the workers
3. (optionally) DELETE stale rows from both tables
4. (optionally) INSERT new/updated rows into both tables

It's not clear to me why MySQL runs into problems with this algorithm. If I knew it then I might certainly try experimenting with the algorithm to come up with a solution that won't require restarting a transaction.

Show
Igor Gaponenko added a comment - - edited Andy Salnikov thank you for the review and for the comments! There are two InnoDB tables that are queried and/or updated from two (or many) transactions. The tables are linked by the PK/FK relationship. Each thread would start a transaction and do the same sequence of steps on different subsets of entries from the tables: SELECT from the main table (the one that has PK) and the dependent rows from the dependant table analyze the payload from the tables and cross-check it with what's reported from the workers (optionally) DELETE stale rows from both tables (optionally) INSERT new/updated rows into both tables It's not clear to me why MySQL runs into problems with this algorithm. If I knew it then I might certainly try experimenting with the algorithm to come up with a solution that won't require restarting a transaction.

#### People

Assignee:
Igor Gaponenko
Reporter:
Igor Gaponenko
Reviewers:
Andy Salnikov
Watchers:
Andy Salnikov, Fritz Mueller, Igor Gaponenko, John Gates, Nate Pease
Votes:
0 Vote for this issue
Watchers:
5 Start watching this issue

#### Dates

Created:
Updated:
Resolved: