Data Management / DM-28047

Master Replication Controller crashes when updating replica info in the database


    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      The problem

      The Master Replication Controller may crash when updating the persistent state of replicas in the database. The crash is reported in the controller's log file as:

      2020-10-28T07:13:12.431Z  LWP 516   DEBUG  JOB     c42569e7-21d3-40ae-9aee-5e7487d0e309  FIND_ALL  IN_PROGRESS::NONE  _onRequestFinish  database=wise_2band_00 worker=db24 _numLaunched=390 _numFinished=71 _numSuccess=71
      terminate called after throwing an instance of 'lsst::qserv::replica::database::mysql::Error'
        what():  Connection[127]::execute(_inTransaction=1)  mysql_real_query failed, query: 'INSERT INTO `replica_file` VALUES (LAST_INSERT_ID(),'gaia_sourceFullOverlap_10252.MYD',6569632,1603867865,'',0,0)', error: Lock wait timeout exceeded; try restarting transaction, errno: 1205
      /home/qserv/lsst: line 7:   438 Aborted                 (core dumped) $@
      

      It turns out the crash may happen when two or more parallel instances of the job FindReplicasJob are updating the persistent state of the replicas. The chances of running into this situation are high when the Controller is processing requests of the catalog ingest workflows while the Replication system's replica scanner job is updating replica info in the database.

      Reproducing the problem

      Run at least two instances of the following job (insignificant details are omitted):

      
      
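      The original snippet did not survive the page export, but the contention it describes can be illustrated with a small stand-alone C++ analogy. This is not the Qserv API: the timed mutex plays the role of InnoDB row locks on the replica tables, and the sleep stands in for the DELETE/INSERT work done inside the transaction.

      ```cpp
      #include <chrono>
      #include <mutex>
      #include <thread>

      using namespace std::chrono_literals;

      // Stand-in for the row locks on the replica tables.
      std::timed_mutex replicaRows;
      bool sawLockTimeout = false;

      // Each "job" opens a transaction, takes the row locks while rewriting
      // replica info, then commits. When two jobs scan overlapping rows, the
      // second one can exceed its lock-wait budget -- the analogue of MySQL
      // error 1205 "Lock wait timeout exceeded".
      void findReplicasJob(std::chrono::milliseconds holdFor,
                           std::chrono::milliseconds waitBudget) {
          if (!replicaRows.try_lock_for(waitBudget)) {
              sawLockTimeout = true;  // the real code throws mysql::Error here
              return;
          }
          std::this_thread::sleep_for(holdFor);  // DELETE + INSERT work
          replicaRows.unlock();
      }
      ```

      Running two such jobs concurrently on overlapping rows reproduces the timeout whenever one transaction holds the locks longer than the other is willing to wait.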

      The proposed solution

      Harden the implementation of the DatabaseServices class to recognize this exception and retry the failed operations.
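      The retry approach can be sketched as a small wrapper that re-runs a transaction body when MySQL reports a transient conflict. This is a minimal illustration, not the actual DatabaseServices code: the MySqlError type, the error numbers checked (1205 for lock-wait timeout, 1213 for deadlock), and the retry budget are all assumptions.

      ```cpp
      #include <stdexcept>
      #include <string>

      // Hypothetical stand-in for lsst::qserv::replica::database::mysql::Error.
      struct MySqlError : std::runtime_error {
          int errnum;
          MySqlError(std::string const& msg, int e)
              : std::runtime_error(msg), errnum(e) {}
      };

      // Re-run the transaction body when MySQL reports a transient conflict:
      // 1205 (ER_LOCK_WAIT_TIMEOUT) or 1213 (ER_LOCK_DEADLOCK). Any other
      // error, or exhausting the retry budget, propagates to the caller.
      template <typename F>
      auto retryTransaction(F&& body, int maxRetries = 3) {
          for (int attempt = 0;; ++attempt) {
              try {
                  return body();
              } catch (MySqlError const& ex) {
                  bool const transient = ex.errnum == 1205 || ex.errnum == 1213;
                  if (!transient || attempt >= maxRetries) throw;
                  // A real implementation would also roll back and back off
                  // (ideally with jitter) before retrying.
              }
          }
      }
      ```

      A caller would wrap the whole SELECT/DELETE/INSERT sequence in one body, so each retry restarts the transaction from the beginning, as the MySQL error message itself suggests.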


          Activity

          gapon Igor Gaponenko added a comment - PR: https://github.com/lsst/qserv/pull/592
          salnikov Andy Salnikov added a comment -

          Looks OK, though I'm not sure this is the most reliable way to fix deadlocks. Retrying the same operation risks running into the same deadlock again in the future. It may be worth trying to understand why the deadlocks happen and fixing them at the source.

          gapon Igor Gaponenko added a comment - edited

          Andy Salnikov thank you for the review and for the comments! There are two InnoDB tables that are queried and/or updated from two (or more) concurrent transactions. The tables are linked by a PK/FK relationship. Each thread starts a transaction and performs the same sequence of steps on different subsets of rows from the tables:

          1. SELECT from the main table (the one that has the PK) and the dependent rows from the dependent table
          2. analyze the payload from the tables and cross-check it with what's reported from the workers
          3. (optionally) DELETE stale rows from both tables
          4. (optionally) INSERT new/updated rows into both tables

          It's not clear to me why MySQL runs into problems with this algorithm. If I knew, I would certainly experiment with the algorithm to find a solution that doesn't require restarting a transaction.
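          One common source of such conflicts in InnoDB is lock ordering: two transactions that each need locks on both tables can acquire them in different orders (for example via gap/next-key locks taken during the SELECT), ending up each waiting for the other. Whether that is the actual mechanism here would need `SHOW ENGINE INNODB STATUS` output to confirm, but the general hazard and its classic fix can be sketched with mutexes standing in for the row locks:

          ```cpp
          #include <mutex>
          #include <thread>
          #include <vector>

          // Stand-ins for the locks on the main (PK) and dependent (FK) tables.
          std::mutex mainTable;
          std::mutex dependentTable;
          int updates = 0;

          // If one thread locked mainTable then dependentTable while another
          // did the reverse, they could deadlock. std::scoped_lock acquires
          // both with a deadlock-avoidance algorithm, so the acquisition
          // order no longer matters.
          void updateReplicaInfo() {
              std::scoped_lock lock(mainTable, dependentTable);
              ++updates;  // SELECT / DELETE / INSERT would happen here
          }
          ```

          The SQL analogue would be making every transaction touch the two tables in the same order (or taking all needed row locks up front with SELECT ... FOR UPDATE), which removes the cycle instead of retrying around it.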


            People

            Assignee:
            gapon Igor Gaponenko
            Reporter:
            gapon Igor Gaponenko
            Reviewers:
            Andy Salnikov
            Watchers:
            Andy Salnikov, Fritz Mueller, Igor Gaponenko, John Gates, Nate Pease

