Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-29300

Lock wait timeout exceeded while updating replica info by the Qserv Replication Controller

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      The problem

      When the total number of replicas managed by the Qserv Replication System grows above thirteen limits the following problem is reported by the Master Replication Controller:

      % tail /qserv/qserv-prod/replication/log/qserv-replica-master-http.log
      ..
      2021-03-18T03:49:55.647Z  LWP 543   ERROR  DatabaseServicesMySQL::saveReplicaInfoCollection failed, exception: Connection[74]::execute(_inTransaction=1)  mysql_real_query failed, query: 'INSERT INTO `replica_file` VALUES (LAST_INSERT_ID(),'ForcedSource_10218.MYI',1024,1611601585,'',0,0)', error: Lock wait timeout exceeded; try restarting transaction, errno: 1205
      terminate called after throwing an instance of 'lsst::qserv::replica::database::mysql::Error'
        what():  Connection[74]::execute(_inTransaction=1)  mysql_real_query failed, query: 'INSERT INTO `replica_file` VALUES (LAST_INSERT_ID(),'ForcedSource_10218.MYI',1024,1611601585,'',0,0)', error: Lock wait timeout exceeded; try restarting transaction, errno: 1205
      /home/qserv/lsst: line 7:   456 Aborted                 (core dumped) $@`
      

      Reasons for the crash are understood - they're caused by multiple threads of the Master Controller updating records of the same table. The problem is explained in further details at MySQL document: https://dev.mysql.com/doc/mysql-errors/5.7/en/server-error-reference.html#error_er_lock_wait_timeout

      The default value of the parameter is:

      SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
      +--------------------------+-------+
      | Variable_name            | Value |
      +--------------------------+-------+
      | innodb_lock_wait_timeout | 50    |
      +--------------------------+-------+
      

      Proposed solutions to the problem

      Since it's not possible to completely eliminate a probability of this scenario then a solution should be twofold:

      • Increasing a value of the parameter innodb_lock_wait_timeout in the configuration of the Replication system's MariaDB service.
      • Improving the implementation of the MySQL API of the Replication system to allow automatic transaction restarts.

      Other options (in the longer run)

      Investigate the possibility of decreasing the overall duration of the replica update operations. One option here would be to revisit requirements for what (and how much data) needs to be stored in the database at each replica, redesign the schema of the relevant tables, and reduce the amount stored in the tables. For instance, the table replica_file that stores the full names of the MySQL table files could be re-normalized to replace the names with the "foreign keys" to file definitions.

        Attachments

          Issue Links

            Activity

            No builds found.
            gapon Igor Gaponenko created issue -
            gapon Igor Gaponenko made changes -
            Field Original Value New Value
            Link This issue is triggered by DM-29294 [ DM-29294 ]
            fritzm Fritz Mueller made changes -
            Epic Link DM-27790 [ 442306 ] DM-30610 [ 511885 ]
            fritzm Fritz Mueller made changes -
            Sprint DB_S21_12 [ 1065 ] DB_S21_12, DB_F21_06 [ 1065, 1103 ]

              People

              Assignee:
              gapon Igor Gaponenko
              Reporter:
              gapon Igor Gaponenko
              Watchers:
              Fritz Mueller, Igor Gaponenko, Nate Pease
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:

                  CI Builds

                  No builds found.