Status: To Do
Fix Version/s: None
Team:Data Access and Database
When the total number of replicas managed by the Qserv Replication System grows above thirteen limits the following problem is reported by the Master Replication Controller:
Reasons for the crash are understood - they're caused by multiple threads of the Master Controller updating records of the same table. The problem is explained in further details at MySQL document: https://dev.mysql.com/doc/mysql-errors/5.7/en/server-error-reference.html#error_er_lock_wait_timeout
The default value of the parameter is:
Since it's not possible to completely eliminate a probability of this scenario then a solution should be twofold:
- Increasing a value of the parameter innodb_lock_wait_timeout in the configuration of the Replication system's MariaDB service.
- Improving the implementation of the MySQL API of the Replication system to allow automatic transaction restarts.
Investigate the possibility of decreasing the overall duration of the replica update operations. One option here would be to revisit requirements for what (and how much data) needs to be stored in the database at each replica, redesign the schema of the relevant tables, and reduce the amount stored in the tables. For instance, the table replica_file that stores the full names of the MySQL table files could be re-normalized to replace the names with the "foreign keys" to file definitions.
|Sprint||DB_S21_12 [ 1065 ]||DB_S21_12, DB_F21_06 [ 1065, 1103 ]|