# Lost connection to MySQL server during query in Qserv Ingest service


#### Details

• Type: Bug
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels: None
• Story Points: 0.5
• Sprint: DB_S21_12
• Team: Data Access and Database
• Urgent?: No

# Problem

Catalog ingest workflows starting multiple simultaneous (super-)transactions may trigger the following error condition at Ingest workers (as reported by the ingest clients):

```
Error in method: POST url: http://localhost:25081/ingest/chunks http_code: 200
server_error: operation failed due to: Connection[117]::execute(_inTransaction=0) mysql_real_query failed,
query: 'SELECT id,database,state,begin_time,end_time FROM transaction WHERE id=10440 ORDER BY begin_time DESC',
error: Lost connection to MySQL server during query, errno: 2013
```

At the same time, the MySQL server reports:

```
2021-02-12 1:55:29 140119994169088 [Warning] Aborted connection 1270 to db: 'qservReplica'
user: 'qsreplica' host: 'lsst-qserv-db08.ncsa.illinois.edu' (Got timeout reading communication packets)
```

Other observations:

• The problem seems to affect only this specific query.
• The problem is reproducible.
• Once a client connection gets into this state, it never recovers, which points to a possible bug in the error handling of the Replication system's MySQL connection management.

Reportedly, this error code is typically seen when one of the timeout-controlling parameters of the MySQL server is set too low. These were the values of the relevant parameters captured when the errors were detected:

```
MariaDB [qservReplica]> SHOW VARIABLES LIKE '%timeout%';
+-----------------------------+----------+
| Variable_name               | Value    |
+-----------------------------+----------+
| connect_timeout             | 5        |
| deadlock_timeout_long       | 50000000 |
| deadlock_timeout_short      | 10000    |
| delayed_insert_timeout      | 300      |
| innodb_flush_log_at_timeout | 1        |
| innodb_lock_wait_timeout    | 50       |
| innodb_rollback_on_timeout  | OFF      |
| interactive_timeout         | 28800    |
| lock_wait_timeout           | 86400    |
| net_read_timeout            | 30       |
| net_write_timeout           | 60       |
| slave_net_timeout           | 60       |
| thread_pool_idle_timeout    | 60       |
| wait_timeout                | 28800    |
+-----------------------------+----------+
14 rows in set (0.01 sec)
```

It's been recommended to increase the values of the parameters `connect_timeout`, `net_read_timeout`, and `net_write_timeout`.
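To make the recommendation concrete, the captured values can be compared against the targets later used in the reconfiguration attempt (28800, 600, and 600 respectively). This is a minimal Python sketch using only numbers from this ticket; it does not query a real server:

```python
# Server timeouts captured when the errors were detected (from the
# SHOW VARIABLES output above), restricted to the three parameters
# that were recommended for an increase.
captured = {
    "connect_timeout": 5,
    "net_read_timeout": 30,
    "net_write_timeout": 60,
}

# Target values from the MariaDB startup flags tried below.
recommended = {
    "connect_timeout": 28800,
    "net_read_timeout": 600,
    "net_write_timeout": 600,
}

# Parameters whose captured value falls below the recommended one.
too_low = {name: (captured[name], recommended[name])
           for name in recommended
           if captured[name] < recommended[name]}

for name, (old, new) in sorted(too_low.items()):
    print(f"{name}: {old} -> {new}")
```

All three parameters are below their recommended values, which is why all three were raised in the next step.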

## Trying to reconfigure MySQL server

An attempt was made to increase the values of the parameters in the MariaDB container's startup sequence:

```
--connect_timeout=28800 \
--net_read_timeout=600 \
--net_write_timeout=600 \
```

Unfortunately, this didn't help.

## Study connection management and error reporting

Revisit error processing in the class lsst::qserv::replica::Connection to see whether lost connections are properly re-established after failures.

The code looks good.
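The behavior being checked for is a reconnect-and-retry on MySQL error 2013, rather than leaving the connection object in a permanently broken state. A minimal Python sketch of that pattern follows; `LostConnectionError`, `execute_with_retry`, and `fake_query` are hypothetical names for illustration, not part of the Replication system:

```python
class LostConnectionError(Exception):
    """Stand-in for MySQL error 2013 ('Lost connection to MySQL server during query')."""

def execute_with_retry(execute, reconnect, max_retries=3):
    """Run execute(); on a lost connection, call reconnect() and retry
    instead of leaving the session permanently dead."""
    for attempt in range(max_retries + 1):
        try:
            return execute()
        except LostConnectionError:
            if attempt == max_retries:
                raise
            reconnect()  # re-open the MySQL session before retrying

# Simulated flaky query: fails twice with a lost connection, then succeeds.
state = {"failures_left": 2, "reconnects": 0}

def fake_query():
    if state["failures_left"] > 0:
        state["failures_left"] -= 1
        raise LostConnectionError()
    return "row for transaction id=10440"

result = execute_with_retry(
    fake_query,
    lambda: state.__setitem__("reconnects", state["reconnects"] + 1))
print(result)  # succeeds after two reconnects
```

A connection that never recovers, as observed above, would correspond to the `reconnect` step being skipped or failing silently.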

## Resolution (the actual bug)

The bug was found in the recently added support for transaction contributions; it was rather trivial and obvious.

#### Activity

Igor Gaponenko added a comment -

This was a simple case. Most of the time was spent on testing the effect of the fix at scale while ingesting a relatively large catalog (1.5 TB of input data, 16k files, 480 parallel transactions, 30 workers).


#### People

Assignee:
Igor Gaponenko
Reporter:
Igor Gaponenko
Watchers:
Fritz Mueller, Igor Gaponenko, Nate Pease