Data Management / DM-28812

Lost connection to MySQL server during query in Qserv Ingest service


    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels: None

      Description

      Problem

      Catalog ingest workflows starting multiple simultaneous (super-)transactions may trigger the following error condition at Ingest workers (as reported by the ingest clients):

      Error in method: POST url: http://localhost:25081/ingest/chunks
      http_code: 200 server_error: operation failed due to: Connection[117]::execute(_inTransaction=0)  mysql_real_query failed,
      query: 'SELECT `id`,`database`,`state`,`begin_time`,`end_time` FROM `transaction` WHERE `id`=10440 ORDER BY begin_time DESC',
      error: Lost connection to MySQL server during query, errno: 2013
      

      At the same time, the MySQL server reports:

      2021-02-12  1:55:29 140119994169088 [Warning] Aborted connection 1270 to db: 'qservReplica' user: 'qsreplica' host: 'lsst-qserv-db08.ncsa.illinois.edu' (Got timeout reading communication packets)
      

      Other observations:

      • The problem seems to affect only this specific query.
      • The problem is reproducible.
      • Once a client connection gets into this state, it never recovers (which points to a possible bug in the error handling of the Replication system's MySQL connection management).

      Reportedly, this error code is typically seen when one of the timeout-controlling parameters of the MySQL server is set too low. These were the values of the relevant parameters captured when the errors were detected:

      MariaDB [qservReplica]> SHOW VARIABLES LIKE '%timeout%';
      +-----------------------------+----------+
      | Variable_name               | Value    |
      +-----------------------------+----------+
      | connect_timeout             | 5        |
      | deadlock_timeout_long       | 50000000 |
      | deadlock_timeout_short      | 10000    |
      | delayed_insert_timeout      | 300      |
      | innodb_flush_log_at_timeout | 1        |
      | innodb_lock_wait_timeout    | 50       |
      | innodb_rollback_on_timeout  | OFF      |
      | interactive_timeout         | 28800    |
      | lock_wait_timeout           | 86400    |
      | net_read_timeout            | 30       |
      | net_write_timeout           | 60       |
      | slave_net_timeout           | 60       |
      | thread_pool_idle_timeout    | 60       |
      | wait_timeout                | 28800    |
      +-----------------------------+----------+
      14 rows in set (0.01 sec)
      

      It has been recommended to increase the values of the parameters connect_timeout, net_read_timeout, and net_write_timeout.
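      As a quick sanity check, the captured values can be compared against higher floor values. The sketch below (Python; the threshold numbers are illustrative assumptions based on the values tried in this ticket, not official MySQL/MariaDB guidance) flags which of the three recommended parameters look too low:

```python
# Illustrative minimums based on the values attempted for this ticket;
# these thresholds are assumptions, not official MySQL/MariaDB guidance.
RECOMMENDED_MINIMUMS = {
    "connect_timeout": 600,
    "net_read_timeout": 600,
    "net_write_timeout": 600,
}

def low_timeouts(variables):
    """Return names of timeout variables set below the assumed minimum."""
    return sorted(
        name for name, minimum in RECOMMENDED_MINIMUMS.items()
        if int(variables.get(name, 0)) < minimum
    )

# With the values captured in the SHOW VARIABLES output above
# (connect_timeout=5, net_read_timeout=30, net_write_timeout=60),
# all three parameters are flagged.
```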

      Trying to reconfigure MySQL server

      An attempt was made to increase the values of these parameters in the MariaDB container's startup sequence:

        --connect_timeout=28800 \
        --net_read_timeout=600 \
        --net_write_timeout=600 \
      

      Unfortunately, this didn't help.

      Study connection management and error reporting

      Revisit error processing in the class lsst::qserv::replica::Connection to verify that lost connections are properly re-established after failures.

      The code looks good.
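      For context, the reconnect-on-failure behavior the review checked for can be sketched as follows (Python; the connection interface here is hypothetical, standing in for the actual C++ class lsst::qserv::replica::Connection; errno 2013 is the MySQL client code for "Lost connection to MySQL server during query"):

```python
# Sketch of the retry pattern expected in a connection's execute path:
# if the server drops the connection mid-query (MySQL client errno 2013),
# reconnect and retry instead of leaving the connection object dead.
CR_SERVER_LOST = 2013  # MySQL client error: lost connection during query

class LostConnection(Exception):
    """Stand-in for a MySQL client error carrying the errno."""
    def __init__(self, errno):
        super().__init__(errno)
        self.errno = errno

def execute_with_retry(conn, query, max_retries=1):
    """Run `query`; on a lost connection, reopen and retry up to
    `max_retries` times before propagating the error."""
    attempts = 0
    while True:
        try:
            return conn.execute(query)
        except LostConnection as exc:
            if exc.errno != CR_SERVER_LOST or attempts >= max_retries:
                raise
            attempts += 1
            conn.reconnect()
```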

      Resolution (the actual bug)

      The bug was found in the recently added support for transaction contributions; it was rather trivial and obvious.

        Attachments

          Activity

          gapon Igor Gaponenko added a comment -

          This was a simple case. Most of the time was spent on testing the effect of the fix at scale while ingesting a (relatively) large catalog (1.5 TB of input data, 16k files, 480 parallel transactions, 30 workers).

            People

            Assignee:
            gapon Igor Gaponenko
            Reporter:
            gapon Igor Gaponenko
            Watchers:
            Fritz Mueller, Igor Gaponenko, Nate Pease
            Votes:
            0

              Dates

              Created:
              Updated:
              Resolved:

                CI Builds

                No builds found.