Data Management / DM-29294

Preparing catalog KPM50 for the performance/scalability testing of Qserv


    Details

    • Type: Story
    • Status: In Progress
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      Objectives

      This effort is meant to document extra (but required) data management actions to be taken for the earlier ingested catalog KPM50 before proceeding with the tests.

      Creating table-level indexes at workers

      This is needed because the current implementation of the Qserv Replication/Ingest system does not automatically create any table-level indexes, either at super-transaction commit time or at catalog publishing time. The reasons for this are explained in the documentation for the Replication/Ingest system. The indexes are still required in order to achieve reasonable performance for the queries and to set the ground for a fair comparison of Qserv's performance against the earlier (smaller-scale) KPM tests (KPM30, KPM20 and KPM10).

      As a minimum, the following indexes on the chunked tables need to be created:

      table         overlap  column    unique
      ------------  -------  --------  ------
      Object        no       objectId  yes
      Object        no       psRa      no
      Object        no       psDecl    no
      Source        no       objectId  no
      Source        no       psRa      no
      Source        no       psDecl    no
      ForcedSource  no       objectId  no
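      For reference, each entry above corresponds to a per-chunk DDL statement executed at every worker by the Replication system's index service. Below is a minimal sketch of what the statement for the first entry would look like on a single chunk table, issued via the mysql client (the chunk number 12345 is made up for illustration, and connection options are omitted):

      % mysql -e "CREATE UNIQUE INDEX \`object_idx_objectId\` ON \`kpm50\`.\`Object_12345\` (\`objectId\` ASC) COMMENT 'The unique index on the object identifiers'"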

      Increasing the minimum replication level to 2

      Initially, the catalog was ingested with replication level 1. This poses the following problems:

      • There is a risk of a permanent partial loss of the valuable data created in the course of a substantial (multi-week) effort due to filesystem failures (data corruption, etc.) or the loss of whole workers. Should this happen, a significant effort would be required to regenerate the lost data and re-ingest it into the catalog. This would be problematic since the current implementation of the Qserv Ingest system still doesn't support an easy way of extending or patching catalogs. This is supposed to be solved by DM-28626.
      • Interruptions of the testing due to the temporary loss of single workers. Some workers may not be available due to intermittent network problems, or because of Qserv worker crashes, which are still possible under heavy loads (the probability of such crashes has been found to be non-negligible). This would be especially painful should it happen while running very long tests taking many hours or days to complete.

      Increasing the replication level to 2 would allow losing (temporarily or permanently) a single worker w/o losing any valuable data or interrupting the tests (though with some penalty to the performance of the ongoing tests). The Replication system would take care of a permanent data loss at a worker by creating extra replicas of the chunks located at that worker, and Qserv would automatically redirect ongoing queries to the backup replicas of the chunks that happened to be on the lost worker.

      Evaluate storage conditions at workers to allow further increasing the number of replicas to 3.

      Other notes

      All operations were initiated from the master node lsst-qserv-master01 of the "large" Qserv cluster at NCSA.

            Activity

            gapon Igor Gaponenko added a comment (edited)

            Creating indexes on table Object

            The primary index creation operation:

            curl 'http://localhost:25081/replication/sql/index' -X POST -H "Content-Type: application/json" \
              -d@index_Object_objectId.json -oindex_Object_objectId.out >& index_Object_objectId.log&
            

            Where the configuration file index_Object_objectId.json has:

            {"database":"kpm50",
             "table":"Object",
             "overlap":0,
             "index":"object_idx_objectId",
             "spec":"UNIQUE",
             "comment":"The unique index on the object identifiers",
             "columns":[
               {"column":"objectId",
                "length":0,
                "ascending":1
               }
             ],
             "auth_key":""
            }
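            The service writes its JSON response into the file passed to curl's -o option. A quick way to scan that file for per-table failures (a sketch; it assumes only that failed entries carry a "request_error" field, as in the error example shown further down in this ticket) is:

            % grep -o '"request_error":"[^"]*"' index_Object_objectId.out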
            

            The run time of the operation was ~1 hr.
            Running multiple index creation operations in parallel (each spanning over 150,000 tables) wouldn't reap any benefit in performance due to the limited size of the worker pool that processes the Replication system's requests.
            The two remaining indexes on the table were created similarly:

            column index request
            psRa

            {"database":"kpm50",
             "table":"Object",
             "overlap":0,
             "index":"object_idx_psRa",
             "spec":"DEFAULT",
             "comment":"The non-unique index on the object's spatial coordinate",
             "columns":[
               {"column":"psRa",
                "length":0,
                "ascending":1
               }
             ],
             "auth_key":""
            }

            psDecl

            {"database":"kpm50",
             "table":"Object",
             "overlap":0,
             "index":"object_idx_psDecl",
             "spec":"DEFAULT",
             "comment":"The non-unique index on the object's spatial coordinate",
             "columns":[
               {"column":"psDecl",
                "length":0,
                "ascending":1
               }
             ],
             "auth_key":""
            }

            gapon Igor Gaponenko added a comment (edited)

            Creating indexes on table Source

            column index request
            objectId

            {"database":"kpm50",
             "table":"Source",
             "overlap":0,
             "index":"source_idx_objectId",
             "spec":"DEFAULT",
             "comment":"The non-unique index on the object identifiers",
             "columns":[
               {"column":"objectId",
                "length":0,
                "ascending":1
               }
             ],
             "auth_key":""
            }

            psRa

            {"database":"kpm50",
             "table":"Source",
             "overlap":0,
             "index":"source_idx_psRa",
             "spec":"DEFAULT",
             "comment":"The non-unique index on the source's spatial coordinate",
             "columns":[
               {"column":"psRa",
                "length":0,
                "ascending":1
               }
             ],
             "auth_key":""
            }

            psDecl

            {"database":"kpm50",
             "table":"Source",
             "overlap":0,
             "index":"source_idx_psDecl",
             "spec":"DEFAULT",
             "comment":"The non-unique index on the source's spatial coordinate",
             "columns":[
               {"column":"psDecl",
                "length":0,
                "ascending":1
               }
             ],
             "auth_key":""
            }

            The run time of each operation: 1h22m.

            gapon Igor Gaponenko added a comment

            Creating indexes on table ForcedSource

            column index request
            objectId

             "table":"ForcedSource",
             "overlap":0,
             "index":"forcedsource_idx_objectId",
             "spec":"DEFAULT",
             "comment":"The non-unique index on the object identifiers",
             "columns":[
               {"column":"objectId",
                "length":0,
                "ascending":1
               }
             ],
             "auth_key":""
            }

            gapon Igor Gaponenko added a comment

            Issues encountered during the index creation

            Crashes of the Master Replication Controller

            The following ticket has been registered to address this issue: DM-29300.
            The crashes happened a few times, with the same error reported in the log file:

            % tail /qserv/qserv-prod/replication/log/qserv-replica-master-http.log
            ..
            2021-03-18T03:49:55.647Z  LWP 543   ERROR  DatabaseServicesMySQL::saveReplicaInfoCollection failed, exception: Connection[74]::execute(_inTransaction=1)  mysql_real_query failed, query: 'INSERT INTO `replica_file` VALUES (LAST_INSERT_ID(),'ForcedSource_10218.MYI',1024,1611601585,'',0,0)', error: Lock wait timeout exceeded; try restarting transaction, errno: 1205
            terminate called after throwing an instance of 'lsst::qserv::replica::database::mysql::Error'
              what():  Connection[74]::execute(_inTransaction=1)  mysql_real_query failed, query: 'INSERT INTO `replica_file` VALUES (LAST_INSERT_ID(),'ForcedSource_10218.MYI',1024,1611601585,'',0,0)', error: Lock wait timeout exceeded; try restarting transaction, errno: 1205
            /home/qserv/lsst: line 7:   456 Aborted                 (core dumped) $@`
            

            The reasons for the crash are understood: it is caused by multiple threads of the Master Controller updating records of the same table. The problem is explained in further detail in the MySQL documentation: https://dev.mysql.com/doc/mysql-errors/5.7/en/server-error-reference.html#error_er_lock_wait_timeout
            Possible solutions for the problem would be:

            • Increase the value of the parameter innodb_lock_wait_timeout in the configuration of the Replication system's MariaDB service (a sketch is shown after this list).
            • Improve the implementation of the MySQL API of the Replication system to allow automatic transaction restarts.
              The current value of the parameter is:

              SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
              +--------------------------+-------+
              | Variable_name            | Value |
              +--------------------------+-------+
              | innodb_lock_wait_timeout | 50    |
              +--------------------------+-------+
              

              That seems to be too low for cases when the Controller has to update the status of many thousands of replica entries in the database due to changes in the index files (sizes, timestamps, and checksums).
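            For the first option above, a minimal sketch of raising the timeout on the Replication system's MariaDB service without a restart is shown below. The value of 600 seconds is an illustrative choice, not a tested recommendation, and making it permanent would also require setting the same parameter in the service's configuration file:

            % mysql -e "SET GLOBAL innodb_lock_wait_timeout = 600"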

            The "false-positive" error reported in the index creation service

            The following ticket has been registered to address this issue: DM-29293.
            The problem was seen when retrying/resuming index creation after the above-reported crashes of the Controller:

            "Source_99895":{"request_error":"Connection[693]::execute(_inTransaction=1) mysql_real_query failed, query:
            'CREATE  INDEX `source_idx_psRa` ON `kpm50`.`Source_99895` (`psRa` ASC) COMMENT 'The non-unique index on the source\\'s spatial coordinate'',
            error: Duplicate key name 'source_idx_psRa',
            errno: 1061",
            "request_status":"EXT_STATUS_DUPLICATE_KEY"}
            

            Though these errors are harmless, they mislead a client into believing there was a real problem with the operation. JIRA ticket DM-29293 was registered to address the problem for future uses of the service.
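            Until DM-29293 is resolved, a client can confirm that such a reported duplicate-key failure is indeed benign by checking whether the index already exists on the affected chunk table. A sketch, run against the MySQL service of the worker hosting the chunk (the table and index names are taken from the error message above):

            % mysql -e "SELECT COUNT(*) FROM information_schema.statistics WHERE table_schema='kpm50' AND table_name='Source_99895' AND index_name='source_idx_psRa'"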

            gapon Igor Gaponenko added a comment (edited)

            Increasing the minimum replication level to 2

            The first step of the operation was performed with the Master Replication Controller and the Replication workers brought down. That was needed to reconfigure the "database family" parameters of the catalog kpm50. The catalog was ingested into the family "layout_340_3", which hosts many other catalogs, some of which carry quite a bit of data. Bumping the replication level of that family would result in an unnecessary increase of the disk space used by those catalogs, which are still kept for various testing and Qserv validation purposes. A new family named kpm was therefore created exclusively for kpm50. The family has the same partitioning parameters as the original family "layout_340_3":

            SELECT * FROM qservReplica.config_database_family WHERE name IN ('kpm','layout_340_3');
            +--------------+-----------------------+-------------+-----------------+---------+
            | name         | min_replication_level | num_stripes | num_sub_stripes | overlap |
            +--------------+-----------------------+-------------+-----------------+---------+
            | kpm          |                     2 |         340 |               3 | 0.01667 |
            | layout_340_3 |                     1 |         340 |               3 | 0.01667 |
            +--------------+-----------------------+-------------+-----------------+---------+
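            The ticket does not record the exact command used to create the kpm family row itself. Assuming it was inserted directly into the Replication system's database (in the same way the family reassignment below was done), the statement would have looked roughly like:

            % mysql -e "INSERT INTO qservReplica.config_database_family VALUES ('kpm',2,340,3,0.01667)"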
            

            The database was moved to the new family using:

            SELECT * FROM qservReplica.config_database WHERE `database`='kpm50';
            +----------+--------------+--------------+--------------+------------------+
            | database | family_name  | is_published | chunk_id_key | sub_chunk_id_key |
            +----------+--------------+--------------+--------------+------------------+
            | kpm50    | layout_340_3 |            1 | chunkId      | subChunkId       |
            +----------+--------------+--------------+--------------+------------------+
             
            UPDATE qservReplica.config_database SET family_name='kpm' WHERE `database`='kpm50';
             
            SELECT * FROM qservReplica.config_database WHERE `database`='kpm50';
            +----------+-------------+--------------+--------------+------------------+
            | database | family_name | is_published | chunk_id_key | sub_chunk_id_key |
            +----------+-------------+--------------+--------------+------------------+
            | kpm50    | kpm         |            1 | chunkId      | subChunkId       |
            +----------+-------------+--------------+--------------+------------------+
            

            After that, the Master Controller and the Replication workers were restarted normally. The Controller's replication thread will eventually pick up the new family and begin adding replicas to meet the required level (the thread checks roughly every 20 minutes whether any catalog needs additional replicas).

            Consider adding a REST service that would implement the above-described reconfiguration of databases by moving them between families without needing to restart the Controller. Make a ticket for that.

            The current replication status (shortly after starting the Controller):

            An updated status snapshot, taken approximately 2 hours and 10 minutes later, shows the progress of the operation:

            According to the plot, the total run time of the operation was expected to be about 4 hours.

            Here is the I/O monitoring of the Qserv data filesystem at worker-db02 (iostat sampling at 10-second intervals; the first sdb line is the since-boot average):

            % iostat -m 10 | grep sdb
            Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
            sdb              46.95         3.77         4.07    7023482    7575838
            sdb            2385.20       121.85       349.18       1218       3491
            sdb            2475.90       177.58       261.85       1775       2618
            sdb            2706.00       237.18       198.83       2371       1988
            sdb            2624.70       280.63        85.69       2806        856
            sdb            2920.40       318.99        90.16       3189        901
            sdb            2734.70       275.83       127.23       2758       1272
            sdb            2989.10       344.70        56.71       3446        567
            sdb            3261.10       393.05        27.65       3930        276
            sdb            3188.10       345.07       104.77       3450       1047
            sdb            2938.50       350.37        31.90       3503        318
            sdb             368.60        28.19        35.50        281        355
            sdb             956.30        87.79        61.35        877        613
            sdb             822.40       102.64         0.00       1026          0
            sdb            1260.70       147.58        19.60       1475        196
            sdb            1509.70       173.60        27.40       1736        274
            sdb            3292.00       410.82         0.00       4108          0
            sdb            3607.20       444.20        11.78       4441        117
            sdb            1690.40       135.67       143.50       1356       1435
            sdb            2547.80       105.77       423.66       1057       4236
            sdb            2478.40       131.94       352.93       1319       3529
            sdb            3159.80       276.24       228.23       2762       2282
            sdb            2896.00       281.87       157.91       2818       1579
            sdb            2940.20       260.06       211.28       2600       2112
            sdb            2308.50        60.11       446.28        601       4462
            sdb            2220.90        63.74       425.43        637       4254
            sdb            2299.00        91.07       388.40        910       3883
            sdb            2422.20       132.33       338.80       1323       3388
            sdb            2875.20       259.43       193.98       2594       1939
            sdb            2765.00       218.15       252.81       2181       2528
            sdb            2770.70       273.87       140.66       2738       1406
            sdb            2911.60       296.53       125.75       2965       1257
            

            The operation finished successfully about 4 hrs 25 min after it started.

            The snapshot above shows that there are exactly 2 replicas per chunk.
            The remaining free storage at Qserv workers is roughly 50% (using qserv-db02 as an example):

            % df -h /qserv/
            Filesystem      Size  Used Avail Use% Mounted on
            /dev/sdb1        15T  7.3T  7.4T  50% /qserv
            

            This leaves room for a further increase of the catalog's replication level if needed.
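            As a rough back-of-the-envelope check (assuming most of the 7.3T used on this worker belongs to kpm50 at two replicas per chunk), a third replica would add on the order of 3.5T per worker, bringing usage to roughly 11T out of 15T, i.e. about 70-75% full.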


              People

              Assignee:
              gapon Igor Gaponenko
              Reporter:
              gapon Igor Gaponenko
              Reviewers:
              John Gates
              Watchers:
              Fritz Mueller, Igor Gaponenko, John Gates, Nate Pease