# Preparing catalog KPM50 for the performance/scalability testing of Qserv


#### Details

• Type: Story
• Status: In Progress
• Resolution: Unresolved
• Fix Version/s: None
• Component/s: None
• Labels: None
• Sprint: DB_S21_12, DB_F21_06
• Team: Data Access and Database
• Urgent?: No

# Objectives

This effort documents the additional (but required) data management actions to be taken for the previously ingested catalog KPM50 before proceeding with the tests.

## Creating table-level indexes at workers

This is needed because the current implementation of the Qserv Replication/Ingest system does not automatically create any table-level indexes, either at super-transaction commit time or at catalog publishing time. The reasons for this are explained in the documentation for the Replication/Ingest system. The indexes are still required in order to achieve reasonable query performance and to set the ground for a fair comparison of Qserv's performance against the earlier (smaller scale) KPM tests (KPM30, KPM20 and KPM10).

As a minimum, the following indexes on the chunked tables need to be created:

| table        | overlap | column   | unique |
|--------------|---------|----------|--------|
| Object       | no      | objectId | yes    |
| Object       | no      | psRa     | no     |
| Object       | no      | psDecl   | no     |
| Source       | no      | objectId | no     |
| Source       | no      | psRa     | no     |
| Source       | no      | psDecl   | no     |
| ForcedSource | no      | objectId | no     |
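The requirements table above maps one-to-one onto request bodies for the Replication system's index-creation REST service shown later in this ticket. As a sketch (the helper names here are hypothetical, not part of the actual tooling), the payloads can be generated programmatically rather than written by hand:

```python
# Sketch (hypothetical helper, not the actual tooling used in this ticket):
# generate JSON payloads for the index-creation REST service from the
# requirements table above. The payload layout mirrors the examples below.
import json

# (table, column, unique) triples from the requirements table; overlap is 0 for all.
REQUIRED_INDEXES = [
    ("Object", "objectId", True),
    ("Object", "psRa", False),
    ("Object", "psDecl", False),
    ("Source", "objectId", False),
    ("Source", "psRa", False),
    ("Source", "psDecl", False),
    ("ForcedSource", "objectId", False),
]

def index_spec(database, table, column, unique):
    """Build one request body for POST /replication/sql/index."""
    return {
        "database": database,
        "table": table,
        "overlap": 0,
        "index": f"{table.lower()}_idx_{column}",
        "spec": "UNIQUE" if unique else "DEFAULT",
        "comment": f"Index on {table}.{column}",
        "columns": [{"column": column, "length": 0, "ascending": 1}],
        "auth_key": "",
    }

specs = [index_spec("kpm50", t, c, u) for t, c, u in REQUIRED_INDEXES]
print(json.dumps(specs[0], indent=2))
```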

## Increasing the minimum replication level to 2

Initially, the catalog was ingested with replication level 1. This poses the following problems:

• There is a risk of permanent partial loss of valuable data created in the course of a substantial (multi-week) effort, due to filesystem failures (data corruption, etc.) or the loss of a whole worker. Should this happen, a significant effort would be required to regenerate the lost data and re-ingest them into the catalog. This would be problematic because the current implementation of the Qserv Ingest system still does not support an easy way of extending or patching catalogs. That is supposed to be solved by DM-28626.
• Interruptions of the testing due to the temporary loss of single workers. Some workers may become unavailable due to intermittent network problems, or because of Qserv worker crashes, which are still possible under heavy loads (where the probability of crashes has been found to be non-negligible). This would be especially painful should it happen during very long tests taking many hours or days to complete.

Increasing the replication level to 2 would allow losing (temporarily or permanently) a single worker without losing any valuable data or interrupting the tests (though with some penalty to the performance of the ongoing tests). The Replication system would handle a permanent data loss at a worker by creating extra replicas of the chunks located at that worker, and Qserv would automatically redirect ongoing queries to the backup replicas of the chunks that happened to be on the lost worker.

Evaluate storage conditions at workers to allow further increasing the number of replicas to 3.

## Other notes

All operations were initiated from the master node lsst-qserv-master01 of the "large" Qserv cluster at NCSA.

#### Activity

Igor Gaponenko added a comment (edited)

# Creating indexes on table Object

The primary index creation operation:

```shell
curl 'http://localhost:25081/replication/sql/index' -X POST -H "Content-Type: application/json" \
  -d@index_Object_objectId.json -oindex_Object_objectId.out >& index_Object_objectId.log &
```

Where the configuration file index_Object_objectId.json has:

```json
{
  "database": "kpm50",
  "table": "Object",
  "overlap": 0,
  "index": "object_idx_objectId",
  "spec": "UNIQUE",
  "comment": "The unique index on the object identifiers",
  "columns": [
    {"column": "objectId", "length": 0, "ascending": 1}
  ],
  "auth_key": ""
}
```

The run time of the operation was ~1 hr.
Running multiple index creation operations in parallel (each over 150,000 tables) would not reap any performance benefits due to the limited size of the worker pool that processes the replication system's requests.
The two remaining indexes on the table were created similarly:

psRa:

```json
{
  "database": "kpm50",
  "table": "Object",
  "overlap": 0,
  "index": "object_idx_psRa",
  "spec": "DEFAULT",
  "comment": "The non-unique index on the object's spatial coordinate",
  "columns": [
    {"column": "psRa", "length": 0, "ascending": 1}
  ],
  "auth_key": ""
}
```

psDecl:

```json
{
  "database": "kpm50",
  "table": "Object",
  "overlap": 0,
  "index": "object_idx_psDecl",
  "spec": "DEFAULT",
  "comment": "The non-unique index on the object's spatial coordinate",
  "columns": [
    {"column": "psDecl", "length": 0, "ascending": 1}
  ],
  "auth_key": ""
}
```

Igor Gaponenko added a comment (edited)

# Creating indexes on table Source

objectId:

```json
{
  "database": "kpm50",
  "table": "Source",
  "overlap": 0,
  "index": "source_idx_objectId",
  "spec": "DEFAULT",
  "comment": "The non-unique index on the object identifiers",
  "columns": [
    {"column": "objectId", "length": 0, "ascending": 1}
  ],
  "auth_key": ""
}
```

psRa:

```json
{
  "database": "kpm50",
  "table": "Source",
  "overlap": 0,
  "index": "source_idx_psRa",
  "spec": "DEFAULT",
  "comment": "The non-unique index on the source's spatial coordinate",
  "columns": [
    {"column": "psRa", "length": 0, "ascending": 1}
  ],
  "auth_key": ""
}
```

psDecl:

```json
{
  "database": "kpm50",
  "table": "Source",
  "overlap": 0,
  "index": "source_idx_psDecl",
  "spec": "DEFAULT",
  "comment": "The non-unique index on the source's spatial coordinate",
  "columns": [
    {"column": "psDecl", "length": 0, "ascending": 1}
  ],
  "auth_key": ""
}
```

The run time of each operation: 1h22m.

Igor Gaponenko added a comment

# Creating indexes on table ForcedSource

objectId:

```json
{
  "database": "kpm50",
  "table": "ForcedSource",
  "overlap": 0,
  "index": "forcedsource_idx_objectId",
  "spec": "DEFAULT",
  "comment": "The non-unique index on the object identifiers",
  "columns": [
    {"column": "objectId", "length": 0, "ascending": 1}
  ],
  "auth_key": ""
}
```

Igor Gaponenko added a comment

# Issues encountered during the index creation

## Crashes of the Master Replication Controller

The following ticket has been registered to address this issue: DM-29300.
The crashes happened a few times, with the same error reported in the log file:

```
% tail /qserv/qserv-prod/replication/log/qserv-replica-master-http.log
..
2021-03-18T03:49:55.647Z LWP 543 ERROR DatabaseServicesMySQL::saveReplicaInfoCollection failed, exception: Connection[74]::execute(_inTransaction=1) mysql_real_query failed, query: 'INSERT INTO replica_file VALUES (LAST_INSERT_ID(),'ForcedSource_10218.MYI',1024,1611601585,'',0,0)', error: Lock wait timeout exceeded; try restarting transaction, errno: 1205
terminate called after throwing an instance of 'lsst::qserv::replica::database::mysql::Error'
  what(): Connection[74]::execute(_inTransaction=1) mysql_real_query failed, query: 'INSERT INTO replica_file VALUES (LAST_INSERT_ID(),'ForcedSource_10218.MYI',1024,1611601585,'',0,0)', error: Lock wait timeout exceeded; try restarting transaction, errno: 1205
/home/qserv/lsst: line 7:   456 Aborted                 (core dumped) $@
```

The reasons for the crash are understood: it is caused by multiple threads of the Master Controller updating records of the same table. The problem is explained in further detail in the MySQL documentation: https://dev.mysql.com/doc/mysql-errors/5.7/en/server-error-reference.html#error_er_lock_wait_timeout

Possible solutions for the problem would be:

• Increase the value of the parameter innodb_lock_wait_timeout in the configuration of the Replication system's MariaDB service.
• Improve the implementation of the MySQL API of the Replication system to allow automatic transaction restarts.

The current value of the parameter is:

```sql
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 50    |
+--------------------------+-------+
```

That seems to be too low for cases when the Controller has to update the status of many thousands of replica entries in the database due to changes in the index files (sizes, timestamps, and checksums).
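The "automatic transaction restart" option mentioned above can be sketched as a retry loop around the transaction: on errno 1205 the transaction is rolled back and re-attempted after a short backoff. This is only an illustration of the idea; the connection API below is hypothetical, not the Replication system's actual MySQL wrapper.

```python
# Sketch of an automatic transaction restart on lock wait timeouts.
# The connection object and MySQLError class are hypothetical stand-ins.
import random
import time

ER_LOCK_WAIT_TIMEOUT = 1205  # MySQL errno seen in the log above

class MySQLError(Exception):
    def __init__(self, errno, msg=""):
        super().__init__(msg)
        self.errno = errno

def execute_in_transaction(conn, queries, max_retries=5):
    """Run queries in one transaction, restarting it on lock wait timeouts."""
    for attempt in range(max_retries):
        try:
            conn.begin()
            for q in queries:
                conn.execute(q)
            conn.commit()
            return
        except MySQLError as e:
            conn.rollback()
            if e.errno != ER_LOCK_WAIT_TIMEOUT:
                raise  # only lock wait timeouts are retryable
            # Exponential backoff with jitter before restarting the transaction.
            time.sleep(min(2 ** attempt, 10) * random.uniform(0.5, 1.0) * 0.01)
    raise MySQLError(ER_LOCK_WAIT_TIMEOUT, f"gave up after {max_retries} retries")
```

Raising innodb_lock_wait_timeout only postpones the failure; a retry loop like this makes the Controller resilient regardless of the server-side setting.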
## The "false-positive" error reported by the index creation service

The following ticket has been registered to address this issue: DM-29293.
The problem was seen when retrying/resuming index creation after the above-reported crashes of the Controller:

```json
"Source_99895": {
  "request_error": "Connection[693]::execute(_inTransaction=1) mysql_real_query failed, query: 'CREATE INDEX source_idx_psRa ON kpm50.Source_99895 (psRa ASC) COMMENT 'The non-unique index on the source\\'s spatial coordinate'', error: Duplicate key name 'source_idx_psRa', errno: 1061",
  "request_status": "EXT_STATUS_DUPLICATE_KEY"
}
```

Though these types of errors are harmless, they mislead a client into believing there was a real problem with the operation. The JIRA ticket DM-29293 was registered to address the problem for future uses of the service.
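Until the service itself is fixed, a client-side workaround is to treat the duplicate-key status as "index already exists" when a request is retried. A minimal sketch, assuming a result dictionary shaped like the snippet above (the function name is hypothetical):

```python
# Sketch: filter out benign duplicate-key "errors" from a retried
# index-creation request. EXT_STATUS_DUPLICATE_KEY means the index was
# already created by an earlier attempt, so it is not a real failure.
BENIGN_STATUSES = {"EXT_STATUS_DUPLICATE_KEY"}

def real_failures(results):
    """Return only the per-table entries that represent genuine failures."""
    return {
        table: info
        for table, info in results.items()
        if info.get("request_status") not in BENIGN_STATUSES
    }
```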
Igor Gaponenko added a comment (edited)

# Increasing the minimum replication level to 2

The first step of the operation was performed with the Master Replication Controller and the Replication workers brought down. That was needed to reconfigure the "database family" parameters of the catalog kpm50. The catalog was originally ingested into the family "layout_340_3", which hosts many other catalogs, some of which carry quite a bit of data. Bumping the replication level of that family would result in an unnecessary increase of the disk space used by those other catalogs, which are still kept for various testing and Qserv validation purposes. A new family named kpm was therefore created exclusively for kpm50. It has the same partitioning parameters as the original family "layout_340_3":

```sql
SELECT * FROM qservReplica.config_database_family WHERE name IN ('kpm','layout_340_3');
+--------------+-----------------------+-------------+-----------------+---------+
| name         | min_replication_level | num_stripes | num_sub_stripes | overlap |
+--------------+-----------------------+-------------+-----------------+---------+
| kpm          |                     2 |         340 |               3 | 0.01667 |
| layout_340_3 |                     1 |         340 |               3 | 0.01667 |
+--------------+-----------------------+-------------+-----------------+---------+
```

The database was moved to the new family using:

```sql
SELECT * FROM qservReplica.config_database WHERE database='kpm50';
+----------+--------------+--------------+--------------+------------------+
| database | family_name  | is_published | chunk_id_key | sub_chunk_id_key |
+----------+--------------+--------------+--------------+------------------+
| kpm50    | layout_340_3 |            1 | chunkId      | subChunkId       |
+----------+--------------+--------------+--------------+------------------+

UPDATE qservReplica.config_database SET family_name='kpm' WHERE database='kpm50';

SELECT * FROM qservReplica.config_database WHERE database='kpm50';
+----------+-------------+--------------+--------------+------------------+
| database | family_name | is_published | chunk_id_key | sub_chunk_id_key |
+----------+-------------+--------------+--------------+------------------+
| kpm50    | kpm         |            1 | chunkId      | subChunkId       |
+----------+-------------+--------------+--------------+------------------+
```

After that, the Master Controller and the Replication workers were restarted normally. The Controller's replication thread will eventually pick up the new family and begin adding replicas to meet the required level (there is a periodic interval of about 20 minutes for checking whether additional replicas need to be created in the catalogs).

Consider adding a REST service that would implement the above-described reconfiguration of databases, moving them between families without needing to restart the Controller. A ticket should be made for that.
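The manual reconfiguration above reduces to a pair of SQL statements against the Replication system's configuration database; such a service could generate them from two parameters. A minimal sketch (table and column names taken from the session above; the helper itself is hypothetical):

```python
# Sketch: generate the SQL used above to move a database between families.
# Table/column names come from the session shown above; the helper is
# hypothetical and does not exist in the Replication system.

def move_database_to_family(database, family):
    """Return the (verification, update) SQL statements for a family move."""
    check = (
        "SELECT * FROM qservReplica.config_database "
        f"WHERE `database`='{database}';"
    )
    update = (
        "UPDATE qservReplica.config_database "
        f"SET family_name='{family}' WHERE `database`='{database}';"
    )
    return check, update

check_sql, update_sql = move_database_to_family("kpm50", "kpm")
print(update_sql)
```

A real service would also have to validate that the target family exists and has identical partitioning parameters, since moving a published catalog between incompatible families would corrupt it.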

The current replication status (shortly after starting the Controller):

An updated status snapshot, taken approximately 2 hours and 10 minutes later, shows the progress of the operation:

According to the plot, the total run time of the operation should be about 4 hours.

Here is the I/O monitoring on the Qserv data filesystem at worker-db02.

```
% iostat -m 10 | grep sdb
Device:      tps  MB_read/s  MB_wrtn/s  MB_read  MB_wrtn
sdb        46.95       3.77       4.07  7023482  7575838
sdb      2385.20     121.85     349.18     1218     3491
sdb      2475.90     177.58     261.85     1775     2618
sdb      2706.00     237.18     198.83     2371     1988
sdb      2624.70     280.63      85.69     2806      856
sdb      2920.40     318.99      90.16     3189      901
sdb      2734.70     275.83     127.23     2758     1272
sdb      2989.10     344.70      56.71     3446      567
sdb      3261.10     393.05      27.65     3930      276
sdb      3188.10     345.07     104.77     3450     1047
sdb      2938.50     350.37      31.90     3503      318
sdb       368.60      28.19      35.50      281      355
sdb       956.30      87.79      61.35      877      613
sdb       822.40     102.64       0.00     1026        0
sdb      1260.70     147.58      19.60     1475      196
sdb      1509.70     173.60      27.40     1736      274
sdb      3292.00     410.82       0.00     4108        0
sdb      3607.20     444.20      11.78     4441      117
sdb      1690.40     135.67     143.50     1356     1435
sdb      2547.80     105.77     423.66     1057     4236
sdb      2478.40     131.94     352.93     1319     3529
sdb      3159.80     276.24     228.23     2762     2282
sdb      2896.00     281.87     157.91     2818     1579
sdb      2940.20     260.06     211.28     2600     2112
sdb      2308.50      60.11     446.28      601     4462
sdb      2220.90      63.74     425.43      637     4254
sdb      2299.00      91.07     388.40      910     3883
sdb      2422.20     132.33     338.80     1323     3388
sdb      2875.20     259.43     193.98     2594     1939
sdb      2765.00     218.15     252.81     2181     2528
sdb      2770.70     273.87     140.66     2738     1406
sdb      2911.60     296.53     125.75     2965     1257
```
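Samples like the above can be summarized into average throughput with a few lines of parsing. Note that the first line of iostat output reports cumulative statistics since boot and should be skipped when averaging the 10-second intervals. A small sketch (using a truncated copy of the samples above):

```python
# Sketch: average the read/write throughput over iostat 10-second samples.
# Each sample line has the form:
#   sdb  <tps>  <MB_read/s>  <MB_wrtn/s>  <MB_read>  <MB_wrtn>
# The first iostat report (since boot) is assumed to be excluded already.
SAMPLE = """\
sdb 2385.20 121.85 349.18 1218 3491
sdb 2475.90 177.58 261.85 1775 2618
sdb 2706.00 237.18 198.83 2371 1988
"""

def iostat_averages(text):
    """Return (avg MB_read/s, avg MB_wrtn/s) over the sampled intervals."""
    reads, writes = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0] == "sdb":
            reads.append(float(fields[2]))   # MB_read/s column
            writes.append(float(fields[3]))  # MB_wrtn/s column
    n = len(reads)
    return sum(reads) / n, sum(writes) / n

avg_read, avg_write = iostat_averages(SAMPLE)
```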

The operation finished successfully about 4 hrs 25 min after it started.

The snapshot above shows that there are exactly 2 replicas per chunk.
The remaining free storage at Qserv workers is roughly 50% (using qserv-db02 as an example):

```
% df -h /qserv/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        15T  7.3T  7.4T  50% /qserv
```

This leaves the possibility of further increasing the replication level of the catalog if needed.
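A back-of-the-envelope check, assuming the bulk of the 7.3 TB usage on qserv-db02 is the two kpm50 replica sets and that chunks stay evenly distributed: one replica set is about 3.65 TB per worker, so a third replica would bring usage to roughly 11 TB of the 15 TB filesystem (~73%), which is tight but feasible.

```python
# Rough per-worker storage estimate for going to replication level 3,
# using the df numbers from qserv-db02 above. Assumes the 7.3 TB used is
# dominated by the two existing kpm50 replica sets.
total_tb = 15.0
used_tb_2_replicas = 7.3

per_replica_tb = used_tb_2_replicas / 2      # one replica set per worker
used_tb_3_replicas = per_replica_tb * 3      # projected usage with 3 replicas
utilization = used_tb_3_replicas / total_tb  # projected fill fraction
print(f"{used_tb_3_replicas:.2f} TB used, {utilization:.0%} full")
```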


#### People

Assignee:
Igor Gaponenko
Reporter:
Igor Gaponenko
Reviewers:
John Gates
Watchers:
Fritz Mueller, Igor Gaponenko, John Gates, Nate Pease