Data Management / DM-29293

Improved error processing in the table index creation operation

    Details

    • Type: Improvement
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      Objectives

      This effort was triggered by DM-29294.

      The current implementation of the Qserv Replication/Ingest system does not automatically create table indexes at workers during ingest. Index creation is designed as a separate step that catalog ingest workflows perform after publishing the newly ingested catalogs. There are many reasons (not discussed in this ticket) why this separation is necessary. The technical documentation on this subject can be found at https://confluence.lsstcorp.org/display/DM/Managing+indexes
      Specifically, a dedicated REST service is responsible for the index creation operation.

      The operation is highly parallel. It may involve many thousands of chunk tables (in many cases up to 150,000), of which thousands are processed simultaneously across multiple workers in the highly distributed environment of Qserv. Although the chance of failing to create an index on any individual table is low, it is not negligible, so at this scale there is a real risk that the operation as a whole will not succeed. Failures may happen for various reasons, such as resource constraints, workers' database servers going down, or the Master Replication Controller going down. To mitigate this risk, the index creation service was designed to be run multiple times until the operation converges. Each time the service is invoked, it instructs the Replication system's framework to go over the full set of existing chunks and attempt to create the same index. This creates a subtle problem for tables in which the desired index was already created during a previous run. The worker services recognize this situation and return the corresponding error code for each context (worker/database/table) where it happens. Here is an example of the errors reported in this case:

      "Source_99895":{"request_error":"Connection[693]::execute(_inTransaction=1) mysql_real_query failed, query:
      'CREATE  INDEX `source_idx_psRa` ON `kpm50`.`Source_99895` (`psRa` ASC) COMMENT 'The non-unique index on the source\\'s spatial coordinate'',
      error: Duplicate key name 'source_idx_psRa',
      errno: 1061",
      "request_status":"EXT_STATUS_DUPLICATE_KEY"}
      

      If any worker reports such a response, the REST service returns a "false-negative" response to the client that initiated the request: the operation is reported as failed even though the desired index already exists on the affected tables.
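
      The run-until-convergence workflow described above can be sketched from the client's side as follows. This is only an illustration under stated assumptions: `run_until_converged` and `invoke_service` are hypothetical names, and `invoke_service` stands in for whatever call a workflow makes to the REST service, returning True when the service reports overall success. With the current behavior, a loop like this would presumably keep seeing failures once any table already has the index, which is the problem this ticket addresses.

```python
import time
from typing import Callable


def run_until_converged(invoke_service: Callable[[], bool],
                        max_attempts: int = 5,
                        delay_sec: float = 0.0) -> int:
    """Call the (hypothetical) index-creation request until it succeeds.

    Returns the number of attempts that were needed.
    Raises RuntimeError if no attempt succeeded within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        if invoke_service():
            return attempt
        # Optionally wait before the next attempt, but not after the last one.
        if delay_sec > 0 and attempt < max_attempts:
            time.sleep(delay_sec)
    raise RuntimeError(
        f"index creation did not converge after {max_attempts} attempts")
```

      The attempt limit and the delay are illustrative choices, not parameters of the actual service.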

      A goal of this development is to improve the error handling in the REST service by adding the optional flag:

      "retry_allowed":<number>
      

      The flag would be specified in the request's body along with the other parameters of the request. If the flag is present in the request, the service implementation will ignore the above-mentioned EXT_STATUS_DUPLICATE_KEY error code when evaluating workers' responses to the table-specific requests.
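
      The proposed evaluation rule can be sketched as below. This is a minimal illustration, not the service's implementation: the function name `operation_succeeded` and the `SUCCESS` status string are assumptions made for the example, and the per-table layout (table name mapped to an object with "request_status") simply mirrors the error sample shown earlier. Following the description, only the presence of the flag matters here; how its numeric value is interpreted is not specified in this ticket.

```python
from typing import Dict, Optional

DUPLICATE_KEY = "EXT_STATUS_DUPLICATE_KEY"  # status code quoted in the ticket
SUCCESS = "SUCCESS"  # hypothetical name for a successful table-level status


def operation_succeeded(table_results: Dict[str, Dict[str, str]],
                        retry_allowed: Optional[int] = None) -> bool:
    """Decide whether the index creation succeeded as a whole.

    table_results maps a table name to its worker response.
    When retry_allowed is given, duplicate-key errors are treated as
    "index already exists" and do not fail the operation.
    """
    ignore_duplicates = retry_allowed is not None
    for result in table_results.values():
        status = result.get("request_status")
        if status == SUCCESS:
            continue
        if ignore_duplicates and status == DUPLICATE_KEY:
            continue
        # Any other status is a genuine failure.
        return False
    return True
```

      For example, a result set that contains only successes and duplicate-key errors evaluates to a failure without the flag and to a success with it, while any other error still fails the operation in both cases.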


              People

              Assignee:
              gapon Igor Gaponenko
              Reporter:
              gapon Igor Gaponenko
              Watchers:
              Fritz Mueller, Igor Gaponenko, Nate Pease