Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-34464

Qserv query processor should retain error codes/conditions for failed queries

    XMLWordPrintable

    Details

    • Type: Improvement
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      The problem

      The current implementation of the query processor in Qserv czar won't retain any specific error info (but the general FAILED or ABORTED status) on the failed queries in the czar's database. The conditions are only logged into the logging stream (LSST Logger) and/or reported directly to users initiating queries. This model complicates further analysis and tracking of the failures by the Qserv data administrators and alike. This ticket is meant to address this problem.

      In the current implementation of Qserv errors are recorded in the temporary tables qservResult.message_<message-id> which are immediately deleted after reporting error conditions to users.

      Two improvements to the error processing logic are proposed below.

      Extended status codes

      In addition to the above-mentioned status codes FAILED or ABORTED, add specific codes for the following error conditions:

      • PARSER_ERROR: for queries that weren't succesfully parsed
      • LARGE_RESULT: for queries that failed due to exceeding the result set limit set in the Qserv configuration
      • COMM_ERROR: for queries failed due to persistent (non-recoverable) communication errors with the workers
      • WORKER_ERROR: for queries failed due to error conditions reported by the workers

      More error codes for the well-defined conditions could be added if needed.

      Error messages

      Extend table schema of qservMeta.QInfo by adding the text collumn error_message that will store a copy of the error messages from the tables qservResult.message_<message-id>.

      As an alternative option, investigate the possibility of keeping replacing temporary tables qservResult.message_<message-id> with the permanent ones.

      Other improvements

      It would be nice to retain:

      • the spatial coverage of the queries (the number of chunks involved)
      • the number of rows in a result set
      • the number of bytes in the result set

      The Web Dashboard will be extended to display the additional info. Igor Gaponenko will work on that.

        Attachments

          Issue Links

            Activity

            Hide
            gapon Igor Gaponenko added a comment -

            Few comments.

            1. I have to a conclusion that given the way the timestamps are made in the code, they are totally useless in the table. It's impossible to draw any conclusions on the timing of the events based on these timestamps, and subsequently, to estimate the "root cause" of the problems by analyzing the timing of the messages. So, I suggest eliminating this column from the table.
            2. Regarding the statement that "...Adding "WARN" would require changes to smig...", this is not the case because the change will only change the new table QMessages which has not been published yet. This table only exists in this PR. So, any schema changes are (still) possible right now w/o any need to bump the schema version to the next number (above 5). Besides, I would suggest changing the type of this column from ENUM to VARCHAR(32) or something. This won't significantly change the size of the table. The main benefit of storing strings here will be the forward compatibility of the schema in case if more status will be added here.
            3. Regarding CANCELTOTAL, etc. If you insist on leaving this row in the table then why not to store the counters in this column as JSON object? Something along these lines:

              |  274430 | CANCELTOTAL |  -1 |  NULL | {"cancelled_count":1035,"completed_count":123}  | INFO     | 1652760000 |
              

              I really hate parsing arbitrary text strings when doing various reports from the database tables. Note that this table is mostly meant to be used by us (not by users) for purposes of (presumably - automated) error analysis.

            Show
            gapon Igor Gaponenko added a comment - Few comments. I have to a conclusion that given the way the timestamps are made in the code, they are totally useless in the table. It's impossible to draw any conclusions on the timing of the events based on these timestamps, and subsequently, to estimate the "root cause" of the problems by analyzing the timing of the messages. So, I suggest eliminating this column from the table. Regarding the statement that "...Adding "WARN" would require changes to smig...", this is not the case because the change will only change the new table QMessages which has not been published yet. This table only exists in this PR. So, any schema changes are (still) possible right now w/o any need to bump the schema version to the next number (above 5 ). Besides, I would suggest changing the type of this column from ENUM to VARCHAR(32) or something. This won't significantly change the size of the table. The main benefit of storing strings here will be the forward compatibility of the schema in case if more status will be added here. Regarding CANCELTOTAL , etc. If you insist on leaving this row in the table then why not to store the counters in this column as JSON object? Something along these lines: | 274430 | CANCELTOTAL | -1 | NULL | { "cancelled_count" :1035, "completed_count" :123} | INFO | 1652760000 | I really hate parsing arbitrary text strings when doing various reports from the database tables. Note that this table is mostly meant to be used by us (not by users) for purposes of (presumably - automated) error analysis.
            Hide
            gapon Igor Gaponenko added a comment -

            As you can see, the direction I'm pushing this development to is to come up with a way to design the message table allowing to store well-structured information on the problems rather than on some arbitrary "dump yard" for whatever happens in Czar. Further steps of improving error reporting and implementing error analysis tools HIGHLY depend on the structure of this table.

            Show
            gapon Igor Gaponenko added a comment - As you can see, the direction I'm pushing this development to is to come up with a way to design the message table allowing to store well-structured information on the problems rather than on some arbitrary "dump yard" for whatever happens in Czar. Further steps of improving error reporting and implementing error analysis tools HIGHLY depend on the structure of this table.
            Hide
            jgates John Gates added a comment - - edited

            Turning everything that's could possibly go wrong into a well structured tablesis really difficult if you're going to cover all the ways things could go wrong. Your test cases are different than mine, so you're seeing different things.

            Lots of these errors bubble up from the workers and are somewhat limited by what the workers can provide through the existing protocol. Condensing that information down to WORKER and giving a count is not useful nor easy to categorize, so the description there is very helpful.

            While not ideal, you can do an "order by" on the timestamp and see what order things happened pretty easily.

            That being said, I have made some changes and have some other comments:

            • I've put a limit on all msgSource type occurrences that can be put in the QMessage table. This should keep things like "COM" errors from eating up huge amounts of space as  there could be tens of thousands of them. - It will provide a count of how many "COM" messages were put in MessageStore.
              • Also, if all previous messages were INFO, and one shows up with "ERROR" it will add the ERROR as its own message. Using COM as an example, it should generate a couple of COM rows with description for the INFO errors and start counting. After a few hundred it gets a COM error, so it adds an entry for that. The QMessages table should now have a couple of rows of COM info messages, a row happening later with a COM error, and finally a total of all COM messages.
              • Adding a WARN level just makes all of this harder and I just don't see the difference between WARN and INFO being important. I don't really see why CANCEL is a WARN and not just INFO, as CANCEL is a normal system operation. ERRORs triggers things to happen that predate the changes in this ticket. See the constants.h MessageSeverity. 
                • I can't really picture anyone doing a SQL search of QMessage for WARN. I expect people to search for a specific ERROR using the description (quite possibly using LIKE) and then get all the messages for the query( or queries) that produced the error.
            • chunkId in QMessage can now be null.
            • I can change to json style for that message to json.
            • This is too much work for one ticket
              • It's possible to define other specific `msgSource` items that should only be collated. New tickets should be created to make those changes in   QMetaMysql::addQueryMessages and QMetaMysql::_addQueryMessage. For example, you could locate the result limit exceeded message, give it a specific source, say "LARGE RESULT LIMIT OF 5 GB EXCEEDED", and count those up for a message. I would not do this, however, as the maximum message size is set in a configuration file, and it would be nice to know what the limit actually is.
              • To improve error message(s) provided to the user, a new ticket should be made to modify the creation of "MULTIERROR" in `Executive::updateProxyMessages`

            Example output of worker error (with msgSource limit set at 1):

            | queryId | msgSource   | chunkId | code | message                                                                                                                                                                                                                                                                                                                         | severity | timestamp  |
            +---------+-------------+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------+
            |       1 | MERGE       |    6631 | 1210 | MERGE_ERROR 1470 (Error(s) in result for chunk #6631: [1064] primeResult error You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'cSELECT `qservTest_case01_qserv.Object`.`objectId` AS `objectId`,`qservTest_c...' at line 1) 1652830964 | ERROR    | 1652830000 |
            |       1 | CANCELTOTAL |      -1 |    0 | CANCEL_count=7 EXECFAIL_count=4 COMPLETE_count=0                                                                                                                                                                                                                                                                                | INFO     | 1652830000 |
            |       1 | MANY_MERGE  |      -1 |    0 | Too many of msgSource MERGE to log all. Total=2                                                                                                                                                                                                                                                                                   | ERROR    | 1652830000 |
            
            

             

            Show
            jgates John Gates added a comment - - edited Turning everything that's could possibly go wrong into a well structured tablesis really difficult if you're going to cover all the ways things could go wrong. Your test cases are different than mine, so you're seeing different things. Lots of these errors bubble up from the workers and are somewhat limited by what the workers can provide through the existing protocol. Condensing that information down to WORKER and giving a count is not useful nor easy to categorize, so the description there is very helpful. While not ideal, you can do an "order by" on the timestamp and see what order things happened pretty easily. That being said, I have made some changes and have some other comments: I've put a limit on all msgSource type occurrences that can be put in the QMessage table. This should keep things like "COM" errors from eating up huge amounts of space as  there could be tens of thousands of them. - It will provide a count of how many "COM" messages were put in MessageStore. Also, if all previous messages were INFO, and one shows up with "ERROR" it will add the ERROR as its own message. Using COM as an example, it should generate a couple of COM rows with description for the INFO errors and start counting. After a few hundred it gets a COM error, so it adds an entry for that. The QMessages table should now have a couple of rows of COM info messages, a row happening later with a COM error, and finally a total of all COM messages. Adding a WARN level just makes all of this harder and I just don't see the difference between WARN and INFO being important. I don't really see why CANCEL is a WARN and not just INFO, as CANCEL is a normal system operation. ERRORs triggers things to happen that predate the changes in this ticket. See the constants.h MessageSeverity.  I can't really picture anyone doing a SQL search of QMessage for WARN. I expect people to search for a specific ERROR using the description (quite possibly using LIKE) and then get all the messages for the query( or queries) that produced the error. chunkId in QMessage can now be null. I can change to json style for that message to json. This is too much work for one ticket It's possible to define other specific `msgSource` items that should only be collated. New tickets should be created to make those changes in   QMetaMysql::addQueryMessages and QMetaMysql::_addQueryMessage. For example, you could locate the result limit exceeded message, give it a specific source, say "LARGE RESULT LIMIT OF 5 GB EXCEEDED", and count those up for a message. I would not do this, however, as the maximum message size is set in a configuration file, and it would be nice to know what the limit actually is. To improve error message(s) provided to the user, a new ticket should be made to modify the creation of "MULTIERROR" in `Executive::updateProxyMessages` Example output of worker error (with msgSource limit set at 1): | queryId | msgSource | chunkId | code | message | severity | timestamp | +---------+-------------+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------+ | 1 | MERGE | 6631 | 1210 | MERGE_ERROR 1470 (Error(s) in result for chunk # 6631 : [ 1064 ] primeResult error You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'cSELECT `qservTest_case01_qserv.Object`.`objectId` AS `objectId`,`qservTest_c...' at line 1 ) 1652830964 | ERROR | 1652830000 | | 1 | CANCELTOTAL | - 1 | 0 | CANCEL_count= 7 EXECFAIL_count= 4 COMPLETE_count= 0 | INFO | 1652830000 | | 1 | MANY_MERGE | - 1 | 0 | Too many of msgSource MERGE to log all. Total= 2 | ERROR | 1652830000 |  
            Hide
            jgates John Gates added a comment - - edited

            Essentially the same query with msgSource limit set to 3. The number of actual MERGE errors is variable due to a race sending out and answering queries. If there were 4, there'd be a MANY_MERGE entry with 4. 

            | queryId | msgSource   | chunkId | code | message                                                                                                                                                                                                                                                                                                                         | severity | timestamp  |
            +---------+-------------+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------+
            |       1 | MERGE       |    6631 | 1210 | MERGE_ERROR 1470 (Error(s) in result for chunk #6631: [1064] primeResult error You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'cSELECT `qservTest_case01_qserv.Object`.`objectId` AS `objectId`,`qservTest_c...' at line 1) 1652832106 | ERROR    | 1652830000 |
            |       1 | MERGE       |    6630 | 1210 | MERGE_ERROR 1470 (Error(s) in result for chunk #6630: [1064] primeResult error You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'cSELECT `qservTest_case01_qserv.Object`.`objectId` AS `objectId`,`qservTest_c...' at line 1) 1652832106 | ERROR    | 1652830000 |
            |       1 | MERGE       |    6970 | 1210 | MERGE_ERROR 1470 (Error(s) in result for chunk #6970: [1064] primeResult error You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'cSELECT `qservTest_case01_qserv.Object`.`objectId` AS `objectId`,`qservTest_c...' at line 1) 1652832106 | ERROR    | 1652830000 |
            |       1 | CANCELTOTAL |      -1 |    0 | {"CANCEL_count":3, "EXECFAIL_count":7, "COMPLETE_count":0}                                                                                                                                                                                                                                                                      | INFO     | 1652830000 |
            

            Show
            jgates John Gates added a comment - - edited Essentially the same query with msgSource limit set to 3. The number of actual MERGE errors is variable due to a race sending out and answering queries. If there were 4, there'd be a MANY_MERGE entry with 4.  | queryId | msgSource | chunkId | code | message | severity | timestamp | +---------+-------------+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+------------+ | 1 | MERGE | 6631 | 1210 | MERGE_ERROR 1470 (Error(s) in result for chunk # 6631 : [ 1064 ] primeResult error You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'cSELECT `qservTest_case01_qserv.Object`.`objectId` AS `objectId`,`qservTest_c...' at line 1 ) 1652832106 | ERROR | 1652830000 | | 1 | MERGE | 6630 | 1210 | MERGE_ERROR 1470 (Error(s) in result for chunk # 6630 : [ 1064 ] primeResult error You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'cSELECT `qservTest_case01_qserv.Object`.`objectId` AS `objectId`,`qservTest_c...' at line 1 ) 1652832106 | ERROR | 1652830000 | | 1 | MERGE | 6970 | 1210 | MERGE_ERROR 1470 (Error(s) in result for chunk # 6970 : [ 1064 ] primeResult error You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'cSELECT `qservTest_case01_qserv.Object`.`objectId` AS `objectId`,`qservTest_c...' at line 1 ) 1652832106 | ERROR | 1652830000 | | 1 | CANCELTOTAL | - 1 | 0 | { "CANCEL_count" : 3 , "EXECFAIL_count" : 7 , "COMPLETE_count" : 0 } | INFO | 1652830000 |
            Hide
            gapon Igor Gaponenko added a comment -

            Looks good to me.

            Show
            gapon Igor Gaponenko added a comment - Looks good to me.

              People

              Assignee:
              jgates John Gates
              Reporter:
              gapon Igor Gaponenko
              Reviewers:
              Igor Gaponenko
              Watchers:
              Fritz Mueller, Igor Gaponenko, John Gates
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Jenkins

                  No builds found.