  1. Data Management
  2. DM-7380

Fix missing rows in queries on the cluster

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      When running on the cluster, identical queries may return varying numbers of rows. This needs to be investigated and fixed.

            Activity

            jgates John Gates added a comment -

            Another aspect of the worker dying is that the czar was unaware and waited for jobs that would never return. I believe that a worker dying after provisioning had finished would cause QueryRequest::ProcessResponseData to return with an error. It is possible that xrootd is calling it with an error but qserv is not reading the error correctly.

            jgates John Gates added a comment -

            The worker deaths appear to have been caused by a race condition. A user query was deemed too slow by the worker, and its tasks were placed on the list of tasks to be removed from a scheduler. While on that list, one of those tasks also needed to be removed from the scheduler because it had a large result. That task would then cause the thread it was running on to be removed from the pool twice. To fix this, the task now resets its pointer to the PoolEventThread object so that the PoolEventThread object cannot be released a second time.
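
The fix described above can be illustrated with a minimal sketch, assuming the task holds a shared pointer to the thread it runs on. The names `PoolThreadSketch`, `TaskSketch`, and `releaseThread` are hypothetical stand-ins for illustration and are not the actual qserv classes or methods. The idea is that whichever removal path arrives first takes the pointer and clears it, so a second path finds nothing left to release.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for a pool thread (not the real PoolEventThread).
class PoolThreadSketch {
public:
    // Leaving the pool twice is the bug being modeled, so treat it as fatal.
    void leavePool() {
        assert(!_left && "thread removed from the pool twice");
        _left = true;
    }
    bool hasLeft() const { return _left; }

private:
    bool _left = false;
};

// Hypothetical stand-in for a task that may be removed from its scheduler
// by two independent paths (slow query, large result).
class TaskSketch {
public:
    explicit TaskSketch(std::shared_ptr<PoolThreadSketch> thread)
        : _thread(std::move(thread)) {}

    // Returns true only for the call that actually released the thread.
    bool releaseThread() {
        std::shared_ptr<PoolThreadSketch> thread;
        {
            // Take the pointer and leave the member empty under the lock,
            // so concurrent callers cannot both obtain it.
            std::lock_guard<std::mutex> lock(_mtx);
            thread.swap(_thread);
        }
        if (!thread) {
            return false;  // Already released by the other path.
        }
        thread->leavePool();
        return true;
    }

private:
    std::mutex _mtx;
    std::shared_ptr<PoolThreadSketch> _thread;
};
```

In this sketch, a first call to `releaseThread()` removes the thread from the pool and a second call becomes a no-op, regardless of which removal path makes it.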

            salnikov Andy Salnikov added a comment -

            John, this looks OK, though I do not quite understand all the intricacies. I left a few minor comments on the PR. Igor, please mark as reviewed when you are done.

            gapon Igor Gaponenko added a comment -

            The code looks good to me. My only minor comment concerns an inconsistency in how contexts are injected into the message log streams.

            jgates John Gates added a comment -

            Thank you Andy and Igor.


              People

              • Assignee:
                jgates John Gates
                Reporter:
                jgates John Gates
                Reviewers:
                Andy Salnikov, Igor Gaponenko
                Watchers:
                Andy Salnikov, Igor Gaponenko, John Gates
