  1. Data Management
  2. DM-9321

Change qserv to use xrootd features to improve performance on interactive queries.

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Sprint:
      DB_S17_2
    • Team:
      Data Access and Database

      Description

      Call SetCBThreads to increase the maximum number of threads. The call must be made before calling getservice.

      Use PRD_Hold in ProcessResponseData calls to keep responses to jobs from preventing new jobs from being sent to workers.

          Activity

          jgates John Gates added a comment - edited

          The problem this is meant to solve is that interactive queries could take an extremely long time while a large result set was being received by the czar. This is now working, so interactive queries only take slightly longer (a few seconds) while a large result is being received. It did not go as smoothly as I was hoping, and some of the results were not what I expected.

          The original xrootd change is meant to use as few network resources as possible while throttling large results being moved from workers to the czar. The spare network resources could then be used to send out queries. This was not enough by itself. Pausing all incoming large result blocks while the jobs for a new user query are being sent to workers, and significantly reducing the size of the first result sent back by the workers for each job in a task allowed the jobs for the new user query to go out quickly.
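
          The pausing described above can be pictured as a gate that result-handling threads check before pulling the next large block. The following is only a minimal sketch under assumptions: the class name LargeResultGate and its methods are hypothetical and are not taken from the Qserv code or from the xrootd API.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical gate (illustration only, not Qserv's LargeResultMgr).
// While any user query is still sending its jobs to workers, threads
// that want to read the next large result block are held back.
class LargeResultGate {
public:
    // Call when the czar starts sending jobs for a new user query.
    void beginDispatch() {
        std::lock_guard<std::mutex> lock(_mtx);
        ++_dispatching;
    }

    // Call when all jobs for that user query have been sent.
    void endDispatch() {
        {
            std::lock_guard<std::mutex> lock(_mtx);
            assert(_dispatching > 0);  // must pair with beginDispatch()
            --_dispatching;
        }
        _cv.notify_all();  // wake any readers that were held
    }

    // Call before requesting the next large result block.
    // Blocks for as long as at least one dispatch is in progress.
    void waitUntilClear() {
        std::unique_lock<std::mutex> lock(_mtx);
        _cv.wait(lock, [this] { return _dispatching == 0; });
    }

    // True while incoming large result blocks are being held.
    bool isHeld() const {
        std::lock_guard<std::mutex> lock(_mtx);
        return _dispatching > 0;
    }

private:
    mutable std::mutex _mtx;
    std::condition_variable _cv;
    int _dispatching = 0;  // number of user queries currently sending jobs
};
```

          A counter is used rather than a single flag so that two user queries starting at nearly the same time both keep the gate closed until each has finished sending its jobs.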

          Since the code to send jobs out to the workers was single threaded and mixed with a significant amount of unrelated code, I pulled out the code that sends the jobs and ran it concurrently in a thread pool. I expected a significant increase in speed. Instead, with a pool of 100 threads, it took about 25% longer to send the jobs out. Using 10 threads, it takes about the same amount of time as using a single thread. This is while running a single user query.
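
          For reference, sending jobs from a fixed number of threads could look roughly like the sketch below. It is an assumption-laden illustration: dispatchJobs and the sendJob callback are hypothetical names, not the actual Executive or JobQuery code, and the sketch does not reproduce the timings reported above. Varying threadCount (1, 10, 100) is the knob that was compared.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (illustration only): calls sendJob(i) once for every
// i in [0, jobCount), spreading the calls over threadCount threads.
// Returns the number of jobs that were sent.
inline std::size_t dispatchJobs(std::size_t jobCount, std::size_t threadCount,
                                const std::function<void(std::size_t)>& sendJob) {
    assert(sendJob);                       // a callable must be supplied
    if (threadCount == 0) threadCount = 1; // fall back to a single thread

    std::atomic<std::size_t> next{0};      // index of the next job to claim
    std::atomic<std::size_t> sent{0};      // jobs completed so far

    // Each thread repeatedly claims the next unsent job until none remain.
    auto worker = [&] {
        for (;;) {
            std::size_t i = next.fetch_add(1);
            if (i >= jobCount) break;
            sendJob(i);
            sent.fetch_add(1);
        }
    };

    std::vector<std::thread> pool;
    pool.reserve(threadCount);
    for (std::size_t t = 0; t < threadCount; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return sent.load();
}
```

          If the send path is serialized on a shared lock or a single connection, adding threads to a loop like this mostly adds contention, which would be consistent with the 100-thread pool being slower than a single thread.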

          Also, using SetCBThreads(3000, 300); would cause the system to be unstable. SetCBThreads(1000, 100); appears to be fine. The default was something like 300, 0.

          Aside from that, it appears to work well. Loads across the cluster remain reasonable and SELECT COUNT( * ) FROM Object; takes from 20 to 45 seconds.

          jgates John Gates added a comment -

          To keep the number of singleton/statically initialized classes down, I hung LargeResultMgr off of Czar. It seems unavoidable to have Czar be statically created and it seems like a reasonable place for such things to be located.

          The code appears to work, but in one case where several large result queries were started, the czar got bound up and failed.

          SSI V2.0 is a different API. It is critical that the logic in Executive::startAllJobs() and JobQuery::runJob() works with it, or changes will need to be made.

          salnikov Andy Salnikov added a comment -

          My comments are on the PR; removing myself from reviewers.

          abh Andy Hanushevsky added a comment -

          OK, I finished my review. Looks relatively clean with some nitpicking.


            People

            Assignee:
            jgates John Gates
            Reporter:
            jgates John Gates
            Reviewers:
            Andy Hanushevsky
            Watchers:
            Andy Hanushevsky, Andy Salnikov, John Gates