  Data Management / DM-7380

Fix missing rows in queries on the cluster

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      When running on the cluster, identical queries may return varying numbers of rows. This needs to be investigated and fixed.

        Attachments

          Issue Links

            Activity

            jgates John Gates added a comment -

            I've managed to reproduce the problem. It requires a significant query load (24 scans on ForcedSource and 1 scan on Object). The biggest concern to me is that "SELECT count from Object WHERE u_apFluxSigma between 2.19e-30 and 2.2e-30;" came back with a count of 7,786,625 when it should have been 19,613,100.

            jgates John Gates added a comment -

            Andy Hanushevsky sent the following

            Here are the directives that you may want to try if overloaded servers are causing a problem:

            cms.fxhold 168h

            cms.delay lookup 2 qdl 30

            cms.trace files

            The fxhold simply says do not discard items in the cache until they are 7 days old. I suppose we should put on the docket the option to say "never".

            The delay simply says that the cmsd will have a 30 second window (qdl) within which to establish resource location. If the location is not known when the client asks and can't be established within 178 milliseconds, then the client waits for 2 seconds and asks again. If after 30 seconds there is no response, then the resource does not exist or is not available. We can tune the 178 millisecond window, but we need a better feel for how many requests will hit the cmsd at the same time before we change that value; otherwise we are just shooting in the dark. The 30 second window can be increased if need be; in terms of qserv we will never be looking for things that don't exist, but let's start off with 30 seconds. An explanation of how all of this works can be found at:

            http://xrootd.org/doc/dev44/cms_config.htm#_Toc454223046

            (the graphic seems to be missing though, sigh).

            The trace simply says to record what you are looking for and who responds.

            All of these directives need only be specified for the redirector.
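
            For concreteness, a hedged sketch of how these three directives might sit together in the redirector-only portion of a shared cmsd configuration file; the "if <hostpattern> ... fi" conditional is standard xrootd config syntax, and the host pattern here is a placeholder, not taken from this ticket:

                # Redirector-only section of a shared cmsd config (sketch; host pattern is hypothetical)
                if qserv-czar*
                   cms.fxhold 168h
                   cms.delay lookup 2 qdl 30
                   cms.trace files
                fi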

            For the server (worker) cmsd's, it would be good to nice them up as high as you are comfortable. That would allow the kernel to dispatch the cmsd as soon as it can to answer questions. The cmsd doesn't use many resources at all, so there won't be any impact.

            Now, that said, the issue here is that if there is a huge amount of I/O going on in a worker, things will not be dispatched all that quickly (at least that's what I've seen). It's also likely that the machine is being thrashed and productivity is plummeting. So, it would behoove us to better control the load so we don't get into a "non-responsive" situation, as at that point we can't tell if the node is broken or just slow. When you run 25 shared scans, what are the load figures from "uptime"? Anything above 1 after dividing by the number of cores is worrisome. If you hit 5 you are likely going to see large delays.

            A nice discussion can be found here:
            http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
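
            As a rough illustration of that per-core rule of thumb (not qserv code; getloadavg() is the glibc/BSD call behind the numbers "uptime" reports):

                #include <stdlib.h>   // getloadavg (glibc/BSD extension, not standard C++)
                #include <iostream>
                #include <thread>

                int main() {
                    double load1 = 0.0;
                    if (getloadavg(&load1, 1) < 1) {
                        return 1;  // load averages not available on this platform
                    }
                    unsigned cores = std::thread::hardware_concurrency();
                    if (cores == 0) cores = 1;  // hardware_concurrency() may report 0 when unknown
                    std::cout << "1-minute load per core: " << load1 / cores << "\n";
                    // Rule of thumb from the comment above: more than ~1 per core is worrisome,
                    // around 5 per core likely means large delays.
                    return 0;
                }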

            jgates John Gates added a comment - edited

            On the czar, I see errors like
            [2016-09-13T18:28:52.501Z] [LWP:825] ERROR qdisp.JobQuery (core/modules/qdisp/JobQuery.cc:103) - QI=471040:3809; provisioning failed, msg=[ERROR] No more free SIDs code=14

            jgates John Gates added a comment -

            I believe I have found the problem. The code was taking a different logic path than I thought it had taken, and throwing Bug. The message in the Bug wasn't showing up in the log in the thread where it was thrown, and was well enough hidden in the log file that I did not see it. The fix is simple. I also changed Bug to log its message when it is created, so it should be much easier to track things like this down in the future.

            The actual problem was markCompleted couldn't find the job in the map of incomplete jobs, and threw Bug. Not finding the job in the map is reasonable when the user query has been cancelled, so I changed the code to only throw Bug if the user query has not been cancelled.
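
            A minimal, self-contained sketch of the shape of that change (class and member names here are hypothetical, not the actual qserv Executive code): a job missing from the incomplete-job map only becomes a Bug when the user query has not been cancelled.

                #include <map>
                #include <mutex>
                #include <stdexcept>
                #include <string>

                // Stand-in for qserv's Bug exception (the real one now also logs its message on construction).
                struct Bug : std::logic_error {
                    using std::logic_error::logic_error;
                };

                class ExecutiveSketch {
                public:
                    void addJob(int jobId) { std::lock_guard<std::mutex> lk(_mtx); _incomplete[jobId] = true; }
                    void cancel() { std::lock_guard<std::mutex> lk(_mtx); _cancelled = true; }

                    void markCompleted(int jobId) {
                        std::lock_guard<std::mutex> lk(_mtx);
                        auto iter = _incomplete.find(jobId);
                        if (iter == _incomplete.end()) {
                            if (!_cancelled) {
                                // Still an internal error while the query is live.
                                throw Bug("markCompleted: job " + std::to_string(jobId)
                                          + " missing from incomplete-job map");
                            }
                            return;  // expected after cancellation; nothing left to do
                        }
                        _incomplete.erase(iter);
                    }

                private:
                    std::mutex _mtx;
                    std::map<int, bool> _incomplete;
                    bool _cancelled = false;
                };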

            Cluster performance looks pretty good. I started about 25 full table scans on the cluster in about 30 seconds. Maybe 10 of them failed due to provisioning problems, but the czar remained alive. I spoon-fed the failed queries back to the czar, about one every 15 seconds. All queries completed with the correct number of rows. The load on the individual workers was pretty stable at about 16, and the aggregate load of the system was about 360. 26 of the queries were joins of ForcedSource and Object, which took about 1.5 hours. One query was on Object and was repeated several times, taking about 8 minutes each time.

            jgates John Gates added a comment -

            The row dropping problem looks like it is fixed. Queries either come back with all rows or fail. I've had the czar crash once since putting in the change, in a case where the load on the czar's node went up to about 6000. It happened while starting 27 full scan queries, and provisioning had been failing due to no free SIDs. I've been unable to reproduce the crash while running the czar under gdb, and didn't see anything interesting in the logs from the crash.

            jgates John Gates added a comment -

            The missing rows fix still appears to work, but I've found other problems. Under heavy load (22+ full table scan queries), a worker's xrootd may die. This looks like it is caused by examineAll booting a slow user query, which leads to this showing up in the core dump:

            #0  __pthread_cond_destroy (cond=0x114c798) at pthread_cond_destroy.c:67
            #1  0x00007f821a59c8c2 in lsst::qserv::util::Tracker::~Tracker (this=0x114c760, __in_chrg=<optimized out>) at core/modules/util/Command.h:44
            #2  0x00007f821a1db19e in lsst::qserv::util::PoolEventThread::leavePool (this=0x114c760, cmd=std::shared_ptr (count 4, weak 0) 0x15496a0) at core/modules/util/EventThread.cc:119
            #3  0x00007f821a60cc20 in lsst::qserv::wsched::ScanScheduler::removeTask (this=0x1149740, task=std::shared_ptr (count 4, weak 0) 0x1549530) at core/modules/wsched/ScanScheduler.cc:241
            #4  0x00007f821a5df304 in lsst::qserv::wpublish::QueriesAndChunks::_bootTask (this=0x114a000, uq=std::shared_ptr (count 2, weak 0) 0x7f8064a777b0, task=std::shared_ptr (count 4, weak 0) 0x1549530, 
                sched=std::shared_ptr (count 3, weak 646) 0x1149740) at core/modules/wpublish/QueriesAndChunks.cc:361
            #5  0x00007f821a5de9ed in lsst::qserv::wpublish::QueriesAndChunks::examineAll (this=0x114a000) at core/modules/wpublish/QueriesAndChunks.cc:297
            

            It appears that a Task is being deleted, which is not supposed to happen.

            jgates John Gates added a comment -

            Another aspect of the worker dying is that the czar was unaware and waited for jobs that would never return. I believe that a worker dying after provisioning had finished would cause QueryRequest::ProcessResponseData to return with an error. It is possible that xrootd is calling it with an error but qserv is not reading the error correctly.

            jgates John Gates added a comment -

            The worker deaths appear to have been caused by a race condition: a user query was deemed to be too slow by the worker and its tasks were placed on the list of tasks to be removed from a scheduler. While on that list, one of the tasks also needed to be removed from the scheduler because it had a large result. This caused the thread the task was running on to be removed from the pool twice. To fix this, the task now resets its pointer to the PoolEventThread object so that the PoolEventThread object cannot be released a second time.
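
            A hedged sketch of that pattern (names are illustrative, not the actual qserv Task/PoolEventThread code): the task hands out its thread pointer exactly once, so two competing removal paths cannot both detach the thread from the pool.

                #include <memory>

                // Stand-in for util::PoolEventThread; only the part relevant to the sketch.
                struct PoolEventThreadSketch {
                    void leavePool() { /* detach this thread from the pool (real work elided) */ }
                };

                class TaskSketch {
                public:
                    void setPoolEventThread(std::shared_ptr<PoolEventThreadSketch> const& pet) {
                        std::atomic_store(&_poolEventThread, pet);
                    }

                    // Atomically hand back the stored pointer and clear it. Whichever removal path
                    // gets here first (slow-query boot or large-result removal) receives a non-null
                    // pointer; the second caller gets nullptr and must not touch the pool again.
                    std::shared_ptr<PoolEventThreadSketch> takePoolEventThread() {
                        return std::atomic_exchange(&_poolEventThread,
                                                    std::shared_ptr<PoolEventThreadSketch>());
                    }

                private:
                    std::shared_ptr<PoolEventThreadSketch> _poolEventThread;
                };

                // Usage sketch: both removal paths call takePoolEventThread(); only one succeeds.
                inline void removeFromPoolOnce(TaskSketch& task) {
                    if (auto pet = task.takePoolEventThread()) {
                        pet->leavePool();
                    }
                }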

            salnikov Andy Salnikov added a comment -

            John, looks OK, though I do not quite understand all the intricacies. A few minor comments left on the PR. Igor, please mark as reviewed when you are done.

            gapon Igor Gaponenko added a comment -

            The code looks good to me. The only minor comment I have is regarding inconsistency in injecting contexts into the message log streams.

            jgates John Gates added a comment -

            Thank you Andy and Igor.


              People

              • Assignee: jgates John Gates
              • Reporter: jgates John Gates
              • Reviewers: Andy Salnikov, Igor Gaponenko
              • Watchers: Andy Salnikov, Igor Gaponenko, John Gates
              • Votes: 0
