Data Management / DM-27318

Occasional crashes of Qserv workers due to running out of system resources


    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      The problem

      Qserv workers have crashed on 3 occasions in the "small" Qserv cluster at NCSA. The first crash happened at worker qserv-db35 with the following info in the log file:

      [2020-10-23T18:01:06.117Z] {{LWP,194761}} INFO  xrdssi.msgs (xrootd:0) - 16997 messages lost!
      [2020-10-23T18:01:06.117Z] {{LWP,194761}} INFO  xrdssi.msgs (xrootd:0) - 2.1077:325@141.142.182.180 Ssi_WakeUp: 0:QI=310:17946;:/chk/wise_01/52714 [bound doRsp] respCBarg=1e9b014500000194
      [2020-10-23T18:01:06.117Z] {{LWP,133808}} INFO  xrdssi.msgs (xrootd:0) - 16997 messages lost!
      [2020-10-23T18:01:06.117Z] {{LWP,133808}} INFO  xrdssi.msgs (xrootd:0) - 6.1077:330@141.142.182.180 Ssi_close: QI=310:17924;:/chk/wise_01/52692 del=False
      [2020-10-23T18:01:06.117Z] {{LWP,194752}} INFO  xrdssi.msgs (xrootd:0) - 16997 messages lost!
      [2020-10-23T18:01:06.118Z] {{LWP,194752}} INFO  xrdssi.msgs (xrootd:0) - 1.1077:324@141.142.182.180 Ssi_Dispose: 0:??0?  [done odRsp] Recycling request...
      [2020-10-23T18:01:06.118Z] {{LWP,128120}} INFO  xrdssi.msgs (xrootd:0) - 16997 messages lost!
      [2020-10-23T18:01:06.118Z] {{LWP,128120}} INFO  xrdssi.msgs (xrootd:0) - 2.1077:325@141.142.182.180 Ssi_Finalize: 0:QI=310:17929;:/chk/wise_01/52697 [done odRsp] Calling Finished(False)
      terminate called after throwing an instance of 'std::system_error'
        what():  Resource temporarily unavailable
      

      The service was restarted, and Qserv returned to normal operation.
      The next crash was at node qserv-db33 with the following info in the log file:

      [2020-10-24T01:47:45.145Z] {{LWP,527526}} INFO  xrdssi.msgs (xrootd:0) - 29035 messages lost!
      [2020-10-24T01:47:45.145Z] {{LWP,527526}{QID,623#146284}} DEBUG util.InstanceCount (core/modules/util/InstanceCount.cc:50) - InstanceCount con Task=68172
      [2020-10-24T01:47:45.145Z] {{LWP,527526}} INFO  xrdssi.msgs (xrootd:0) - 3.1077:325@141.142.182.180 Ssi_GetRequest: 0:QI=623:146284;:/chk/wise_01/227813 [begun xqReq] sz=211
      [2020-10-24T01:47:45.145Z] {{LWP,527526}{QID,623#146284}} DEBUG wbase.Task (core/modules/wbase/Task.cc:112) - Task(...) this=0x7fd9482ac420 : showing 5 of count=68172 622_10, 622_100000, 622_100007, 622_100008, 622_100012
      [2020-10-24T01:47:45.145Z] {{LWP,527526}} INFO  xrdssi.msgs (xrootd:0) - 29035 messages lost!
      [2020-10-24T01:47:45.146Z] {{LWP,527526}{QID,623#146284}} DEBUG wsched.BlendScheduler (core/modules/wsched/BlendScheduler.cc:157) - BlendScheduler::queCmd
      [2020-10-24T01:47:45.146Z] {{LWP,527526}} INFO  xrdssi.msgs (xrootd:0) - 3.1077:325@141.142.182.180 Ssi_BindDone: 0:QI=623:146284;:/chk/wise_01/227813 [begun xqReq] Bind called; for request 0
      [2020-10-24T01:47:45.146Z] {{LWP,527526}} INFO  xrdssi.msgs (xrootd:0) - 29035 messages lost!
      [2020-10-24T01:47:45.146Z] {{LWP,527526}} INFO  xrdssi.msgs (xrootd:0) - 3.1077:325@141.142.182.180 Ssi_RelReqBuff: 0:QI=623:146284;:/chk/wise_01/227813 [bound xqReq] called
        what():  Resource temporarily unavailable
      

      And the third crash was at qserv-db35:

      [2020-10-23T20:07:44.379Z] {{LWP,320857}} DEBUG xrdsvc.SsiRequest (core/modules/xrdsvc/SsiRequest.cc:88) - GetRequest took 6e-06 seconds
      [2020-10-23T20:07:44.379Z] {{LWP,320857}} INFO  xrdssi.msgs (xrootd:0) - 2.1077:326@141.142.182.180 Ssi_GetRequest: 0:QI=372:144336;:/chk/wise_01/213543 [begun xqReq] sz=211
      [2020-10-23T20:07:44.379Z] {{LWP,320857}} DEBUG xrdsvc.SsiRequest (core/modules/xrdsvc/SsiRequest.cc:117) - Decoding TaskMsg of size 211
      [2020-10-23T20:07:44.379Z] {{LWP,320857}} INFO  xrdssi.msgs (xrootd:0) - 2446 messages lost!
        what():  Resource temporarily unavailable[2020-10-23T20:07:44.379Z] {{LWP,320857}{QID,372#144336}} DEBUG util.InstanceCount (core/modules/util/InstanceCount.cc:50) - InstanceCount con Task=55717
      [2020-10-23T20:07:44.379Z] {{LWP,320857}} INFO  xrdssi.msgs (xrootd:0) - 2.1077:326@141.142.182.180 Ssi_BindDone: 0:QI=372:144336;:/chk/wise_01/213543 [begun xqReq] Bind called; for request 0
      [2020-10-23T20:07:44.379Z] {{LWP,320857}{QID,372#144336}} DEBUG wbase.Task (core/modules/wbase/Task.cc:112) - Task(...) this=0x7f68b42bc850 : showing 5 of count=55717 361_100000, 361_100002, 361_100043, 361_100090, 361_100148
      [2020-10-23T20:07:44.379Z] {{LWP,320857}} INFO  xrdssi.msgs (xrootd:0) - 2.1077:326@141.142.182.180 Ssi_RelReqBuff: 0:QI=372:144336;:/chk/wise_01/213543 [bound xqReq] called
      [2020-10-23T20:07:44.379Z] {{LWP,320857}{QID,372#144336}} DEBUG wsched.BlendScheduler (core/modules/wsched/BlendScheduler.cc:157) - BlendScheduler::queCmd
      

      Observations and theories

      Deep inspection of the crash stack wasn't immediately possible due to the lack of core files in the Qserv worker containers. In all three cases the crash was caused by an unhandled exception:

      what():  Resource temporarily unavailable
      

      Another observation from the workers' log files was that at the time of the crashes Qserv was executing the same "shared scan" query (obtained from the Qserv czar's MySQL):

      mysql --protocol=tcp -hlocalhost -P3306 -uqsmaster  qservMeta
      

      SELECT queryId, query FROM qservMeta.QInfo WHERE queryId IN (310,372,623);
      +---------+--------------------------------------------------------+
      | queryId | query                                                  |
      +---------+--------------------------------------------------------+
      |     310 | SELECT 'monkey',COUNT(*) FROM wise_01.allwise_p3as_mep |
      |     372 | SELECT 'monkey',COUNT(*) FROM wise_01.allwise_p3as_mep |
      |     623 | SELECT 'monkey',COUNT(*) FROM wise_01.allwise_p3as_mep |
      +---------+--------------------------------------------------------+
      

      The first theory was that the system ran out of thread/process resources. The theory was ruled out after inspecting the corresponding system and "qserv" account settings at the qserv-db35 worker node where the first crash happened (the checks were made from inside the container as user "qserv"):

      % cat /proc/sys/kernel/pid_max
      1026524
      % ulimit -a
      core file size          (blocks, -c) unlimited
      data seg size           (kbytes, -d) unlimited
      scheduling priority             (-e) 0
      file size               (blocks, -f) unlimited
      pending signals                 (-i) 512735
      max locked memory       (kbytes, -l) 10485760
      max memory size         (kbytes, -m) unlimited
      open files                      (-n) 1048576
      pipe size            (512 bytes, -p) 8
      POSIX message queues     (bytes, -q) 819200
      real-time priority              (-r) 0
      stack size              (kbytes, -s) 8192
      cpu time               (seconds, -t) unlimited
      max user processes              (-u) unlimited
      virtual memory          (kbytes, -v) unlimited
      file locks                      (-x) unlimited
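
      For completeness, there are other limits that can make thread creation fail with EAGAIN ("Resource temporarily unavailable") even when the per-user settings above look unlimited. A sketch of additional checks (not performed as part of the inspection above; the cgroup path assumes a cgroup v1 layout and may not be present in every container):

      % cat /proc/sys/kernel/threads-max
      % cat /proc/sys/vm/max_map_count            # each thread stack adds memory mappings
      % cat /sys/fs/cgroup/pids/pids.max          # per-container PID/thread limit, if mounted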
      

      Why were core files not generated on the Qserv workers?

      There is no obvious answer to this. By default, core files (if enabled) are written into the current working directory of a process. In the case of the crashed xrootd process of the Qserv worker, this directory was set as follows (checked from inside the running Qserv container at worker qserv-db35):

      % ps -ef | grep xroot
      qserv        679       1  0 Sep24 ?        06:57:31 /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/xrootd/lsst-dev-gdbde5f0c33/bin/xrootd -c /qserv/run/etc/lsp.cf -l @libXrdSsiLog.so -n worker -I v4 -+xrdssi /qserv/run/etc/xrdssi.cnf
      % ls -al /proc/679/cwd
      lrwxrwxrwx 1 qserv qserv 0 Oct 26 20:16 /proc/679/cwd -> /qserv/worker
      

      This folder is writable by user "qserv" from inside the container:

      % id
      uid=1000(qserv) gid=1000(qserv) groups=1000(qserv)
      % touch /qserv/worker/a.txt
      % ls -al /qserv/worker
      total 0
      drwxr-xr-x 2 qserv qserv  19 Oct 26 20:16 .
      drwxr-xr-x 1 qserv qserv 103 Sep 24 22:00 ..
      -rw-rw-r-- 1 qserv qserv   0 Oct 26 20:19 a.txt
      

      Perhaps the problem is that the actual working directory of the daemon processes is the root folder?

      /
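
      Another thing worth checking (a suggestion, not something inspected in this ticket) is the kernel core_pattern, which overrides the working-directory default; if it pipes cores to a helper such as systemd-coredump or abrt, no core file would ever appear in the process's working directory:

      % cat /proc/sys/kernel/core_pattern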
      

      Planned investigation

      • reconfigure all nodes of the cluster to allow core files in the temporary folder /tmp, as was done earlier in the "large" Qserv cluster (a configuration sketch follows this list)
      • try reproducing the problem by launching multiple "shared scan" queries to stress the service
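
      A minimal sketch of the first item, run as root on each worker host (the exact pattern and target directory are assumptions, not the configuration actually used in the "large" cluster):

      % mkdir -p /tmp/core-files && chmod 1777 /tmp/core-files
      % echo '/tmp/core-files/core.%e.%p' > /proc/sys/kernel/core_pattern
      % # and, inside the container, make sure the core size limit stays unlimited:
      % ulimit -c unlimited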

            Activity

            Igor Gaponenko added a comment (edited)

            Reconfigure Qserv and Replication system management tools

            This is needed to allow mounting the previously created folder /qserv/qserv-prod/tmp/core-files into all relevant containers so that core files are stored on the host filesystem. The additional mount point for the master and worker containers has been set up at both clusters as:

            -v /qserv/qserv-prod/core-files:/tmp/core-files
            

            Changes were made to Git package: https://github.com/lsst-dm/qserv-ncsa
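
            For illustration only, the extra mount would appear in the worker's docker run command along these lines (the container name, the image tag, and the omitted options are placeholders, not taken from the actual deployment scripts):

            docker run -d --name qserv \
                -v /qserv/qserv-prod/core-files:/tmp/core-files \
                ... \
                qserv/qserv:tickets_DM-27334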

            Igor Gaponenko added a comment

            Conclusions, a solution, and further actions

            The root cause of the crashes is unchecked (unconstrained) creation of threads by the worker's ThreadPool. This problem will be addressed in https://jira.lsstcorp.org/browse/DM-27334.

            Once the new version of Qserv is available, it will have to be re-tested, first in the "large" cluster, and then in the "small" one.

            During these tests the following extra actions should be planned:

            • monitoring memory consumption of the workers using top
            • monitoring the number of threads allocated by the workers using atop (a sampling sketch follows this list)
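
            A rough sketch of how the thread count and memory footprint of a worker's xrootd process could be sampled from the host, in addition to top/atop (the pgrep pattern is an assumption):

            pid=$(pgrep -f 'bin/xrootd' | head -1)
            while sleep 60; do
                date
                grep -E '^(Threads|VmRSS)' /proc/${pid}/status
            done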
            Igor Gaponenko added a comment (edited)

            Testing the new version of Qserv as per DM-27334 in the "large" cluster

            The new version of the container was deployed in the cluster:

            qserv/qserv:tickets_DM-27334
            

            The new version has been tested by launching 4, 8, 16 and up to 32 parallel queries of the following kind:

            for ra_max in $(seq --format="0.%02.0f" 1 4); do
                mysql --protocol=tcp -hlocalhost -P4040 -uqsmaster -e "SELECT MAX(ra) FROM wise_00.allwise_p3as_mep WHERE ra < ${ra_max}"&
            done
            for ra_max in $(seq --format="0.%02.0f" 1 4); do
                mysql --protocol=tcp -hlocalhost -P4040 -uqsmaster -e "SELECT ra FROM wise_00.allwise_p3as_psd WHERE ra < ${ra_max}" >& /dev/null&
            done
            

            Nothing unusual, no performance degradation, and no excessive resource consumption was observed during these tests. All queries finished successfully.

            Igor Gaponenko added a comment (edited)

            Testing the new version of Qserv as per DM-27334 in the "small" cluster

            The new version of the container has been deployed in the cluster:

            qserv/qserv:tickets_DM-27334
            

            The new version has been tested by launching 4 parallel queries of the following kind:

            for ra_max in $(seq --format="0.%02.0f" 1 4); do
                mysql --protocol=tcp -hlocalhost -P4040 -uqsmaster -e "SELECT 'stress-test-for-core-dump',COUNT(*) FROM wise_01.allwise_p3as_mep"&
            done
            

            PROBLEM: all queries failed because all 6 workers died.

            Investigating worker crashes

            Started with worker qserv-db31. The core file was found on the host filesystem, as configured earlier today:

            % ls -l /qserv/qserv-prod/core-files/
            total 5142200
            -rw------- 1 qserv qserv 32081813504 Oct 29 18:38 core.764
            

            Unfortunately, the log files of the xrootd process contained nothing useful.

            However, the log file of worker qserv-db32 has this:

            tail /qserv/qserv-prod/log/xrootd.log
            [2020-10-29T23:38:56.586Z] {{LWP,13142}} INFO  xrdssi.msgs (xrootd:0) - ?R?_[2020-10-29T23:38:56.586Z] {{LWP,13142}} INFO  xrdssi.msgs (xrootd:0) - 6.1078:327@141.142.182.180 Ssi_open: QI=7105:31378;:/chk/wise_01/70986 prepared.
            [2020-10-29T23:38:56.586Z] {{LWP,6449}{QID,7103#117189}} INFO  wsched.GroupScheduler (core/modules/wsched/GroupScheduler.cc:123) - SchedGroup queCmd uqCount=19272 rating=0 interactive=0
            [2020-10-29T23:38:56.586Z] {{LWP,13164}{QID,7104#27648}} DEBUG wsched.BlendScheduler (core/modules/wsched/BlendScheduler.cc:167) - Blend chose group scanTables.size=0 interactive=0
            [2020-10-29T23:38:56.586Z] {{LWP,13164}{QID,7104#27648}} DEBUG wsched.BlendScheduler (core/modules/wsched/BlendScheduler.cc:210) - Blend queCmd
            [2020-10-29T23:38:56.586Z] {{LWP,6449}{QID,7103#117189}} DEBUG util.Timer (core/modules/util/Timer.cc:84) - lockTime BlendScheduler::queCmd b toLock=10.7812 held=0.002949
            terminate called after throwing an instance of 'std::system_error[2020-10-29T23:38:56.586Z] {{LWP,6449}{QID,7103#117189}} DEBUG xrdsvc.SsiRequest (core/modules/xrdsvc/SsiRequest.cc:149) - Enqueued TaskMsg for Resource(/chk/wise_01/162840) in 10.7844 seconds
            '
            [2020-10-29T23:38:56.586Z] {{LWP,13142}} INFO  xrdssi.msgs (xrootd:0) - ?R?_[2020-10-29T23:38:56.586Z] {{LWP,13142}} INFO  xrdssi.msgs (xrootd:0) - 4.1078:328@141.142.182.180 Ssi_open: QI=7105:31316;:/chk/wise_01/70924 prepared.
            [2020-10-29T23:38:56.586Z] {{LWP,6449}} DEBUG xrdsvc.SsiProvider (core/modules/xrdsvc/SsiProvider.cc:165) - SsiProvider Query /chk/wise_01/152827 present
              what():  Resource temporarily unavailable
            

            CLEARLY this is the same problem as before.
            Worker qserv-db32 also has a core file:

            ls -l  /qserv/qserv-prod/core-files/
            total 5124820
            -rw------- 1 qserv qserv 32105029632 Oct 29 18:39 core.759
            

            Looking at the core file with gdb. Log into the container:

            docker exec -it qserv bash
            

            Install gdb:

            source /qserv/stack/loadLSST.bash 
            setup -t qserv-dev qserv_distrib
            conda install gdb
            

            Then:

            gdb \
              /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/xrootd/lsst-dev-gdbde5f0c33/bin/xrootd \
              /tmp/core-files/core.759
            

            The debugger reported a very large number of threads (probably on the order of 30k):

            [New LWP 6021]
            [New LWP 9662]
            ...
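
            The total number of threads captured in the core can be confirmed from gdb, for example (a sketch; this batch-mode invocation was not part of the original investigation):

            gdb -batch -ex 'info threads' \
              /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/xrootd/lsst-dev-gdbde5f0c33/bin/xrootd \
              /tmp/core-files/core.759 | grep -c '(LWP'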
            

            The crash stack looks the same as before:

            Program terminated with signal SIGABRT, Aborted.
            #0  0x00007fc214386337 in raise () from /lib64/libc.so.6
            [Current thread is 1 (Thread 0x7fbf74441700 (LWP 13163))]
            (gdb) where
            #0  0x00007fc214386337 in raise () from /lib64/libc.so.6
            #1  0x00007fc214387a28 in abort () from /lib64/libc.so.6
            #2  0x00007fc2147cb81c in __gnu_cxx::__verbose_terminate_handler ()
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
            #3  0x00007fc2147c9f19 in __cxxabiv1::__terminate (handler=<optimized out>)
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
            #4  0x00007fc2147c9f4f in std::terminate ()
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
            #5  0x00007fc2147ca12c in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7fc0103ac8e0, tinfo=tinfo@entry=0x7fc214884a50 <typeinfo for std::system_error>, 
                dest=dest@entry=0x7fc2147e5f9a <std::system_error::~system_error()>)
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
            #6  0x00007fc2147c72c3 in std::__throw_system_error (__i=11)
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/ext/new_allocator.h:89
            #7  0x00007fc2147e6300 in std::thread::_M_start_thread (this=this@entry=0x7fbf7443f3c8, state=...)
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:139
            #8  0x00007fc201ff016d in std::thread::thread<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> (__f=<optimized out>, this=0x7fbf7443f3c8)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:126
            #9  lsst::qserv::util::EventThread::run (this=0x7fc0103abd30) at core/modules/util/EventThread.cc:78
            #10 0x00007fc202006b20 in lsst::qserv::util::ThreadPool::_resize (this=this@entry=0x562e7e83e4f0) at core/modules/util/ThreadPool.cc:265
            #11 0x00007fc20200a821 in lsst::qserv::util::ThreadPool::release (this=0x562e7e83e4f0, thrd=thrd@entry=0x7fc0cc13e600) at core/modules/util/ThreadPool.cc:224
            #12 0x00007fc20200b052 in lsst::qserv::util::PoolEventThread::finishup (this=0x7fc0cc13e600) at core/modules/util/ThreadPool.cc:124
            #13 0x00007fc202004906 in lsst::qserv::util::PoolEventThread::leavePool (this=this@entry=0x7fc0cc13e600, cmd=...) at core/modules/util/ThreadPool.cc:101
            #14 0x00007fc2020049dd in lsst::qserv::util::PoolEventThread::leavePool (this=0x7fc0cc13e600) at core/modules/util/ThreadPool.cc:112
            #15 0x00007fc2020bd3f7 in lsst::qserv::wdb::QueryRunner::_fillRows (this=this@entry=0x7fc0103a2fb0, result=result@entry=0x7fc0103abbd0, numFields=numFields@entry=2, 
                rowCount=@0x7fbf7444068c: 1, tSize=@0x7fbf74440680: 39) at core/modules/wdb/QueryRunner.cc:273
            #16 0x00007fc2020beaa3 in lsst::qserv::wdb::QueryRunner::_dispatchChannel (this=this@entry=0x7fc0103a2fb0) at core/modules/wdb/QueryRunner.cc:485
            #17 0x00007fc2020bffba in lsst::qserv::wdb::QueryRunner::runQuery (this=0x7fc0103a2fb0) at core/modules/wdb/QueryRunner.cc:188
            #18 0x00007fc2020a6757 in lsst::qserv::wcontrol::Foreman::<lambda(lsst::qserv::util::CmdData*)>::operator()(lsst::qserv::util::CmdData *) const (__closure=0x7fc154053520)
                at core/modules/wcontrol/Foreman.cc:115
            #19 0x00007fc2020a77af in std::_Function_handler<void(lsst::qserv::util::CmdData*), lsst::qserv::wcontrol::Foreman::processTask(const std::shared_ptr<lsst::qserv::wbase::Task>&)::<lambda(lsst::qserv::util::CmdData*)> >::_M_invoke(const std::_Any_data &, lsst::qserv::util::CmdData *&&) (__functor=..., __args#0=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/std_function.h:316
            #20 0x00007fc2020a444c in std::function<void (lsst::qserv::util::CmdData*)>::operator()(lsst::qserv::util::CmdData*) const (this=<optimized out>, __args#0=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/std_function.h:706
            #21 0x00007fc2020a4463 in lsst::qserv::util::Command::action (this=<optimized out>, data=<optimized out>) at core/modules/util/Command.h:77
            #22 0x00007fc201ff0049 in lsst::qserv::util::Command::runAction (data=0x7fc0cc13e600, this=<optimized out>) at core/modules/util/Command.h:81
            #23 lsst::qserv::util::EventThread::handleCmds (this=0x7fc0cc13e600) at core/modules/util/EventThread.cc:55
            #24 0x00007fc201ff25d2 in std::__invoke_impl<void, void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> (__t=<optimized out>, __f=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/invoke.h:73
            #25 std::__invoke<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> (__fn=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/invoke.h:95
            #26 std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> >::_M_invoke<0ul, 1ul> (this=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:234
            #27 std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> >::operator() (this=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:243
            #28 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> > >::_M_run (this=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:186
            #29 0x00007fc2147e6163 in std::execute_native_thread_routine (__p=0x7fc0cc13d7f0)
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
            #30 0x00007fc214899e65 in start_thread () from /lib64/libpthread.so.0
            #31 0x00007fc21444e88d in clone () from /lib64/libc.so.6
            

            Inspecting the configuration file of the worker:

            % cat /qserv/run/etc/xrdssi.cnf
            ...
            [scheduler]
             
            # Thread pool size
            thread_pool_size = 20
            max_pool_threads = 10000
             
            ...
            

            The configuration looks right.
            Unfortunately, nothing was found in the log file:

            tail -100000 run/var/log/xrootd.log | grep '_resize target'
            

            Further actions

            Set the log level to TRACE in the logger configuration of worker qserv-db32, restarted the worker, and checked the log:

            % run/etc/init.d/xrootd restart
            % tail -100000 run/var/log/xrootd.log | grep '_resize target'
            [2020-10-30T00:27:21.782Z] {{LWP,37448}} TRACE util.ThreadPool (core/modules/util/ThreadPool.cc:279) - _resize target=20 size=20
            [2020-10-30T00:27:21.784Z] {{LWP,37448}} TRACE util.ThreadPool (core/modules/util/ThreadPool.cc:279) - _resize target=20 size=20
            

            Further tests with log-file inspection demonstrated that ThreadPool never exceeds the specified limit. These are the highest numbers of threads reported for each worker:

            qserv-db31: 3652
            qserv-db32: 2109
            qserv-db33: 3265
            qserv-db34:  861
            qserv-db35: 2545
            qserv-db36: 2265
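
            These maxima were presumably extracted from the TRACE messages shown above; something along the following lines would pull out the peak pool size reported in a log (a sketch, not necessarily the exact command that was used):

            grep '_resize target' run/var/log/xrootd.log \
                | grep -o 'size=[0-9]*' | cut -d= -f2 \
                | sort -n | tail -1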
            

            Still, workers kept crashing.

            CONCLUSION: xrootd is the main suspect.

            Investigating and tuning up XRootD

            An XRootD expert confirmed that XRootD may, under certain conditions, consume that many threads. A solution was to configure the XRootD services with the following option, explained in https://xrootd.slac.stanford.edu/doc/dev48/xrd_config.htm#_Toc496911324:

            xrd.sched mint 8 maxt 2048 avlt 512 idle 780
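
            For reference, a rough reading of these parameters based on the linked XRootD documentation (paraphrased, not authoritative):

            # mint 8     - minimum (initial) number of scheduler threads
            # maxt 2048  - hard cap on the number of threads the scheduler may create
            # avlt 512   - spawn more threads when fewer than this many are idle and available
            # idle 780   - seconds an idle thread may linger before it is terminated
            xrd.sched mint 8 maxt 2048 avlt 512 idle 780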
            

            This configuration was put manually into each worker's container at:

            % tail /qserv/run/etc/lsp.cf 
            ...
            # Set limits for the number of threads open by XRootD services
            # See more details on these parameters at:
            # https://xrootd.slac.stanford.edu/doc/dev48/xrd_config.htm#_Toc496911324 
            xrd.sched mint 8 maxt 2048 avlt 512 idle 780
            

            Then worker services were restarted and 16 queries of the following kind launched:

            for i in $(seq 1 16); do
                mysql --protocol=tcp -P4040 -uqsmaster \
                    -e "SELECT 'stress-test-for-core-dump',COUNT(*) FROM wise_01.allwise_p3as_mep"&
            done
            

            All queries finished successfully.

            Igor Gaponenko added a comment (edited)

            Further actions

            • lower the default limit for the number of threads open by ThreadPool down to 3000
            • update the configuration file template with the xrootd thread limits shown above
            • give the "green light" to the branch developer to PR the ticket https://jira.lsstcorp.org/browse/DM-27334

              People

              Assignee:
              Igor Gaponenko
              Reporter:
              Igor Gaponenko
              Reviewers:
              Fritz Mueller
              Watchers:
              Andy Hanushevsky, Fritz Mueller, Igor Gaponenko, Nate Pease
