Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-30098

Spurious infrequent crashes at Qserv workers

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      The scope

      The ticket captures the low-frequency spurious crashes in the Qserv workers. The main purpose of the ticket is to collects statistics on the crashes in case if there will be any clear pattern seen over time.

      Each report shall be supported by the following information:

      • The date & time
      • Qserv version (branch, git hash, etc.)
      • The name of a cluster
      • The name of a worker
      • Crash stack (if available), a location of the core file
      • The relevant excerpts from the log files
      • Any other information that may be useful for further analysis

      Known issues

      At the time this ticket was filed, there was an ongoing investigation of crashes in serializing the JSON objects into the Protobuf messages: DM-26574.

        Attachments

          Issue Links

            Activity

            Hide
            gapon Igor Gaponenko added a comment - - edited

            2021-04-11: The "small" cluster at NCSA, worker db34

            The crash was only noticed on 2021-05-07. Qserv kept functioning for alost a month after the crash due to the Replication system maintaining 2 replicas of each chunk. It was only noticed after another worker db33 got crashed and reported in the next comment.

            Qserv version qserv/qserv:deps_20201210_0904
            Replication system's version qserv/replica:tools-DM-27681
            Qserv start time (from the log file) 2021-04-08 15:53:04
            Qserv crash time (from the logfile) 2021-04-11T05:17:22
            Logfile location /qserv/qserv-prod/log/logs.20210507.1/xrootd.log
            Core file location (the host OS filesystem) /qserv/qserv-prod/core-files/core.224556

            Crash stack:

            Program terminated with signal SIGABRT, Aborted.
            #0  0x00007f0469fc7337 in raise () from /lib64/libc.so.6
            [Current thread is 1 (Thread 0x7efe8f186700 (LWP 857256))]
            (gdb) where
            #0  0x00007f0469fc7337 in raise () from /lib64/libc.so.6
            #1  0x00007f0469fc8a28 in abort () from /lib64/libc.so.6
            #2  0x00007f046a009e87 in __libc_message () from /lib64/libc.so.6
            #3  0x00007f046a012679 in _int_free () from /lib64/libc.so.6
            #4  0x00007f0454bf34c1 in __gnu_cxx::new_allocator<char>::deallocate (__p=<optimized out>, this=0x7f03387e9160)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/ext/new_allocator.h:125
            #5  std::allocator_traits<std::allocator<char> >::deallocate (__n=<optimized out>, __p=<optimized out>, __a=...)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/alloc_traits.h:462
            #6  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_destroy (__size=<optimized out>, this=0x7f03387e9160)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/basic_string.h:226
            #7  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose (this=0x7f03387e9160)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/basic_string.h:221
            #8  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string (this=0x7f03387e9160, __in_chrg=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/basic_string.h:647
            #9  lsst::qserv::mysql::MySqlConfig::~MySqlConfig (this=0x7f03387e9120, __in_chrg=<optimized out>) at core/modules/mysql/MySqlConfig.h:50
            #10 lsst::qserv::xrdsvc::SsiRequest::~SsiRequest (this=0x7f03387e9030, __in_chrg=<optimized out>) at core/modules/xrdsvc/SsiRequest.cc:65
            #11 0x00007f0454bf3d99 in lsst::qserv::xrdsvc::SsiRequest::~SsiRequest (this=0x7f03387e9030, __in_chrg=<optimized out>) at core/modules/xrdsvc/SsiRequest.cc:68
            #12 0x00007f0454c07eff in std::_Sp_counted_ptr<lsst::qserv::xrdsvc::SsiRequest*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:376
            #13 0x00007f0454b25c4f in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f033857aaf0)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:154
            #14 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f0338c5dd50, __in_chrg=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:684
            #15 std::__shared_ptr<lsst::qserv::xrdsvc::SsiRequest, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f0338c5dd48, __in_chrg=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:1123
            #16 std::shared_ptr<lsst::qserv::xrdsvc::SsiRequest>::~shared_ptr (this=0x7f0338c5dd48, __in_chrg=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr.h:93
            #17 lsst::qserv::wbase::SendChannel::~SendChannel (this=0x7f0338c5dd20, __in_chrg=<optimized out>) at core/modules/wbase/SendChannel.h:55
            #18 0x00007f0454c000a9 in __gnu_cxx::new_allocator<lsst::qserv::wbase::SendChannel>::destroy<lsst::qserv::wbase::SendChannel> (__p=<optimized out>, this=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/ext/new_allocator.h:140
            #19 std::allocator_traits<std::allocator<lsst::qserv::wbase::SendChannel> >::destroy<lsst::qserv::wbase::SendChannel> (__p=<optimized out>, __a=...)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/alloc_traits.h:487
            #20 std::_Sp_counted_ptr_inplace<lsst::qserv::wbase::SendChannel, std::allocator<lsst::qserv::wbase::SendChannel>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:535
            #21 0x00007f0454b2f4ea in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f0338c5dd10)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:154
            #22 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:684
            #23 std::__shared_ptr<lsst::qserv::wbase::SendChannel, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:1123
            #24 std::__shared_ptr<lsst::qserv::wbase::SendChannel, (__gnu_cxx::_Lock_policy)2>::reset (this=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:1235
            #25 lsst::qserv::wcontrol::Foreman::<lambda(lsst::qserv::util::CmdData*)>::operator()(lsst::qserv::util::CmdData *) const (__closure=<optimized out>) at core/modules/wcontrol/Foreman.cc:124
            #26 0x00007f0454b2f7af in std::_Function_handler<void(lsst::qserv::util::CmdData*), lsst::qserv::wcontrol::Foreman::processTask(const std::shared_ptr<lsst::qserv::wbase::Task>&)::<lambda(lsst::qserv::util::CmdData*)> >::_M_invoke(const std::_Any_data &, lsst::qserv::util::CmdData *&&) (__functor=..., __args#0=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/std_function.h:316
            #27 0x00007f0454b2c44c in std::function<void (lsst::qserv::util::CmdData*)>::operator()(lsst::qserv::util::CmdData*) const (this=<optimized out>, __args#0=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/std_function.h:706
            #28 0x00007f0454b2c463 in lsst::qserv::util::Command::action (this=<optimized out>, data=<optimized out>) at core/modules/util/Command.h:77
            #29 0x00007f0454a78049 in lsst::qserv::util::Command::runAction (data=0x7f04242f2a40, this=<optimized out>) at core/modules/util/Command.h:81
            #30 lsst::qserv::util::EventThread::handleCmds (this=0x7f04242f2a40) at core/modules/util/EventThread.cc:55
            #31 0x00007f0454a7a5d2 in std::__invoke_impl<void, void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> (__t=<optimized out>, __f=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/invoke.h:73
            #32 std::__invoke<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> (__fn=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/invoke.h:95
            #33 std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> >::_M_invoke<0ul, 1ul> (this=<optimized out>)
            --Type <RET> for more, q to quit, c to continue without paging--
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:234
            #34 std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> >::operator() (this=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:243
            #35 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> > >::_M_run (this=<optimized out>)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:186
            #36 0x00007f046a427163 in std::execute_native_thread_routine (__p=0x7f0424130eb0)
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
            #37 0x00007f046a4dae65 in start_thread () from /lib64/libpthread.so.0
            #38 0x00007f046a08f88d in clone () from /lib64/libc.so.6
            

            Other observations:

            • the core file size is 31 GB
            • no "thread leak" observed/reported

            The last report in the log file (before the memory map printed by XRootD was) was:

            [2021-04-11T05:17:22.995Z] {{LWP,854939}{QID,195013#80723}} WARN  util.ThreadPool (core/modules/util/ThreadPool.cc:321) - wait before _poolThreadCount=3001
            *** Error in `/qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/xrootd/lsst-dev-2-gfb72ea7633/bin/xrootd': free(): invalid pointer: 0x00007f0454e7e2b5 ***
            ======= Backtrace: =========
            /lib64/libc.so.6(+0x81679)[0x7f046a012679]
            /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZN4lsst5qserv6xrdsvc10SsiRequestD1Ev+0x13f)[0x7f0454bf34c1]
            /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZN4lsst5qserv6xrdsvc10SsiRequestD0Ev+0x9)[0x7f0454bf3d99]
            /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZNSt15_Sp_counted_ptrIPN4lsst5qserv6xrdsvc10SsiRequestELN9__gnu_cxx12_Lock_policyE2EE10_M_di
            sposeEv+0x13)[0x7f0454c07eff]
            /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZN4lsst5qserv5wbase11SendChannelD2Ev+0x6f)[0x7f0454b25c4f]
            /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZNSt23_Sp_counted_ptr_inplaceIN4lsst5qserv5wbase11SendChannelESaIS3_ELN9__gnu_cxx12_Lock_pol
            icyE2EE10_M_disposeEv+0x11)[0x7f0454c000a9]
            /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(+0x6b4ea)[0x7f0454b2f4ea]
            /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(+0x6b7af)[0x7f0454b2f7af]
            /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZNKSt8functionIFvPN4lsst5qserv4util7CmdDataEEEclES4_+0x18)[0x7f0454b2c44c]
            /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZN4lsst5qserv4util7Command6actionEPNS1_7CmdDataE+0xd)[0x7f0454b2c463]
            /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libqserv_common.so(_ZN4lsst5qserv4util11EventThread10handleCmdsEv+0x1b9)[0x7f0454a78049]
            /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libqserv_common.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJMN4lsst5qserv4util11EventThreadEFvvEPS
            6_EEEEE6_M_runEv+0x22)[0x7f0454a7a5d2]
            /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/lib/libstdc++.so.6(+0xc8163)[0x7f046a427163]
            /lib64/libpthread.so.0(+0x7e65)[0x7f046a4dae65]
            /lib64/libc.so.6(clone+0x6d)[0x7f046a08f88d]
            ======= Memory map: ========
            ...
            

            Investigation status

            Has not been investigated.

            Show
            gapon Igor Gaponenko added a comment - - edited 2021-04-11: The "small" cluster at NCSA, worker db34 The crash was only noticed on 2021-05-07. Qserv kept functioning for alost a month after the crash due to the Replication system maintaining 2 replicas of each chunk. It was only noticed after another worker db33 got crashed and reported in the next comment. Qserv version qserv/qserv:deps_20201210_0904 Replication system's version qserv/replica:tools- DM-27681 Qserv start time (from the log file) 2021-04-08 15:53:04 Qserv crash time (from the logfile) 2021-04-11T05:17:22 Logfile location /qserv/qserv-prod/log/logs.20210507.1/xrootd.log Core file location (the host OS filesystem) /qserv/qserv-prod/core-files/core.224556 Crash stack: Program terminated with signal SIGABRT, Aborted. #0 0x00007f0469fc7337 in raise () from /lib64/libc.so.6 [Current thread is 1 (Thread 0x7efe8f186700 (LWP 857256))] (gdb) where #0 0x00007f0469fc7337 in raise () from /lib64/libc.so.6 #1 0x00007f0469fc8a28 in abort () from /lib64/libc.so.6 #2 0x00007f046a009e87 in __libc_message () from /lib64/libc.so.6 #3 0x00007f046a012679 in _int_free () from /lib64/libc.so.6 #4 0x00007f0454bf34c1 in __gnu_cxx::new_allocator<char>::deallocate (__p=<optimized out>, this=0x7f03387e9160) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/ext/new_allocator.h:125 #5 std::allocator_traits<std::allocator<char> >::deallocate (__n=<optimized out>, __p=<optimized out>, __a=...) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/alloc_traits.h:462 #6 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_destroy (__size=<optimized out>, this=0x7f03387e9160) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/basic_string.h:226 #7 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose (this=0x7f03387e9160) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/basic_string.h:221 #8 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string (this=0x7f03387e9160, __in_chrg=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/basic_string.h:647 #9 lsst::qserv::mysql::MySqlConfig::~MySqlConfig (this=0x7f03387e9120, __in_chrg=<optimized out>) at core/modules/mysql/MySqlConfig.h:50 #10 lsst::qserv::xrdsvc::SsiRequest::~SsiRequest (this=0x7f03387e9030, __in_chrg=<optimized out>) at core/modules/xrdsvc/SsiRequest.cc:65 #11 0x00007f0454bf3d99 in lsst::qserv::xrdsvc::SsiRequest::~SsiRequest (this=0x7f03387e9030, __in_chrg=<optimized out>) at core/modules/xrdsvc/SsiRequest.cc:68 #12 0x00007f0454c07eff in std::_Sp_counted_ptr<lsst::qserv::xrdsvc::SsiRequest*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:376 #13 0x00007f0454b25c4f in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f033857aaf0) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:154 #14 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f0338c5dd50, __in_chrg=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:684 #15 std::__shared_ptr<lsst::qserv::xrdsvc::SsiRequest, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f0338c5dd48, __in_chrg=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:1123 #16 std::shared_ptr<lsst::qserv::xrdsvc::SsiRequest>::~shared_ptr (this=0x7f0338c5dd48, __in_chrg=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr.h:93 #17 lsst::qserv::wbase::SendChannel::~SendChannel (this=0x7f0338c5dd20, __in_chrg=<optimized out>) at core/modules/wbase/SendChannel.h:55 #18 0x00007f0454c000a9 in __gnu_cxx::new_allocator<lsst::qserv::wbase::SendChannel>::destroy<lsst::qserv::wbase::SendChannel> (__p=<optimized out>, this=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/ext/new_allocator.h:140 #19 std::allocator_traits<std::allocator<lsst::qserv::wbase::SendChannel> >::destroy<lsst::qserv::wbase::SendChannel> (__p=<optimized out>, __a=...) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/alloc_traits.h:487 #20 std::_Sp_counted_ptr_inplace<lsst::qserv::wbase::SendChannel, std::allocator<lsst::qserv::wbase::SendChannel>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:535 #21 0x00007f0454b2f4ea in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f0338c5dd10) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:154 #22 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:684 #23 std::__shared_ptr<lsst::qserv::wbase::SendChannel, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:1123 #24 std::__shared_ptr<lsst::qserv::wbase::SendChannel, (__gnu_cxx::_Lock_policy)2>::reset (this=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:1235 #25 lsst::qserv::wcontrol::Foreman::<lambda(lsst::qserv::util::CmdData*)>::operator()(lsst::qserv::util::CmdData *) const (__closure=<optimized out>) at core/modules/wcontrol/Foreman.cc:124 #26 0x00007f0454b2f7af in std::_Function_handler<void(lsst::qserv::util::CmdData*), lsst::qserv::wcontrol::Foreman::processTask(const std::shared_ptr<lsst::qserv::wbase::Task>&)::<lambda(lsst::qserv::util::CmdData*)> >::_M_invoke(const std::_Any_data &, lsst::qserv::util::CmdData *&&) (__functor=..., __args#0=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/std_function.h:316 #27 0x00007f0454b2c44c in std::function<void (lsst::qserv::util::CmdData*)>::operator()(lsst::qserv::util::CmdData*) const (this=<optimized out>, __args#0=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/std_function.h:706 #28 0x00007f0454b2c463 in lsst::qserv::util::Command::action (this=<optimized out>, data=<optimized out>) at core/modules/util/Command.h:77 #29 0x00007f0454a78049 in lsst::qserv::util::Command::runAction (data=0x7f04242f2a40, this=<optimized out>) at core/modules/util/Command.h:81 #30 lsst::qserv::util::EventThread::handleCmds (this=0x7f04242f2a40) at core/modules/util/EventThread.cc:55 #31 0x00007f0454a7a5d2 in std::__invoke_impl<void, void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> (__t=<optimized out>, __f=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/invoke.h:73 #32 std::__invoke<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> (__fn=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/invoke.h:95 #33 std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> >::_M_invoke<0ul, 1ul> (this=<optimized out>) --Type <RET> for more, q to quit, c to continue without paging-- at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:234 #34 std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> >::operator() (this=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:243 #35 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> > >::_M_run (this=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:186 #36 0x00007f046a427163 in std::execute_native_thread_routine (__p=0x7f0424130eb0) at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:80 #37 0x00007f046a4dae65 in start_thread () from /lib64/libpthread.so.0 #38 0x00007f046a08f88d in clone () from /lib64/libc.so.6 Other observations: the core file size is 31 GB no "thread leak" observed/reported The last report in the log file (before the memory map printed by XRootD was) was: [2021-04-11T05:17:22.995Z] {{LWP,854939}{QID,195013#80723}} WARN util.ThreadPool (core/modules/util/ThreadPool.cc:321) - wait before _poolThreadCount=3001 *** Error in `/qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/xrootd/lsst-dev-2-gfb72ea7633/bin/xrootd': free(): invalid pointer: 0x00007f0454e7e2b5 *** ======= Backtrace: ========= /lib64/libc.so.6(+0x81679)[0x7f046a012679] /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZN4lsst5qserv6xrdsvc10SsiRequestD1Ev+0x13f)[0x7f0454bf34c1] /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZN4lsst5qserv6xrdsvc10SsiRequestD0Ev+0x9)[0x7f0454bf3d99] /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZNSt15_Sp_counted_ptrIPN4lsst5qserv6xrdsvc10SsiRequestELN9__gnu_cxx12_Lock_policyE2EE10_M_di sposeEv+0x13)[0x7f0454c07eff] /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZN4lsst5qserv5wbase11SendChannelD2Ev+0x6f)[0x7f0454b25c4f] /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZNSt23_Sp_counted_ptr_inplaceIN4lsst5qserv5wbase11SendChannelESaIS3_ELN9__gnu_cxx12_Lock_pol icyE2EE10_M_disposeEv+0x11)[0x7f0454c000a9] /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(+0x6b4ea)[0x7f0454b2f4ea] /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(+0x6b7af)[0x7f0454b2f7af] /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZNKSt8functionIFvPN4lsst5qserv4util7CmdDataEEEclES4_+0x18)[0x7f0454b2c44c] /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libxrdsvc.so(_ZN4lsst5qserv4util7Command6actionEPNS1_7CmdDataE+0xd)[0x7f0454b2c463] /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libqserv_common.so(_ZN4lsst5qserv4util11EventThread10handleCmdsEv+0x1b9)[0x7f0454a78049] /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/qserv/12.1-347-g76fc66aba+c6e70cf1cd/lib/libqserv_common.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJMN4lsst5qserv4util11EventThreadEFvvEPS 6_EEEEE6_M_runEv+0x22)[0x7f0454a7a5d2] /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/lib/libstdc++.so.6(+0xc8163)[0x7f046a427163] /lib64/libpthread.so.0(+0x7e65)[0x7f046a4dae65] /lib64/libc.so.6(clone+0x6d)[0x7f046a08f88d] ======= Memory map: ======== ... Investigation status Has not been investigated.
            Hide
            gapon Igor Gaponenko added a comment - - edited

            2021-05-07: The "small" cluster at NCSA, worker db33

            Qserv version qserv/qserv:deps_20201210_0904
            Replication system's version qserv/replica:tools-DM-27681
            Qserv start time (from the log file) 2021-04-08 15:53:04
            Qserv crash time (from the logfile) 2021-05-07T20:14:40
            Logfile location /qserv/qserv-prod/log/logs.210507.1/xrootd.log
            Core file location (the host OS filesystem) /qserv/qserv-prod/core-files/core.217565

            Crash stack:

            Program terminated with signal SIGSEGV, Segmentation fault.
            #0  0x00007f3f6948decc in free () from /lib64/libc.so.6
            [Current thread is 1 (Thread 0x7f3f3bdee700 (LWP 473201))]
            (gdb) where
            #0  0x00007f3f6948decc in free () from /lib64/libc.so.6
            #1  0x00007f3f57ab36a3 in std::deque<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_destroy_data_aux(std::_Deque_iterator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*>, std::_Deque_iterator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*>) () from /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/lib/liblog4cxx.so.10
            #2  0x00007f3f57ab3731 in log4cxx::helpers::ThreadSpecificData::~ThreadSpecificData() () from /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/lib/liblog4cxx.so.10
            #3  0x00007f3f57a25e0f in log4cxx::helpers::APRInitializer::tlsDestruct(void*) () from /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/lib/liblog4cxx.so.10
            #4  0x00007f3f69951c62 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
            #5  0x00007f3f69951e73 in start_thread () from /lib64/libpthread.so.0
            #6  0x00007f3f6950688d in clone () from /lib64/libc.so.6
            

            Other observations:

            • the core file size is 13 GB
            • no "thread leak" observed/reported
            • nothing spectacular in the logfile

            Investigation status

            Has not been investigated.

            Show
            gapon Igor Gaponenko added a comment - - edited 2021-05-07: The "small" cluster at NCSA, worker db33 Qserv version qserv/qserv:deps_20201210_0904 Replication system's version qserv/replica:tools- DM-27681 Qserv start time (from the log file) 2021-04-08 15:53:04 Qserv crash time (from the logfile) 2021-05-07T20:14:40 Logfile location /qserv/qserv-prod/log/logs.210507.1/xrootd.log Core file location (the host OS filesystem) /qserv/qserv-prod/core-files/core.217565 Crash stack: Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00007f3f6948decc in free () from /lib64/libc.so.6 [Current thread is 1 (Thread 0x7f3f3bdee700 (LWP 473201))] (gdb) where #0 0x00007f3f6948decc in free () from /lib64/libc.so.6 #1 0x00007f3f57ab36a3 in std::deque<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_destroy_data_aux(std::_Deque_iterator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*>, std::_Deque_iterator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*>) () from /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/lib/liblog4cxx.so.10 #2 0x00007f3f57ab3731 in log4cxx::helpers::ThreadSpecificData::~ThreadSpecificData() () from /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/lib/liblog4cxx.so.10 #3 0x00007f3f57a25e0f in log4cxx::helpers::APRInitializer::tlsDestruct(void*) () from /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/lib/liblog4cxx.so.10 #4 0x00007f3f69951c62 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0 #5 0x00007f3f69951e73 in start_thread () from /lib64/libpthread.so.0 #6 0x00007f3f6950688d in clone () from /lib64/libc.so.6 Other observations: the core file size is 13 GB no "thread leak" observed/reported nothing spectacular in the logfile Investigation status Has not been investigated.
            Hide
            jgates John Gates added a comment -

            The first crash (crash time 2021-04-11T05:17:22) occured when callingĀ _mysqlConn.reset() in QueryRunner::_initConnection(). _mysqlConn should have had a value of nullptr, but the destructor was called and then the worker crashed. It should have been impossible _mysqlConn to be anything other than nullptr. Chaanges were made to explicitly initialize _mysqlConn to nullptr, add a check to make sure QueryRunner::runQuery() is only called once per instance, and some log statements to hopefully provide a more information if this happens again.

            Show
            jgates John Gates added a comment - The first crash (crash time 2021-04-11T05:17:22) occured when callingĀ _mysqlConn.reset() in QueryRunner::_initConnection(). _mysqlConn should have had a value of nullptr, but the destructor was called and then the worker crashed. It should have been impossible _mysqlConn to be anything other than nullptr. Chaanges were made to explicitly initialize _mysqlConn to nullptr, add a check to make sure QueryRunner::runQuery() is only called once per instance, and some log statements to hopefully provide a more information if this happens again.

              People

              Assignee:
              gapon Igor Gaponenko
              Reporter:
              gapon Igor Gaponenko
              Watchers:
              Andy Hanushevsky, Fritz Mueller, Igor Gaponenko, John Gates
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated: