DM-26574: Worker crash in wpublish::GetStatusCommand

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None

      Description

      A worker crashed while 10 concurrent shared scans (5 medium, 5 fast) were being dispatched and monitored (via the dashboard) on the large Qserv cluster at NCSA. The container version was qserv/qserv:tickets_DM-26207. The stack trace follows:

      #0  0x00007ffb1a871337 in raise () from /lib64/libc.so.6
      #1  0x00007ffb1a872a28 in abort () from /lib64/libc.so.6
      #2  0x00007ffb1acb681c in __gnu_cxx::__verbose_terminate_handler ()
          at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
      #3  0x00007ffb1acb4f19 in __cxxabiv1::__terminate (handler=<optimized out>)
          at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
      #4  0x00007ffb1acb4f4f in std::terminate ()
          at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
      #5  0x00007ffb1acb512c in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7ffa60184000, 
          tinfo=tinfo@entry=0x7ffb0c6b8f80 <typeinfo for nlohmann::detail::type_error>, dest=dest@entry=
          0x7ffb0c5ae21c <nlohmann::detail::type_error::~type_error()>)
          at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
      #6  0x00007ffb0c604162 in nlohmann::detail::serializer<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> >::dump_escaped (ensure_ascii=false, s=..., this=0x7ffaedff2a70)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/include/nlohmann/json.hpp:14169
      #7  nlohmann::detail::serializer<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> >::dump (this=this@entry=0x7ffaedff2a70, val=..., pretty_print=pretty_print@entry=false, 
          ensure_ascii=ensure_ascii@entry=false, indent_step=indent_step@entry=0, current_indent=current_indent@entry=0)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/include/nlohmann/json.hpp:13979
      #8  0x00007ffb0c603d74 in nlohmann::detail::serializer<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> >::dump (this=this@entry=0x7ffaedff2a70, val=..., pretty_print=pretty_print@entry=false, 
          ensure_ascii=ensure_ascii@entry=false, indent_step=indent_step@entry=0, current_indent=current_indent@entry=0)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/include/nlohmann/json.hpp:13962
      #9  0x00007ffb0c603d74 in nlohmann::detail::serializer<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> >::dump (this=this@entry=0x7ffaedff2a70, val=..., pretty_print=pretty_print@entry=false, 
          ensure_ascii=ensure_ascii@entry=false, indent_step=indent_step@entry=0, current_indent=current_indent@entry=0)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/include/nlohmann/json.hpp:13962
      #10 0x00007ffb0c60393d in nlohmann::detail::serializer<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> >::dump (this=this@entry=0x7ffaedff2a70, val=..., pretty_print=pretty_print@entry=false, 
          ensure_ascii=ensure_ascii@entry=false, indent_step=indent_step@entry=0, current_indent=current_indent@entry=0)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/include/nlohmann/json.hpp:13909
      #11 0x00007ffb0c5ff2c1 in nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer>::dump (error_handler=nlohmann::detail::error_handler_t::strict, ensure_ascii=false, indent_char=32 ' ', indent=-1, this=<optimized out>, this=<optimized out>, this=<optimized out>, this=<optimized out>, this=<optimized out>, this=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/include/nlohmann/json.hpp:16574
      #12 lsst::qserv::wpublish::GetStatusCommand::run (this=<optimized out>) at core/modules/wpublish/GetStatusCommand.cc:66
      #13 0x00007ffb0c5a7138 in lsst::qserv::wbase::WorkerCommand::<lambda(lsst::qserv::util::CmdData*)>::operator() (data=<optimized out>, __closure=<optimized out>) at core/modules/wbase/WorkerCommand.cc:51
      #14 std::_Function_handler<void(lsst::qserv::util::CmdData*), lsst::qserv::wbase::WorkerCommand::WorkerCommand(const Ptr&)::<lambda(lsst::qserv::util::CmdData*)> >::_M_invoke(const std::_Any_data &, lsst::qserv::util::CmdData *&&) (__functor=..., __args#0=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/std_function.h:316
      #15 0x00007ffb0c5a641c in std::function<void (lsst::qserv::util::CmdData*)>::operator()(lsst::qserv::util::CmdData*) const (this=<optimized out>, __args#0=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/std_function.h:706
      #16 0x00007ffb0c5a6433 in lsst::qserv::util::Command::action (this=<optimized out>, data=<optimized out>) at core/modules/util/Command.h:77
      #17 0x00007ffb0c4f624d in lsst::qserv::util::Command::runAction (data=0x55a81449d170, this=<optimized out>) at core/modules/util/Command.h:81
      #18 lsst::qserv::util::EventThread::handleCmds (this=0x55a81449d170) at core/modules/util/EventThread.cc:55
      #19 0x00007ffb0c4f87d6 in std::__invoke_impl<void, void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> (__t=<optimized out>, __f=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/invoke.h:73
      #20 std::__invoke<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> (__fn=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/invoke.h:95
      #21 std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> >::_M_invoke<0ul, 1ul> (this=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:234
      #22 std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> >::operator() (this=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:243
      #23 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (lsst::qserv::util::EventThread::*)(), lsst::qserv::util::EventThread*> > >::_M_run (this=<optimized out>) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/thread:186
      #24 0x00007ffb1acd1163 in std::execute_native_thread_routine (__p=0x55a81449ce70) at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
      #25 0x00007ffb1ad84e65 in start_thread () from /lib64/libpthread.so.0
      #26 0x00007ffb1a93988d in clone () from /lib64/libc.so.6
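
Frames #4 and #5 show std::terminate being reached directly from __cxa_throw, which is what happens when a thrown exception finds no handler anywhere on the thread's stack. In nlohmann/json, dump_escaped with the strict error handler throws type_error when a string is not valid UTF-8, and that exception type derives from std::exception. The following is a minimal sketch, not the Qserv code, of how a command runner could contain such an exception instead of aborting the process. The helper name runGuarded and the error message format are assumptions made for illustration only.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Qserv): runs an action and converts any
// exception into an error message instead of letting it escape the thread,
// which would otherwise end in std::terminate as in the trace above.
// Returns an empty string on success.
std::string runGuarded(std::function<void()> const& action) {
    try {
        action();
        return std::string();
    } catch (std::exception const& ex) {
        // Covers nlohmann::detail::type_error, which derives from std::exception.
        return std::string("command failed: ") + ex.what();
    } catch (...) {
        return "command failed: unknown exception";
    }
}
```

A guard like this only keeps the worker alive and reports the failure. It does not explain why the serialized string was invalid in the first place.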
      

            Activity

            gapon Igor Gaponenko added a comment -

            A race condition has been found in the implementation of the method:

            // File: core/modules/wpublish/ResourceMonitor.cc
            json ResourceMonitor::statusToJson() const {
                json result = json::array();
                for (auto&& entry: _resourceCounter) {
                    result.push_back({entry.first, entry.second});
                }
                return result;
            }
            

            The method iterates over _resourceCounter, a data structure that other methods of the class can modify when called from other threads, without any synchronization. A concurrent modification during the loop can corrupt the resulting JSON object.
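
One common way to close a race like this is to guard every access to the shared container with the same mutex and to copy the entries while the lock is held. The sketch below illustrates that pattern under stated assumptions and is not the change made in the PR. The class name CounterMonitor, the methods increment, decrement and snapshot, the member _mtx, and the container type std::map<std::string, unsigned int> are all hypothetical. A plain vector of pairs stands in for the JSON array so that the example needs only the standard library.

```cpp
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ResourceMonitor. Only _resourceCounter is a name
// taken from the ticket; everything else is an illustrative assumption.
class CounterMonitor {
public:
    using Snapshot = std::vector<std::pair<std::string, unsigned int>>;

    // Writers take the same mutex as the reader below.
    void increment(std::string const& resource) {
        std::lock_guard<std::mutex> lock(_mtx);
        ++_resourceCounter[resource];
    }

    void decrement(std::string const& resource) {
        std::lock_guard<std::mutex> lock(_mtx);
        auto itr = _resourceCounter.find(resource);
        if (itr == _resourceCounter.end()) return;
        // Drop the entry once its count reaches zero.
        if (--(itr->second) == 0) _resourceCounter.erase(itr);
    }

    // Copy the entries while holding the lock, so no writer can modify the
    // map in the middle of the iteration. The caller can serialize the
    // returned copy without further locking.
    Snapshot snapshot() const {
        std::lock_guard<std::mutex> lock(_mtx);
        return Snapshot(_resourceCounter.begin(), _resourceCounter.end());
    }

private:
    mutable std::mutex _mtx;  // mutable so that const readers can lock it
    std::map<std::string, unsigned int> _resourceCounter;
};
```

Returning a copy keeps the critical section short: the lock is held only for the duration of the copy, and any slower serialization work happens afterwards on data that no other thread can touch.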

            gapon Igor Gaponenko added a comment -

            PR: https://github.com/lsst/qserv/pull/619

            gapon Igor Gaponenko added a comment -

            Reviewed by John Gates


              People

              Assignee:
              gapon Igor Gaponenko
              Reporter:
              fritzm Fritz Mueller
              Reviewers:
              John Gates
              Watchers:
              Fritz Mueller, Igor Gaponenko, John Gates