Data Management / DM-27781

Fix a possible race condition or memory management issues in Qserv query requests processor


    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None
    • Story Points:
      8
    • Sprint:
      DB_F20_09, DB_S21_12
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      1 Observations and testing approaches

      A number of crashes in Qserv workers have been observed. The crashes originate from class qdisp::QueryRequest while processing (or sending back to the czar) results of the so-called "large result" queries. All crashes happened while deallocating objects (mostly strings). Most crashes were easily reproducible.

      It has also been observed that the frequency of the crashes depends on the following factors:

      • the logging level at workers (more logging tends to hide or delay crashes due to internal serialization within the LSST Logger implementation).
      • the load on the workers (the higher the load, the more frequent the crashes).

      In the worst-case scenario, crashes occurred at a typical rate of about one worker per hour.

      1.1 Testing

      All investigations were conducted in the "large" Qserv cluster at NCSA (the cluster has 1 master and 30 worker nodes). To investigate the crashes, the operating systems of the worker nodes were configured to dump core files at the following location:

      /tmp/core-file/
      

      To preserve core files across container restarts, this path (local to the Docker containers) was mapped to the following location on the host OS:

      /qserv/qserv-prod/core-file/
      

      For testing the crashes, the same set of 16 queries was launched from the master node as shown below:

      for i in $(seq 1 16); do
          mysql --protocol=tcp -P4040 -uqsmaster -e "SELECT * FROM wise_00.allwise_p3as_psd LIMIT 10" >& /dev/null &
      done
      wait
      

      In the current Qserv implementation, each such query results in workers executing 146333 chunk queries and returning 10 rows of data from each chunk to the Qserv czar. The approximate amount of data in each such (partial) result set was on the order of 25 KB. After collecting all partial results, the czar discarded all but 10 (randomly selected) rows.
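
      At roughly 25 KB per chunk, each such query therefore moves on the order of 146333 × 25 KB ≈ 3.5 GB of partial results from the workers to the czar, or about 56 GB for all 16 concurrent queries, nearly all of which is then discarded.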

      2 The crash caused by a possible race condition

      This crash was the first one that triggered this investigation. The frequency of the crash was not fully characterized; however, it was reproduced a few times before being fixed (see below).

      Qserv crashed with the following stack:

      Program terminated with signal SIGSEGV, Segmentation fault.
      #0  0x00007f7c465e434a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f7be025be30)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:154
      154		    _M_dispose();
      [Current thread is 1 (Thread 0x7f79d0ff9700 (LWP 15003))]
      (gdb) where
      #0  0x00007f7c465e434a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f7be025be30)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:154
      #1  0x00007f7c468d9ed7 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f79d0ff8d68, __in_chrg=<optimized out>)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:684
      #2  std::__shared_ptr<lsst::qserv::qdisp::JobStatus, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f79d0ff8d60, __in_chrg=<optimized out>)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr_base.h:1123
      #3  std::shared_ptr<lsst::qserv::qdisp::JobStatus>::~shared_ptr (this=0x7f79d0ff8d60, __in_chrg=<optimized out>)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0/bits/shared_ptr.h:93
      #4  lsst::qserv::qdisp::QueryRequest::ProcessResponse (this=0x7f7be022d6d0, eInfo=..., rInfo=...) at core/modules/qdisp/QueryRequest.cc:283
      #5  0x00007f7c45e25930 in XrdSsiResponder::SetResponse (this=this@entry=0x7f7be022e5c0, strmP=strmP@entry=0x7f7be022e610)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSsi/XrdSsiResponder.cc:307
      #6  0x00007f7c45e2bc92 in XrdSsiTaskReal::XeqEvent (this=0x7f7be022e4f0, status=0x7f7b382e7f30, respP=0x7f79d0ff8e58)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSsi/XrdSsiTaskReal.cc:708
      #7  0x00007f7c45e231f9 in XrdSsiEvent::DoIt (this=0x7f7be022e4f0)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSsi/XrdSsiEvent.cc:162
      #8  0x00007f7c4547f943 in XrdScheduler::Run (this=0x561298c10bd0)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/Xrd/XrdScheduler.cc:357
      #9  0x00007f7c4547faaa in XrdStartWorking (carg=<optimized out>)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/Xrd/XrdScheduler.cc:87
      #10 0x00007f7c4543fb08 in XrdSysThread_Xeq (myargs=0x7f7878117410)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSys/XrdSysPthread.cc:86
      #11 0x00007f7c476d5e65 in start_thread () from /lib64/libpthread.so.0
      #12 0x00007f7c473fe88d in clone () from /lib64/libc.so.6
      

      This is the location of the crash in the code:

      // Must not throw exceptions: calling thread cannot trap them.
      // Callback function for XrdSsiRequest.
      //
      bool QueryRequest::ProcessResponse(XrdSsiErrInfo  const& eInfo, XrdSsiRespInfo const& rInfo) {
          QSERV_LOGCONTEXT_QUERY_JOB(_qid, _jobid);
          LOGS(_log, LOG_LVL_DEBUG, "workerName=" << GetEndPoint() << " ProcessResponse");
          std::string errorDesc = _jobIdStr + " ";
          if (isQueryCancelled()) {
              LOGS(_log, LOG_LVL_WARN, "QueryRequest::ProcessResponse job already cancelled");
              cancel(); // calls _errorFinish()
              return true;
          }
          // Make a copy of the _jobQuery shared_ptr in case _jobQuery gets reset by a call to  cancel()
          auto jq = _jobQuery;
          {   
              std::lock_guard<std::mutex> lock(_finishStatusMutex);
              if (_finishStatus != ACTIVE) {
                  LOGS(_log, LOG_LVL_WARN,
                       "QueryRequest::GetRequest called after job finished (cancelled?)");
                  return true;
          }   
          ...
          case XrdSsiRespInfo::isStream: // All remote requests
              jq->getStatus()->updateInfo(_jobIdStr, JobStatus::RESPONSE_READY);
              ^^^^^^^^^^^^^^^^^^
      

      2.1 Code analysis and the proposed solution

      The crash is possibly due to a race condition that leaves the shared pointer used in the last line of the code snippet above in an invalid state.

      The new code adds an extra check of the pointer's state after it is initialized, to see if that avoids the crash.
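
      A minimal, self-contained sketch of the guard pattern follows. The names, the mutex, and the simplified JobStatus are illustrative stand-ins, not the actual Qserv code (which lives in qdisp::QueryRequest::ProcessResponse and may differ in detail):

      #include <iostream>
      #include <memory>
      #include <mutex>

      struct JobStatus { void updateInfo() { std::cout << "status updated\n"; } };

      std::shared_ptr<JobStatus> _jobQuery = std::make_shared<JobStatus>();
      std::mutex _jobQueryMtx;

      void processResponse() {
          // Copy the shared_ptr under a lock so that a concurrent cancel()
          // resetting _jobQuery cannot race with this copy.
          std::shared_ptr<JobStatus> jq;
          {
              std::lock_guard<std::mutex> lock(_jobQueryMtx);
              jq = _jobQuery;
          }
          // The extra check of the pointer's state after initialization.
          if (jq == nullptr) {
              std::cout << "job already cancelled\n";
              return;
          }
          jq->updateInfo();  // safe: 'jq' keeps the object alive even if cancel() runs now
      }

      int main() { processResponse(); }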

      2.2 Additional improvements

      Change logging levels in the configuration files and the code to reduce noise in the log files.

      2.3 The outcome of the fix

      The fix has solved the original problem. Unfortunately, this was not the only issue with the worker implementation. See the next crash below.

      3 Crashes when deallocating strings

      This crash was the most frequent:

      Program terminated with signal SIGABRT, Aborted.
      #0  0x00007fdf83816337 in raise () from /lib64/libc.so.6
      [Current thread is 1 (Thread 0x7fdd02c3c700 (LWP 33873))]
      (gdb) where
      #0  0x00007fdf83816337 in raise () from /lib64/libc.so.6
      #1  0x00007fdf83817a28 in abort () from /lib64/libc.so.6
      #2  0x00007fdf83858e87 in __libc_message () from /lib64/libc.so.6
      #3  0x00007fdf83861679 in _int_free () from /lib64/libc.so.6
      #4  0x00007fdf6a01572a in __gnu_cxx::new_allocator<char>::deallocate (__p=<optimized out>, this=0x7fde0c13d830)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/ext/new_allocator.h:119
      #5  std::allocator_traits<std::allocator<char> >::deallocate (__n=<optimized out>, __p=<optimized out>, __a=...)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/alloc_traits.h:470
      #6  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_destroy (__size=<optimized out>, this=0x7fde0c13d830)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/basic_string.h:237
      #7  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose (this=0x7fde0c13d830)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/basic_string.h:232
      #8  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string (this=0x7fde0c13d830, __in_chrg=<optimized out>)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/basic_string.h:658
      #9  google::protobuf::internal::ArenaStringPtr::DestroyNoArena (default_value=<optimized out>, this=0x7fde0c13d7e8)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/include/google/protobuf/arenastring.h:358
      #10 lsst::qserv::proto::ColumnSchema::SharedDtor (this=0x7fde0c13d7c0) at build/proto/worker.pb.cc:3342
      #11 lsst::qserv::proto::ColumnSchema::~ColumnSchema (this=0x7fde0c13d7c0, __in_chrg=<optimized out>) at build/proto/worker.pb.cc:3334
      #12 0x00007fdf6a015abd in lsst::qserv::proto::ColumnSchema::~ColumnSchema (this=0x7fde0c13d7c0, __in_chrg=<optimized out>) at build/proto/worker.pb.cc:3332
      #13 0x00007fdf6a02e9a0 in google::protobuf::internal::GenericTypeHandler<lsst::qserv::proto::ColumnSchema>::Delete (arena=0x0, value=<optimized out>)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/include/google/protobuf/repeated_field.h:1657
      #14 google::protobuf::internal::RepeatedPtrFieldBase::Destroy<google::protobuf::RepeatedPtrField<lsst::qserv::proto::ColumnSchema>::TypeHandler> (this=this@entry=0x7fde0c11c340)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/include/google/protobuf/repeated_field.h:1662
      #15 0x00007fdf6a02e9d7 in google::protobuf::RepeatedPtrField<lsst::qserv::proto::ColumnSchema>::~RepeatedPtrField (this=0x7fde0c11c340, __in_chrg=<optimized out>)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/include/google/protobuf/repeated_field.h:2121
      #16 0x00007fdf6a02a568 in lsst::qserv::proto::RowSchema::~RowSchema (this=0x7fde0c11c330, __in_chrg=<optimized out>)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/include/google/protobuf/metadata_lite.h:147
      #17 0x00007fdf6a02a5c1 in lsst::qserv::proto::RowSchema::~RowSchema (this=0x7fde0c11c330, __in_chrg=<optimized out>) at build/proto/worker.pb.cc:3694
      #18 0x00007fdf6a02ab1c in lsst::qserv::proto::Result::SharedDtor (this=0x7fde0c1fd5f0) at build/proto/worker.pb.cc:4200
      #19 lsst::qserv::proto::Result::~Result (this=0x7fde0c1fd5f0, __in_chrg=<optimized out>) at build/proto/worker.pb.cc:4193
      #20 0x00007fdf6a120791 in __gnu_cxx::new_allocator<lsst::qserv::proto::Result>::destroy<lsst::qserv::proto::Result> (__p=<optimized out>, this=<optimized out>)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/ext/new_allocator.h:151
      #21 std::allocator_traits<std::allocator<lsst::qserv::proto::Result> >::destroy<lsst::qserv::proto::Result> (__p=<optimized out>, __a=...)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/alloc_traits.h:497
      #22 std::_Sp_counted_ptr_inplace<lsst::qserv::proto::Result, std::allocator<lsst::qserv::proto::Result>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:557
      #23 0x00007fdf6a0fe0c8 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fde0c1fd5e0)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/ext/atomicity.h:69
      #24 0x00007fdf6a1151af in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1169
      #25 std::__shared_ptr<lsst::qserv::proto::Result, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1169
      #26 std::__shared_ptr<lsst::qserv::proto::Result, (__gnu_cxx::_Lock_policy)2>::reset (this=0x7fde0c215070)
          at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1287
      #27 lsst::qserv::wdb::QueryRunner::_transmit (this=this@entry=0x7fde0c214f30, last=last@entry=true, rowCount=10, tSize=<optimized out>) at core/modules/wdb/QueryRunner.cc:329
      #28 0x00007fdf6a118b85 in lsst::qserv::wdb::QueryRunner::_dispatchChannel (this=this@entry=0x7fde0c214f30) at core/modules/wdb/QueryRunner.cc:499
      #29 0x00007fdf6a119e7b in lsst::qserv::wdb::QueryRunner::runQuery (this=0x7fde0c214f30) at core/modules/wdb/QueryRunner.cc:188
      

      Observation: one strange thing about these crashes was that they all happened while deallocating "small" strings, i.e. strings of size 16 or less, for which the payload is stored within the string object itself (the small string optimization) rather than in a separate heap allocation. The strings were either inside Protobuf objects of class proto::Result, or were copied from the latter. This may suggest a possible bug in the Protobuf implementation.
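
      For reference, this is what makes the observation strange: the destructor of a string held inline by the small string optimization should not call the allocator at all, so reaching _int_free from ~basic_string suggests the string's internal pointer or capacity fields had been corrupted. A small, self-contained illustration of the optimization in libstdc++ (illustration only, not Qserv code):

      #include <iostream>
      #include <string>

      // True when the string's character buffer lives inside the
      // string object itself (the small string optimization).
      bool isInline(std::string const& s) {
          char const* obj = reinterpret_cast<char const*>(&s);
          return s.data() >= obj && s.data() < obj + sizeof(std::string);
      }

      int main() {
          std::string small("short");  // fits libstdc++'s 15-character inline buffer
          std::string large("definitely more than fifteen characters");
          std::cout << "small inline: " << isInline(small) << "\n";  // prints 1
          std::cout << "large inline: " << isInline(large) << "\n";  // prints 0 (heap-allocated)
      }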

      3.1 Upgrading Protobuf version

      The first attempt to fix the crash was to upgrade Protobuf from version 3.9.2 to the latest version in the series, 3.14.0. Unfortunately, this didn't help.

      3.2 Migrating to Protobuf Arena allocation

      This improvement has solved the problem with the crash presented above. No more crashes of this kind were seen within the first 24 hours of running Qserv under the stress tests.
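
      For context, a minimal sketch of Protobuf arena allocation follows. The header path is a stand-in, and it is assumed that worker.proto enables arenas via option cc_enable_arenas = true:

      #include <google/protobuf/arena.h>

      #include "worker.pb.h"  // stand-in for the generated Qserv message header

      void buildResult() {
          google::protobuf::Arena arena;
          // The message and every string it owns are carved out of the arena.
          auto* result =
              google::protobuf::Arena::CreateMessage<lsst::qserv::proto::Result>(&arena);
          // ... fill and serialize *result here ...
          // No per-field destructors or frees: everything allocated in 'arena'
          // is released in one step when 'arena' goes out of scope.
      }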

      4 Other crashes (observed in this context, to be investigated and fixed as separate efforts)

      These crashes are not directly related to handling the payload of the query result sets. Each will be investigated as a separate effort and reported in the comments on this ticket below. The current branch will only be used to collect statistics on the frequency of the crashes.

            Activity

            gapon Igor Gaponenko added a comment -

            Crash in lsst::log::Log::isEnabledFor

            location   | core file | origin
            qserv-db20 | core.1077 | wdb::QueryRunner::_dispatchChannel, line 497

            The crash stack:

            Program terminated with signal SIGABRT, Aborted.
            #0  0x00007fb229a75337 in raise () from /lib64/libc.so.6
            [Current thread is 1 (Thread 0x7faac46c3700 (LWP 933902))]
            (gdb) 
            (gdb) where
            #0  0x00007fb229a75337 in raise () from /lib64/libc.so.6
            #1  0x00007fb229a76a28 in abort () from /lib64/libc.so.6
            #2  0x00007fb229ab7e87 in __libc_message () from /lib64/libc.so.6
            #3  0x00007fb229abe7c4 in malloc_printerr () from /lib64/libc.so.6
            #4  0x00007fb229ac1f88 in _int_malloc () from /lib64/libc.so.6
            #5  0x00007fb229ac4adc in malloc () from /lib64/libc.so.6
            #6  0x00007fb229eba52d in operator new (sz=56)
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/new_op.cc:50
            #7  0x00007fb217e439fa in log4cxx::Level::getDebug() () from /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/lib/liblog4cxx.so.11
            #8  0x00007fb217e43e06 in log4cxx::Level::toLevel(int, log4cxx::helpers::ObjectPtrT<log4cxx::Level> const&) ()
               from /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/lib/liblog4cxx.so.11
            #9  0x00007fb217e43eba in log4cxx::Level::toLevel(int) () from /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-5c8ee05/lib/liblog4cxx.so.11
            #10 0x00007fb217f20f26 in lsst::log::Log::isEnabledFor (this=this@entry=0x7faac46c2140, level=level@entry=10000) at src/Log.cc:334
            #11 0x00007fb2143aea06 in lsst::qserv::wdb::ChunkResourceRequest::getResourceFragment (this=this@entry=0x7faac46c2700, i=i@entry=0) at core/modules/wdb/QueryRunner.cc:430
            #12 0x00007fb2143ab305 in lsst::qserv::wdb::QueryRunner::_dispatchChannel (this=this@entry=0x7fb1b07386d0) at core/modules/wdb/QueryRunner.cc:497
            #13 0x00007fb2143ad28f in lsst::qserv::wdb::QueryRunner::runQuery (this=0x7fb1b07386d0) at core/modules/wdb/QueryRunner.cc:193
            

            The crash is in the last line of this code:

            bool QueryRunner::_dispatchChannel() {
                proto::TaskMsg& m = *_task->msg;
                _initMsgs();
                bool firstResult = true;
                bool erred = false;
                int numFields = -1;
                if (m.fragment_size() < 1) {
                    throw Bug("QueryRunner: No fragments to execute in TaskMsg");
                }
                ChunkResourceRequest req(_chunkResourceMgr, m);
             
                unsigned int rowCount = 0;
                size_t tSize = 0;
             
                try {
                    for(int i=0; i < m.fragment_size(); ++i) {
                        if (_cancelled) {
                            break;
                        }
                        proto::TaskMsg_Fragment const& fragment(m.fragment(i));
                        std::vector<std::string> queries;
                        for (const std::string queryStr: fragment.query()) {
                            if (fragment.has_subchunks() && false == fragment.subchunks().id().empty()) {
                                for (auto subchunkId : fragment.subchunks().id()) {
                                    std::string s(queryStr);
                                    boost::algorithm::replace_all(s, SUBCHUNK_TAG, std::to_string(subchunkId));
                                    queries.push_back(s);
                                }
                            } else {
                                queries.push_back(queryStr);
                            }
                        }
                        ChunkResource cr(req.getResourceFragment(i));
            

            The crash is in method ChunkResourceRequest::getResourceFragment when calling the LSST Logger.

            gapon Igor Gaponenko added a comment -

            Crash in XrdSsiFileSess::open

            The crash was observed while worker db01 was processing a large number (up to 100,000 at a time) of the Replication system's requests. It could be caused by a shortage of resources. The problem will be monitored and investigated if it repeats.

            location   | core file | origin
            qserv-db01 | core.736  | XrdSsiFileSess::open, line 424

            The crash stack:

            Program terminated with signal SIGABRT, Aborted.
            #0  0x00007f54ac665337 in raise () from /lib64/libc.so.6
            [Current thread is 1 (Thread 0x7f4e86333700 (LWP 58438))]
            (gdb) where
            #0  0x00007f54ac665337 in raise () from /lib64/libc.so.6
            #1  0x00007f54ac666a28 in abort () from /lib64/libc.so.6
            #2  0x00007f54ac6a7e87 in __libc_message () from /lib64/libc.so.6
            #3  0x00007f54ac6ae7c4 in malloc_printerr () from /lib64/libc.so.6
            #4  0x00007f54ac6b1f88 in _int_malloc () from /lib64/libc.so.6
            #5  0x00007f54ac6b4adc in malloc () from /lib64/libc.so.6
            #6  0x00007f54ac6bbaaa in strdup () from /lib64/libc.so.6
            #7  0x00007f54980af459 in XrdSsiFileSess::open (this=0x7f5371bce3e0, path=path@entry=0x7f5320583800 "/worker/db01", theEnv=..., open_mode=open_mode@entry=0)
                at /qserv/stack/stack/miniconda3-py37_4.8.2-5c8ee05/EupsBuildDir/Linux64/xrootd-lsst-dev-2-gfb72ea7633/xrootd-lsst-dev-2-gfb72ea7633/src/XrdSsi/XrdSsiFileSess.cc:424
            #8  0x00007f54980aba5b in XrdSsiFile::open (this=0x7f533177dd00, path=0x7f5320583800 "/worker/db01", open_mode=0, Mode=<optimized out>, client=<optimized out>, 
                info=<optimized out>)
                at /qserv/stack/stack/miniconda3-py37_4.8.2-5c8ee05/EupsBuildDir/Linux64/xrootd-lsst-dev-2-gfb72ea7633/xrootd-lsst-dev-2-gfb72ea7633/src/XrdSsi/XrdSsiFile.cc:249
            #9  0x00007f54acf19001 in XrdXrootdProtocol::do_Open (this=<optimized out>)
                at /qserv/stack/stack/miniconda3-py37_4.8.2-5c8ee05/EupsBuildDir/Linux64/xrootd-lsst-dev-2-gfb72ea7633/xrootd-lsst-dev-2-gfb72ea7633/src/XrdXrootd/XrdXrootdXeq.cc:1446
            #10 0x00007f54ace6ebb0 in XrdLinkXeq::DoIt (this=<optimized out>)
                at /qserv/stack/stack/miniconda3-py37_4.8.2-5c8ee05/EupsBuildDir/Linux64/xrootd-lsst-dev-2-gfb72ea7633/xrootd-lsst-dev-2-gfb72ea7633/src/Xrd/XrdLinkXeq.cc:302
            #11 XrdLinkXeq::DoIt (this=0x7f53f09be830)
                at /qserv/stack/stack/miniconda3-py37_4.8.2-5c8ee05/EupsBuildDir/Linux64/xrootd-lsst-dev-2-gfb72ea7633/xrootd-lsst-dev-2-gfb72ea7633/src/Xrd/XrdLinkXeq.cc:290
            #12 0x00007f54ace71ea0 in XrdScheduler::Run (this=0x7f54aceb66c0 <XrdGlobal::Sched>)
                at /qserv/stack/stack/miniconda3-py37_4.8.2-5c8ee05/EupsBuildDir/Linux64/xrootd-lsst-dev-2-gfb72ea7633/xrootd-lsst-dev-2-gfb72ea7633/src/Xrd/XrdScheduler.cc:382
            #13 0x00007f54ace71fea in XrdStartWorking (carg=<optimized out>)
                at /qserv/stack/stack/miniconda3-py37_4.8.2-5c8ee05/EupsBuildDir/Linux64/xrootd-lsst-dev-2-gfb72ea7633/xrootd-lsst-dev-2-gfb72ea7633/src/Xrd/XrdScheduler.cc:88
            #14 0x00007f54ace2f429 in XrdSysThread_Xeq (myargs=0x7f544cd67010)
                at /qserv/stack/stack/miniconda3-py37_4.8.2-5c8ee05/EupsBuildDir/Linux64/xrootd-lsst-dev-2-gfb72ea7633/xrootd-lsst-dev-2-gfb72ea7633/src/XrdSys/XrdSysPthread.cc:86
            

            jgates John Gates added a comment -

            If workers are still crashing, it's probably worth opening a new ticket with a link back to this one and logging crashes that occur while using arenas. Just a minor comment about removing unused code. Looks good.

            abh Andy Hanushevsky added a comment -

            This last crash can't be due to resource constraints. If that were the case, malloc would not try to print a message. I suspect the message is about heap corruption but it would be good to know what it was.

            gapon Igor Gaponenko added a comment - edited

            That's possible. Note that the Replication system's code (which runs within Qserv workers) hasn't been migrated to Protobuf Arena; I'm planning to do that as a separate effort. There is a possible correlation between the kind of (stress) tests I was running specifically against the crashed worker and the origin of the crash. I'm going to repeat my tests to see if I can reproduce that last crash.

            Quite frankly, I still don't understand why migrating to Arena stopped the earlier observed crashes (in the query processing code). Perhaps the problem still exists and has merely been hidden by the change.


              People

              Assignee:
              gapon Igor Gaponenko
              Reporter:
              gapon Igor Gaponenko
              Reviewers:
              John Gates
              Watchers:
              Andy Hanushevsky, Fabrice Jammes, Fritz Mueller, Igor Gaponenko, John Gates, Nate Pease
