Data Management / DM-27240

Master controller crashes in XRootD stack


    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels: None

      Description

      The controller crashed while communicating with XRootD services and reported:

      Program terminated with signal SIGSEGV, Segmentation fault.
      

      The stack of the crash is presented below:

      (gdb) where
      #0  XrdCl::PollerBuiltIn::RemoveSocket (this=0x559a5c5f5230, socket=0x7f52990d8380)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClPollerBuiltIn.cc:330
      #1  0x00007f5611f8c502 in XrdCl::AsyncSocketHandler::Close (this=0x7f52990eb5f0)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClAsyncSocketHandler.cc:197
      #2  0x00007f5611f20419 in XrdCl::Stream::ForceError (this=0x7f52990290b0, status=...)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClStream.cc:863
      #3  0x00007f5611f1d7cf in XrdCl::Channel::ForceDisconnect (this=0x7f52990eb180)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClChannel.cc:362
      #4  0x00007f5611f1bed9 in XrdCl::PostMaster::ForceDisconnect (this=0x559a5c5f3370, url=...)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClPostMaster.cc:292
      #5  0x00007f5611f21742 in XrdCl::Stream::OnReadTimeout (this=0x7f52990290b0, substream=<optimized out>, isBroken=@0x7f560c981cb7: false)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClStream.cc:1013
      #6  0x00007f5611f8d1f1 in XrdCl::AsyncSocketHandler::OnReadTimeout (this=0x7f52990eb5f0)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClAsyncSocketHandler.cc:932
      #7  0x00007f5611f8ef3d in XrdCl::AsyncSocketHandler::Event (this=0x7f52990eb5f0, type=<optimized out>)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClAsyncSocketHandler.cc:243
      #8  0x00007f5611f19968 in (anonymous namespace)::SocketCallBack::Event (this=0x7f5298a503f0, chP=<optimized out>, cbArg=<optimized out>, evFlags=<optimized out>)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClPollerBuiltIn.cc:82
      #9  0x00007f5610057024 in XrdSys::IOEvents::Poller::CbkXeq (this=this@entry=0x559a5c5f9010, cP=cP@entry=0x7f52990282d0, events=events@entry=2, eNum=<optimized out>, 
          eNum@entry=0, eTxt=eTxt@entry=0x0)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSys/XrdSysIOEvents.cc:693
      #10 0x00007f5610057731 in XrdSys::IOEvents::Poller::CbkTMO (this=0x559a5c5f9010)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSys/XrdSysIOEvents.cc:598
      #11 0x00007f5610058849 in XrdSys::IOEvents::PollE::Begin (this=0x559a5c5f9010, syncsem=<optimized out>, retcode=<optimized out>, eTxt=<optimized out>)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/./XrdSys/XrdSysIOEventsPollE.icc:217
      #12 0x00007f561005496e in XrdSys::IOEvents::BootStrap::Start (parg=0x7ffc39f32d00)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSys/XrdSysIOEvents.cc:131
      #13 0x00007f561005db08 in XrdSysThread_Xeq (myargs=0x559a5c5f90e0)
          at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSys/XrdSysPthread.cc:86
      #14 0x00007f56104b9e65 in start_thread () from /lib64/libpthread.so.0
      #15 0x00007f56101e288d in clone () from /lib64/libc.so.6
      

      The crash happened in the replication system's container qserv/replica:tools-DM-27091, based on qserv/qserv:deps_20200901_0156.
      The core file was captured in the container snapshot qserv/replica:tools-DM-27091-xrootd-crash. This container already has gdb installed.
      Here is how to obtain the crash stack:

      docker pull qserv/replica:tools-DM-27091-xrootd-crash
      docker run -it --rm --privileged --cap-add=SYS_PTRACE qserv/replica:tools-DM-27091-xrootd-crash lsst bash
      

      The last command will launch bash inside the container in the LSST Stack environment. At this step one will also see the following complaints, which can be safely ignored:

      ERROR: This cross-compiler package contains no program /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/bin/x86_64-conda-linux-gnu-cc
      ERROR: This cross-compiler package contains no program /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/bin/x86_64-conda-linux-gnu-gfortran
      ERROR: This cross-compiler package contains no program /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/bin/x86_64-conda-linux-gnu-c++
      ERROR: activate-gxx_linux-64.sh failed, see above for details
      

      Launch gdb inside the container:

      gdb /qserv/bin/qserv-replica-master-http /tmp/core.439
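
      Once gdb has loaded the executable and the core file, the backtrace shown at the top of this description can be obtained with the `where` command (an alias of `bt`):

      (gdb) where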
      

        Attachments

          Activity

          abh Andy Hanushevsky added a comment -

          In closing (as I really have to get to sleep), I would stay away from any package that deems itself as the "master of the universe". It's a sure way to lock yourself into hell. The SSI framework was never written that way. The package simply assumes that there is a transport that provides the needed functionality (and most transports do). So, you could substitute any transport you wanted by simply coding to the virtual interface. That was the whole idea (and requirement by Daniel) that we didn't get locked into any particular underlying framework. The application did not need to change just because you decided to use x instead of xrootd. In this sense I think the SSI framework succeeded quite admirably.

          abh Andy Hanushevsky added a comment -

           Hi mgmt. Ah, there is mutual respect here. I wouldn't be arguing with Igor if I didn't respect his comments. I would simply ignore him. This is the Slavic way of doing things.

          abh Andy Hanushevsky added a comment -

           mgmt: Yes, I was likely more inflammatory than necessary. After reading through the posts again I see that Igor was actually looking for a timeout solution (didn't get past the azio comment I suppose). Anyway, I think we probably resolved this, maybe. I will check back with Igor.

           gapon Igor Gaponenko added a comment - edited

          Reproducing the problem

           A special test application has been implemented to study the problem in a context isolated from the Master Controller. The application launches a specified number of "echo"-type requests to a worker while limiting the life expectancy of each request. An additional option caps the number of "in-flight" requests at any moment. The general idea is to load the worker (or the xrootd/ssi network) to a level at which the processing time of requests exceeds the specified timeout (the "life expectancy"), resulting in client-side request cancellation. This creates the desired condition in which the ssi callback is invoked on a request that has already been deleted due to its expiration.

           The application acts like a unit test, though unlike unit tests it requires a running worker server. It will be used to validate an improved version of the code.
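
           For illustration, here is a minimal C++ sketch of the lifetime hazard described above (hypothetical code, not the actual Qserv classes): an asynchronous runtime holds a callback into a request, the request is deleted upon expiration, and the callback then touches freed memory. The usual mitigation, also sketched, is to bind a std::shared_ptr into the callback so that expiration cannot destroy the request while a callback on it is still pending.

           #include <functional>
           #include <iostream>
           #include <memory>
           #include <string>

           // Hypothetical request class (not the actual Qserv code): the runtime
           // stores a callback and may invoke it after the request has expired.
           class EchoRequest: public std::enable_shared_from_this<EchoRequest> {
           public:
               explicit EchoRequest(std::string data): _data(std::move(data)) {}

               // UNSAFE: captures a raw 'this'; if the request is deleted upon
               // expiration before the callback fires, the callback touches
               // freed memory -- the condition reproduced by the test application.
               std::function<void()> unsafeCallback() {
                   return [this]() { std::cout << "echoed: " << _data << "\n"; };
               }

               // SAFE: the callback co-owns the request via a shared_ptr, so the
               // object stays alive until the last pending callback has run.
               std::function<void()> safeCallback() {
                   auto self = shared_from_this();
                   return [self]() { std::cout << "echoed: " << self->_data << "\n"; };
               }

           private:
               std::string _data;
           };

           int main() {
               std::function<void()> pending;
               {
                   auto request = std::make_shared<EchoRequest>("TEST-STRING");
                   pending = request->safeCallback();
               }   // 'request' expires here; unsafeCallback() would now dangle
               pending();   // still valid: the lambda keeps the request alive
           }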

          Running the test

           Here is how the application is run from inside the Replication container (insignificant options are omitted):

           /qserv/bin/qserv-replica-worker-ping \
               db01 \
               "TEST-STRING-TO-BE-ECHOED-BACK-AND-MAKE-IT-AS-LONG-AS-POSSIBLE" \
               --expiration-ival-sec=1 \
               --num-requests=2000000 \
               --max-requests=100000
          

           Where:

           option               description
           expiration-ival-sec  The expiration interval ("life expectancy") of each request, in seconds
           num-requests         The total number of requests to be processed
           max-requests         The maximum number of "in-flight" requests at any given time

          Results

           Running the command shown above (with the specified parameter values) results in the following crashes:

          [libprotobuf ERROR google/protobuf/wire_format_lite.cc:577] String field 'lsst.qserv.proto.WorkerCommandTestEchoR.value' contains invalid UTF-8 data when parsing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes. 
          terminate called after throwing an instance of 'std::logic_error'
            what():  QservMgtRequest::state2string(ExtendedState)  incomplete implementation
          Aborted (core dumped)
          

          terminate called after throwing an instance of 'std::logic_error'
            what():  QservMgtRequest::state2string(State)  incomplete implementation
          Aborted (core dumped)
          

           Further analysis of the core files has shown that the request objects were deleted while callback functions on those requests were being initiated by the xrootd/ssi runtime.
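
           The "incomplete implementation" aborts are consistent with this: a state2string-style helper along the hypothetical lines below (not the actual Qserv implementation) throws exactly such a std::logic_error when handed an enum value that matches no case, for example garbage read from a deleted request object.

           #include <stdexcept>
           #include <string>

           // Hypothetical sketch of a state2string-style helper: a corrupted
           // State value read from freed memory matches no case and falls
           // through to the throw, producing the abort seen above.
           enum class State {CREATED, IN_PROGRESS, FINISHED};

           std::string state2string(State state) {
               switch (state) {
                   case State::CREATED:     return "CREATED";
                   case State::IN_PROGRESS: return "IN_PROGRESS";
                   case State::FINISHED:    return "FINISHED";
               }
               throw std::logic_error("state2string(State)  incomplete implementation");
           }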

          jgates John Gates added a comment -

          Looks good, just a minor comment.


            People

            Assignee:
            gapon Igor Gaponenko
            Reporter:
            gapon Igor Gaponenko
            Reviewers:
            John Gates
            Watchers:
            Andy Hanushevsky, Fritz Mueller, Igor Gaponenko, John Gates, Nate Pease
            Votes:
            0

              Dates

              Created:
              Updated:
              Resolved: