# Master controller crashes in XRootD stack


#### Details

• Type: Bug
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
None
• Story Points:
4
• Sprint:
DB_F20_09, DB_S21_12
• Team:
Data Access and Database
• Urgent?:
No

#### Description

The controller crashed while communicating with XRootD services and reported:

 Program terminated with signal SIGSEGV, Segmentation fault. 

The stack of the crash is presented below:

    (gdb) where
    #0  XrdCl::PollerBuiltIn::RemoveSocket (this=0x559a5c5f5230, socket=0x7f52990d8380)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClPollerBuiltIn.cc:330
    #1  0x00007f5611f8c502 in XrdCl::AsyncSocketHandler::Close (this=0x7f52990eb5f0)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClAsyncSocketHandler.cc:197
    #2  0x00007f5611f20419 in XrdCl::Stream::ForceError (this=0x7f52990290b0, status=...)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClStream.cc:863
    #3  0x00007f5611f1d7cf in XrdCl::Channel::ForceDisconnect (this=0x7f52990eb180)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClChannel.cc:362
    #4  0x00007f5611f1bed9 in XrdCl::PostMaster::ForceDisconnect (this=0x559a5c5f3370, url=...)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClPostMaster.cc:292
    #5  0x00007f5611f21742 in XrdCl::Stream::OnReadTimeout (this=0x7f52990290b0, substream=, isBroken=@0x7f560c981cb7: false)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClStream.cc:1013
    #6  0x00007f5611f8d1f1 in XrdCl::AsyncSocketHandler::OnReadTimeout (this=0x7f52990eb5f0)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClAsyncSocketHandler.cc:932
    #7  0x00007f5611f8ef3d in XrdCl::AsyncSocketHandler::Event (this=0x7f52990eb5f0, type=)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClAsyncSocketHandler.cc:243
    #8  0x00007f5611f19968 in (anonymous namespace)::SocketCallBack::Event (this=0x7f5298a503f0, chP=, cbArg=, evFlags=)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClPollerBuiltIn.cc:82
    #9  0x00007f5610057024 in XrdSys::IOEvents::Poller::CbkXeq (this=this@entry=0x559a5c5f9010, cP=cP@entry=0x7f52990282d0, events=events@entry=2, eNum=, eNum@entry=0, eTxt=eTxt@entry=0x0)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSys/XrdSysIOEvents.cc:693
    #10 0x00007f5610057731 in XrdSys::IOEvents::Poller::CbkTMO (this=0x559a5c5f9010)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSys/XrdSysIOEvents.cc:598
    #11 0x00007f5610058849 in XrdSys::IOEvents::PollE::Begin (this=0x559a5c5f9010, syncsem=, retcode=, eTxt=)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/./XrdSys/XrdSysIOEventsPollE.icc:217
    #12 0x00007f561005496e in XrdSys::IOEvents::BootStrap::Start (parg=0x7ffc39f32d00)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSys/XrdSysIOEvents.cc:131
    #13 0x00007f561005db08 in XrdSysThread_Xeq (myargs=0x559a5c5f90e0)
        at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSys/XrdSysPthread.cc:86
    #14 0x00007f56104b9e65 in start_thread () from /lib64/libpthread.so.0
    #15 0x00007f56101e288d in clone () from /lib64/libc.so.6

The crash happened in the replication system's container qserv/replica:tools-DM-27091, based on qserv/qserv:deps_20200901_0156.
The core file was captured in the container snapshot qserv/replica:tools-DM-27091-xrootd-crash. This container already has gdb installed.
Here is how to obtain the crash stack:

    docker pull qserv/replica:tools-DM-27091-xrootd-crash
    docker run -it --rm --privileged --cap-add=SYS_PTRACE qserv/replica:tools-DM-27091-xrootd-crash lsst bash

The last command launches bash inside the container in the LSST Stack environment. At this step one will also see the following complaints, which can be safely ignored:

    ERROR: This cross-compiler package contains no program /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/bin/x86_64-conda-linux-gnu-cc
    ERROR: This cross-compiler package contains no program /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/bin/x86_64-conda-linux-gnu-gfortran
    ERROR: This cross-compiler package contains no program /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/bin/x86_64-conda-linux-gnu-c++
    ERROR: activate-gxx_linux-64.sh failed, see above for details

Launch gdb inside the container:

 gdb /qserv/bin/qserv-replica-master-http /tmp/core.439 

#### Activity

Andy Hanushevsky added a comment -

In closing (as I really have to get to sleep), I would stay away from any package that deems itself as the "master of the universe". It's a sure way to lock yourself into hell. The SSI framework was never written that way. The package simply assumes that there is a transport that provides the needed functionality (and most transports do). So, you could substitute any transport you wanted by simply coding to the virtual interface. That was the whole idea (and requirement by Daniel) that we didn't get locked into any particular underlying framework. The application did not need to change just because you decided to use x instead of xrootd. In this sense I think the SSI framework succeeded quite admirably.
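
The point about coding to a virtual interface can be illustrated with a minimal C++ sketch. The names `Transport`, `LoopbackTransport`, `TaggingTransport` and `echo` are hypothetical and are not taken from the XRootD/SSI headers; the sketch only shows that application code written against an abstract interface does not change when the transport is swapped.

```cpp
#include <cassert>
#include <string>

// Hypothetical transport abstraction (illustrative only, not the SSI API).
class Transport {
public:
    virtual ~Transport() = default;
    // Send a payload and return whatever the remote side answers.
    virtual std::string roundTrip(std::string const& payload) = 0;
};

// One possible transport: returns the payload unchanged.
class LoopbackTransport : public Transport {
public:
    std::string roundTrip(std::string const& payload) override { return payload; }
};

// A different transport: marks the payload so the substitution is visible.
class TaggingTransport : public Transport {
public:
    std::string roundTrip(std::string const& payload) override {
        return "tagged:" + payload;
    }
};

// Application code depends only on the abstract interface.
inline std::string echo(Transport& transport, std::string const& message) {
    return transport.roundTrip(message);
}
```

Here `echo` stays the same whichever concrete transport is passed in, which is the property the comment describes.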

Andy Hanushevsky added a comment -

Hi mgmt. Ah, there is mutual respect here. I wouldn't be arguing with Igor if I didn't respect his comments. I would simply ignore him. This is the Slavic way of doing things.

Andy Hanushevsky added a comment -

mgmt: Yes, I was likely more inflammatory than necessary. After reading through the posts again I see that Igor was actually looking for a timeout solution  (didn't get past the azio comment I suppose). Anyway, I think we probably resolved this, maybe. I will check back with Igor.

Igor Gaponenko added a comment - - edited

# Reproducing the problem

A special test application has been implemented to study the problem in isolation from the Master Controller. The application launches a specified number of "echo"-type requests to a worker while limiting the life expectancy of each request. An additional option limits the total number of "in-flight" requests at any moment. The general idea is to load the worker (or the xrootd/ssi network) to a level at which the processing time of the requests exceeds the specified timeout (the "life expectancy"), which results in client-side request cancellation. This creates the desired condition, in which the ssi callback is invoked on a request that has already been deleted because it expired.
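
The load condition described above can be explored with a toy model. The following is only a sketch under stated assumptions, not the code of the test application: time advances in abstract ticks, the worker finishes exactly one request per tick in FIFO order, and the client keeps the in-flight window full. The names `LoadResult` and `simulateLoad` are hypothetical. A response counted as `late` is one that would arrive after the client-side expiration, which is the situation the test tries to provoke.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Outcome of the toy model: responses that arrive in time vs. after expiration.
struct LoadResult {
    std::size_t onTime = 0;
    std::size_t late = 0;
};

// Assumption: the worker completes one request per tick, oldest first.
inline LoadResult simulateLoad(std::size_t numRequests,
                               std::size_t maxRequests,
                               std::size_t expirationTicks) {
    LoadResult result;
    if (maxRequests == 0) return result;   // nothing can ever be submitted

    std::deque<std::size_t> inFlight;      // submission tick of each request
    std::size_t submitted = 0;
    for (std::size_t tick = 0; submitted < numRequests || !inFlight.empty(); ++tick) {
        // The client tops up the in-flight window.
        while (submitted < numRequests && inFlight.size() < maxRequests) {
            inFlight.push_back(tick);
            ++submitted;
        }
        // The worker finishes the oldest request at the end of this tick.
        std::size_t const age = tick + 1 - inFlight.front();
        inFlight.pop_front();
        if (age > expirationTicks) {
            ++result.late;                 // response arrives after expiration
        } else {
            ++result.onTime;
        }
    }
    return result;
}
```

In this model, widening the in-flight window relative to the expiration interval makes late responses dominate, which mirrors the intent of combining a short expiration interval with a large in-flight limit.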

The application acts like a unit test, though unlike a unit test it requires a running worker server. It will also be used to test an improved version of the code.

## Running the test

Here is how the application is run from inside the Replication container (insignificant options are omitted):

    /qserv/bin/qserv-replica-worker-ping \
        db01 \
        "TEST-STRING-TO-BE-ECHOED-BACK-AND-MAKE-IT-AS-LONG-AS-POSSIBLE" \
        --expiration-ival-sec=1 \
        --num-requests=2000000 \
        --max-requests=100000

Where:

| Option | Description |
| --- | --- |
| expiration-ival-sec | The expiration interval of the requests, in seconds |
| num-requests | The total number of requests to be processed |
| max-requests | The maximum number of "in-flight" requests at any time |

## Results

Running the above-shown command (with the specified values of the parameters) results in the following crashes:

    [libprotobuf ERROR google/protobuf/wire_format_lite.cc:577] String field 'lsst.qserv.proto.WorkerCommandTestEchoR.value' contains invalid UTF-8 data when parsing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.
    terminate called after throwing an instance of 'std::logic_error'
      what(): QservMgtRequest::state2string(ExtendedState) incomplete implementation
    Aborted (core dumped)

    terminate called after throwing an instance of 'std::logic_error'
      what(): QservMgtRequest::state2string(State) incomplete implementation
    Aborted (core dumped)

Further analysis of the core files has shown that the request objects were deleted while callback functions on those requests were being initiated by the xrootd/ssi runtime.
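
One common way to guard against this class of problem is to let the callback hold a `std::weak_ptr` to the request and check it before touching the object. The sketch below illustrates that general pattern only; it is not the fix applied in this ticket and does not use the XRootD/SSI API. `EchoRequest`, `makeCallback` and `deliverAfterExpiration` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for a client-side request (not the Qserv class).
class EchoRequest : public std::enable_shared_from_this<EchoRequest> {
public:
    static std::shared_ptr<EchoRequest> create() {
        return std::shared_ptr<EchoRequest>(new EchoRequest());
    }

    // Build the callback handed to the transport. It captures a weak_ptr,
    // so it neither extends the request's lifetime nor dangles.
    std::function<bool(std::string const&)> makeCallback() {
        std::weak_ptr<EchoRequest> weak = shared_from_this();
        return [weak](std::string const& response) {
            auto self = weak.lock();
            if (!self) return false;   // request already expired and destroyed
            self->_response = response;
            return true;
        };
    }

    std::string const& response() const { return _response; }

private:
    EchoRequest() = default;
    std::string _response;
};

// Simulates a response that arrives after the request has been cancelled.
inline bool deliverAfterExpiration() {
    auto request = EchoRequest::create();
    auto callback = request->makeCallback();
    request.reset();                   // client-side expiration destroys the request
    return callback("late");           // reports false, no freed memory is touched
}
```

The callback returns `false` when the request no longer exists, so a late response is dropped instead of dereferencing a deleted object. Whether such a pattern fits the actual transport depends on how it stores and invokes callbacks, which this sketch does not model.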

John Gates added a comment -

Looks good, just a minor comment.


#### People

Assignee:
Igor Gaponenko
Reporter:
Igor Gaponenko
Reviewers:
John Gates
Watchers:
Andy Hanushevsky, Fritz Mueller, Igor Gaponenko, John Gates, Nate Pease
