Data Management / DM-25986

Master replication controller crashes in the Kubernetes environment
    Details

    • Type: Bug
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels: None

      Description

      Repeated crashes of the Master Replication Controller have been reported at IN2P3 in the Kubernetes-based Qserv installation, which also includes the Replication/Ingest system. The crashes are seen in two scenarios:

      • when starting the Qserv instance, the Controller may crash a few times before reaching a stable state
      • when ingesting new catalogs (typically when committing super-transactions)

      In both cases, the Controller reported the following messages:

      terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
        what():  cancel: Bad file descriptor
      /config-start/start.sh: line 83:    10 Aborted                 (core dumped) qserv-replica-master-http ${PARAMETERS} --config="${CONFIG}" --qserv-db-password="${MYSQL_ROOT_PASSWORD}"
      

      The goal of this effort is to investigate the cause of the crash and to fix/harden the Controller's implementation.

        Attachments

          Issue Links

            Activity

            Igor Gaponenko added a comment -

            Fabrice Jammes yes, I'm aware of this problem. Apparently, the current implementation of the worker Ingest services has a race condition. Unfortunately, I have not been able to identify its source yet. In the meantime, I have registered this problem as issue DM-26034, which is also linked to your main ticket DM-24587.

            Igor Gaponenko added a comment - edited

            Found the source of the crash

            The crash in the Controller has been reproduced in a non-Kubernetes environment by:

            • configuring the Replication system with an extra worker on a node that does not exist in DNS, and
            • launching the Controller via gdb

            Here is a fragment of the gdb stack trace at the point of the crash, which happened shortly after starting the application:

            Thread 106 "qserv-replica-m" received signal SIGABRT, Aborted.
            [Switching to Thread 0x7fec72343700 (LWP 3819)]
            0x00007feca9d9e337 in raise () from /lib64/libc.so.6
            (gdb) where
            #0  0x00007feca9d9e337 in raise () from /lib64/libc.so.6
            #1  0x00007feca9d9fa28 in abort () from /lib64/libc.so.6
            #2  0x00007fecaa71581c in __gnu_cxx::__verbose_terminate_handler ()
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
            #3  0x00007fecaa713f19 in __cxxabiv1::__terminate (handler=<optimized out>)
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
            #4  0x00007fecaa713f4f in std::terminate ()
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
            #5  0x00007fecaa71412c in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7fea68004d00, tinfo=tinfo@entry=0x7fecab979e08 <typeinfo for boost::wrapexcept<boost::system::system_error>>, 
                dest=dest@entry=0x7fecab44fb72 <boost::wrapexcept<boost::system::system_error>::~wrapexcept()>)
                at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638331887/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
            #6  0x00007fecab455499 in boost::throw_exception<boost::system::system_error> (e=...) at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/include/boost/throw_exception.hpp:70
            #7  0x00007fecab53be19 in boost::asio::detail::do_throw_error (err=..., location=location@entry=0x7fecab87c734 "cancel")
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/include/boost/asio/detail/impl/throw_error.ipp:38
            #8  0x00007fecab6569e5 in boost::asio::detail::throw_error (location=0x7fecab87c734 "cancel", err=...)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/include/boost/asio/detail/throw_error.hpp:42
            #9  boost::asio::basic_socket<boost::asio::ip::tcp, boost::asio::executor>::cancel (this=this@entry=0x55a32c8d7528)
                at /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/include/boost/asio/basic_socket.hpp:649
            #10 0x00007fecab651f27 in lsst::qserv::replica::MessengerConnector::_restart (this=this@entry=0x55a32c8d7360, lock=...) at core/modules/replica/MessengerConnector.cc:229
            #11 0x00007fecab652755 in lsst::qserv::replica::MessengerConnector::_awakenForRestart (this=0x55a32c8d7360, ec=...) at core/modules/replica/MessengerConnector.cc:338
            ...
            
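            The trace shows the exception originating from the throwing overload of boost::asio::basic_socket::cancel() (frame #9), invoked from MessengerConnector::_restart (frame #10). In Boost.ASIO, calling cancel() on a socket that was never opened sets boost::asio::error::bad_descriptor, and the throwing overload converts that into boost::system::system_error. A minimal stand-alone sketch of the mechanism (an illustration, not Qserv code):

            #include <iostream>
            #include <boost/asio.hpp>

            int main() {
                boost::asio::io_context io;

                // A socket that has never been opened, e.g. because the DNS
                // resolution of the peer's host name failed earlier.
                boost::asio::ip::tcp::socket socket(io);

                try {
                    socket.cancel();  // throwing overload
                } catch (boost::system::system_error const& e) {
                    std::cout << e.what() << std::endl;  // "cancel: Bad file descriptor"
                }
                return 0;
            }

            In the Controller this call was not guarded, and the exception escaped an ASIO handler thread, so std::terminate() aborted the process.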

            Further analysis of the code revealed a bug in the implementation of the method:

            lsst::qserv::replica::MessengerConnector::_restart
            

            The method didn't properly handle missing DNS entries: when a worker's host name fails to resolve, the connection is never established, and the subsequent cancel() on the unopened socket raises the exception seen above instead of being treated as a recoverable condition.
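            A minimal sketch of the defensive pattern that avoids the abort (an illustration only; the member name _socket is hypothetical, and the actual change is in the PR linked below):

            // Hypothetical fragment of MessengerConnector::_restart(): use the
            // non-throwing error_code overload so that a socket that was never
            // opened (e.g. due to a failed DNS lookup) doesn't kill the process.
            boost::system::error_code ec;
            if (_socket.is_open()) {
                _socket.cancel(ec);
                if (ec) {
                    // Log the error and proceed with the restart logic instead of
                    // letting boost::system::system_error escape the handler.
                }
            }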

            Here is a recipe for reproducing the crash at NCSA (the "big" Qserv cluster).

            Add the following "fake" worker to the configuration:

            INSERT INTO config_worker VALUES(
                'db40',1,0,
                'lsst-qserv-db40',NULL,
                'lsst-qserv-db40',NULL,NULL,
                'localhost',NULL,NULL,
                'lsst-qserv-db40',NULL,NULL,
                'lsst-qserv-db40',NULL,NULL
            );
            

            Start the Docker container in interactive mode with the LSST Stack environment to allow debugging:

            docker run -it \
                --privileged --cap-add=SYS_PTRACE \
                --network host \
                --name qserv-replica-master-http \
                -u 1000:1000 -v /etc/passwd:/etc/passwd:ro \
                -v /qserv/qservprod/replication/work:/qserv/replication/work \
                -v /qserv/qserv-prod/data/qserv:/qserv/data/qserv \
                -v /qserv/qserv-prod/data/ingest:/qserv/data/ingest \
                -v /qserv/qserv-prod/replication/config:/qserv/replication/config:ro \
                -v /qserv/qserv-prod/replication/log:/qserv/replication/log \
                -e LSST_LOG_CONFIG=/qserv/replication/config/log4cxx.replication.properties \
                qserv/replica:tools-DM-25961 lsst bash
            

            Inside the container:

            conda install gdb
            gdb /qserv/bin/qserv-replica-master-http
            

            Inside gdb:

            run --config=mysql://qsreplica@lsst-qserv-master01:23306/qservReplica --instance-id=qserv-prod --qserv-db-password=xxx --auth-key=xxx --debug
            

            Igor Gaponenko added a comment -

            PR: https://github.com/lsst/qserv/pull/565
            Fritz Mueller added a comment -

            LGTM. Comments on PR.

            Fabrice Jammes added a comment - edited

            Igor Gaponenko, in my understanding, the `--cap-add=SYS_PTRACE` flag above is already implied by the `--privileged` flag, which enables all Linux kernel capabilities. Do you need `--privileged` for other capabilities, or only `SYS_PTRACE`?


              People

              Assignee:
              Igor Gaponenko
              Reporter:
              Igor Gaponenko
              Reviewers:
              Fritz Mueller
              Watchers:
              Dominique Boutigny, Fabrice Jammes, Fritz Mueller, Igor Gaponenko, Nate Pease, Sabine Elles
              Votes:
              0
