Data Management / DM-26859

Test Qserv with multiple redirectors per master node


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Qserv
    • Labels:
      None
    • Story Points:
      4
    • Sprint:
      DB_F20_09, DB_S21_12
    • Team:
      Data Access and Database
    • Urgent?:
      No

      Description

      Objectives

      This effort has two goals:

      • test whether Qserv works and remains stable in a configuration with more than one XRootD redirector per master node
      • evaluate the potential performance and stability benefits of this architecture

      All observations and results are to be documented as comments posted on this ticket or as documents attached to it.

      This ticket does not involve any code changes. Hence any review (if any) should be based on the procedures and results explained hereafter.

      Setup

      All tests will be performed ad hoc using an existing Docker container:

      • base container: deps_20200806_2043
      • Qserv container tag: qserv/qserv:tickets_DM-26609
      • XRootD version: lsst-dev-gdbde5f0c33 (obtained by running eups list | grep xroot inside the container). This corresponds to the "affinity" branch of the software: https://github.com/xrootd/xrootd/tree/affinity

      Note that the container was not built specifically for this investigation. It was used here only because it already contains the desired version of XRootD, which supports multiple redirectors per host.
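
      For example, the XRootD version quoted above can be verified from inside the running container. This is a minimal sketch, assuming the container is already running and that eups is available in its default shell environment; the container name qserv-master is a placeholder, not the actual name used in these tests:

          # Hypothetical check of the XRootD version inside the running container.
          # The container name "qserv-master" is an assumption made for illustration only.
          docker exec -it qserv-master bash -c "eups list | grep xroot"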

      Instructions from Andy H.

      I just pushed the changes to allow Igor to run multiple redirectors on the same host with host networking.
      To remind you what you need to do:
      a) Start each redirector (xrootd-cmsd pair) with a different instance name on the same machine.
      b) You specify the instance name using the -n command line option (e.g. -n redir1, -n redir2, etc).
      c) List each redirector using the all.manager directive, but now you will add a scope. So, let's say you have
        two cmsd's with instance names redir1 and redir2, running at ports 1213 and 1214, respectively. 
        The directives would be specified as:
          all.manager all host:1213%redir1
          all.manager all host:1214%redir2
      d) Each xrootd, of course, needs to have its own unique port number as well. That can either be passed via
        the command line or set with the xrd.port directive. The easiest way to do this is (for this example):
          xrd.port 1094 if named redir1
          xrd.port 1095 if named redir2
      e) There is a call in qserv to XrdSsiProvider::GetService() with one of the arguments being "contact", whose contents
        are provided by the qserv config file. The contents passed should now look like "host:1094,host:1095".
        I think this is a simple change in the server config. That's all you need to do to run as many redirectors on the same
        host as you please. Of course, you need to rebuild the container with what is in the affinity branch.
      I hope this makes it all much easier.
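
      Putting the points above together, a minimal sketch of the resulting setup is given below. It uses the example instance names and ports from the instructions; "host" stands for the master node's host name, and the config file path and exact start-up commands are assumptions, not the actual ones used in the tests:

          # Shared XRootD/cmsd configuration fragment (file path /config-etc/xrootd.cf is an assumption;
          # other required directives, such as the role settings, are omitted):
          all.manager all host:1213%redir1
          all.manager all host:1214%redir2
          xrd.port 1094 if named redir1
          xrd.port 1095 if named redir2

          # Start one xrootd/cmsd pair per redirector instance (assumed invocation):
          xrootd -n redir1 -c /config-etc/xrootd.cf &
          cmsd   -n redir1 -c /config-etc/xrootd.cf &
          xrootd -n redir2 -c /config-etc/xrootd.cf &
          cmsd   -n redir2 -c /config-etc/xrootd.cf &

      On the czar side, the "contact" string passed to XrdSsiProvider::GetService() via the Qserv config then lists both redirectors, i.e. host:1094,host:1095.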
      

        Attachments

          Activity

          Igor Gaponenko added a comment - edited

          A crash in mysql-proxy

          Qserv "czar* crashed on one occasion while processing multiple parallel queries. The crash stack reported by*gdb* has:

          Program terminated with signal SIGSEGV, Segmentation fault.
          #0  0xffffffffffffffff in ?? ()
          [Current thread is 1 (Thread 0x7f284d071740 (LWP 107124))]
          (gdb) where
          #0  0xffffffffffffffff in ?? ()
          #1  0x00007f284ae258cb in log4cxx::Logger::isInfoEnabled() const () from /qserv/stack/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-ceb6bb6/lib/liblog4cxx.so.10
          #2  0x00007f284be7952a in lsst::log::Log::isInfoEnabled (
              this=0x7f284c2063e0 <(anonymous namespace)::QservLogger(timeval const&, unsigned long, char const*, int)::myLog>)
              at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/Linux64/log/20.0.0-1-gd1c87d7+add1f556b4/include/lsst/log/Log.h:736
          #3  (anonymous namespace)::QservLogger (mtime=..., tID=107124, msg=0x55d74fa617d0 "Stopping the job manager...\n", mlen=28) at core/modules/czar/CzarConfig.cc:52
          #4  0x00007f284b58b840 in (anonymous namespace)::LogMCB::Write (this=0x55d74fb18d40, msg=...)
              at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdSsi/XrdSsiLogger.cc:111
          #5  0x00007f284ac4518b in XrdCl::Log::Say (this=this@entry=0x55d74fa89770, level=level@entry=XrdCl::Log::DebugMsg, topic=topic@entry=1024, 
              format=format@entry=0x7f284ad0ca73 "Stopping the job manager...", list=list@entry=0x7ffc50e5f1f0)
              at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClLog.cc:155
          #6  0x00007f284ac45850 in XrdCl::Log::Debug (this=this@entry=0x55d74fa89770, topic=topic@entry=1024, 
              format=format@entry=0x7f284ad0ca73 "Stopping the job manager...")
              at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClLog.cc:291
          #7  0x00007f284acda5fd in XrdCl::JobManager::Stop (this=0x55d74fa91d40)
              at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClJobManager.cc:96
          #8  0x00007f284ac639de in XrdCl::PostMaster::Stop (this=0x55d74fa906e0)
              at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClPostMaster.cc:144
          #9  0x00007f284ac5084b in XrdCl::DefaultEnv::Finalize ()
              at /qserv/stack/stack/miniconda3-py37_4.8.2-ceb6bb6/EupsBuildDir/Linux64/xrootd-lsst-dev-gdbde5f0c33/xrootd-lsst-dev-gdbde5f0c33/src/XrdCl/XrdClDefaultEnv.cc:703
          #10 0x00007f284ca9dc99 in __run_exit_handlers () from /lib64/libc.so.6
          #11 0x00007f284ca9dce7 in exit () from /lib64/libc.so.6
          #12 0x00007f284ca8650c in __libc_start_main () from /lib64/libc.so.6
          #13 0x000055d74dd2c449 in _start () at ../sysdeps/x86_64/elf/start.S:103
          

          The reason for the crash is not understood. There is a theory (suggested by colleagues) that the process called std::exit, which resulted in "forking" the process; this was intercepted by XRootD and caused the crash due to the improperly initialized state of the copy.

          A decision was made not to pursue this investigation unless a similar crash is reported again.

          Andy Salnikov added a comment - edited

          Sorry I could not stay longer for the discussion. My understanding is that there is no indication of forking, but Andy Hanushevsky said that cleanup at regular exit does the same thing as pre-fork cleanup, so this is probably the reason for the confusion. My guess about what actually happened is that the at-exit cleanup ran fine, but it tried to use the logging service, which was likely already destroyed. I don't think we can order at-exit cleanups in any reasonable way, so at-exit cleanup should be really careful about what it does and not try to use services that are potentially dead. Meaning that the logging that happens in XrdCl::JobManager::Stop should not go to the Qserv logger but to standard output. It may be easy to reproduce this crash by stopping qserv (possibly many times), as that logging message seems to be a regular message that xrootd issues at shutdown.

          Andy Hanushevsky added a comment - edited

          Two comments:

          1) The multiple redirector setup is fine as long as "all.manager all" is specified; otherwise it defaults to using the additional managers as hot spares (yeah, stupid default). As for logging services, XRootD is completely devoid of static teardown issues. So, look elsewhere when it comes to that.

          2) Be aware that XRootD merely forwards messages to the subscribed interface (log4cxx). So, if you are deleting the logging object, you really have to tell XRootD about it if it was registered as a call-back agent. That is, you need to cancel the callback before deleting the callback object.

          Igor Gaponenko added a comment

          Andy Hanushevsky Thank you for the comments!
          Fritz Mueller I would like to close this ticket. It was a small R&D study that has proven quite useful in understanding the effects of an architectural option for Qserv. The ticket also records what needs to be done to configure Qserv (and the Replication System) should we choose to adopt this architecture.

          Fritz Mueller added a comment

          Please feel free to close the ticket, Igor Gaponenko.


            People

            Assignee:
            Igor Gaponenko
            Reporter:
            Igor Gaponenko
            Reviewers:
            Andy Hanushevsky
            Watchers:
            Andy Hanushevsky, Andy Salnikov, Fritz Mueller, Igor Gaponenko, Nate Pease
