Data Management / DM-11997

Create and load 30% DR1 test dataset at IN2P3


    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      A 30% DR1 dataset is needed at IN2P3 to use as a basis for the F17 KPM measurements.

      Given the amount of RAM on the computers, the existing chunks are already larger than they should be, so the data will need to be re-partitioned into smaller chunks.
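
      As a rough illustration only (every number below is a hypothetical placeholder, not a value from this ticket), a back-of-the-envelope check of the required chunk count might look like:

        # Hypothetical numbers: estimate how many chunks the largest table needs
        # so that a single chunk stays within a per-worker memory budget.
        TABLE_SIZE_GB = 30_000    # placeholder size of the biggest table (e.g. ForcedSource)
        CURRENT_CHUNKS = 1_500    # placeholder current chunk count
        RAM_BUDGET_GB = 8         # placeholder per-chunk memory budget on a worker

        avg_chunk_gb = TABLE_SIZE_GB / CURRENT_CHUNKS
        needed_chunks = -(-TABLE_SIZE_GB // RAM_BUDGET_GB)   # ceiling division

        print(f"average chunk size now: {avg_chunk_gb:.1f} GB")
        print(f"chunks needed to stay under {RAM_BUDGET_GB} GB each: {needed_chunks}")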

        Attachments

          Issue Links

            Activity

            John Gates added a comment -

            For the query SELECT COUNT(*) FROM ForcedSource, the first time qserv tries it, it slowly works its way through each chunk. On the workers, the first time through, the query takes 5-10 minutes. After this first time, the COUNT(*) queries return immediately. MariaDB is writing something to the ForcedSource_*.MYI files, and once that is done the queries are fast.
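
            A small way to reproduce this on a single worker is sketched below; the connection details and the chunk number are placeholders, and pymysql simply stands in for whatever client is convenient. The first COUNT(*) pays the one-time cost spent on the .MYI file, the second should return almost immediately.

            import time
            import pymysql

            # Placeholder connection to a worker's MariaDB instance.
            conn = pymysql.connect(host="localhost", port=13306, user="qsmaster",
                                   password="", database="LSST30")
            with conn.cursor() as cur:
                for attempt in (1, 2):
                    start = time.time()
                    # Chunk 37807 is just the chunk number seen in the logs below.
                    cur.execute("SELECT COUNT(*) FROM ForcedSource_37807")
                    rows = cur.fetchone()[0]
                    print(f"attempt {attempt}: {rows} rows in {time.time() - start:.1f}s")
            conn.close()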

            More concerning, the czar works its way through these queries for a few hours and then dies. The workers see this and stop functioning with this message:

            [2018-01-12T21:31:18.614Z] [LWP:42] INFO xrdssi.msgs (xrootd:0) - qserv.30:24@10.158.37.100 Ssi_Finalize: 0:/chk/LSST30/37807 [bound odRsp] Calling Finished(False)
            [2018-01-12T21:31:18.614Z] [LWP:42] DEBUG xrdsvc.SsiSession (core/modules/xrdsvc/SsiSession.cc:55) - ~SsiSession()
            [2018-01-12T21:31:18.614Z] [LWP:42] INFO xrdssi.msgs (xrootd:0) - qserv.30:24@10.158.37.100 Ssi_Unbind: 0:/chk/LSST30/37807 [bound odRsp] Recycling request...
            [2018-01-12T21:31:39.231Z] [LWP:116] INFO xrdssi.msgs (cmsd:0) - Login: Primary server 30 logged out
            [2018-01-12T21:31:39.425Z] [LWP:117] INFO xrdssi.msgs (cmsd:0) - 1 message lost!
            [2018-01-12T21:31:39.425Z] [LWP:117] INFO xrdssi.msgs (cmsd:0) - State: Status changed to suspended
            [2018-01-12T21:43:48.301Z] [LWP:97] INFO xrdssi.msgs (cmsd:0) - Manager: manager.0:26@10.158.37.100 removed; request read failed
            

            The czar's log has heaps upon heaps of messages like "[2018-01-12T21:43:47.179Z] [LWP:327] INFO xrdssi.msgs (client:0) - Registering task: "WaitTask for: 0x0x7f49e06c95a0" to be run at: [2018-01-12 21:43:57 +0000]", with more than 2000 "WaitTask"-related messages in a single second. After 2018-01-12T21:39:00.142Z, there appears to be nothing in the log file except "WaitTask" messages. There are no errors, warnings, or other indications I have found that anything changed near this point, but the czar is no longer communicating with the workers and is doing nothing but WaitTask-related work.
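
            A quick way to quantify the flood (my own sketch, not part of the ticket; the log path is a placeholder) is to count the "WaitTask" lines per second in the czar log:

            import collections
            import re

            counts = collections.Counter()
            # Timestamps look like "[2018-01-12T21:43:47.179Z] ..."; bucket by second.
            stamp = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")
            with open("/qserv/log/czar.log") as log:   # placeholder path
                for line in log:
                    if "WaitTask" in line:
                        m = stamp.match(line)
                        if m:
                            counts[m.group(1)] += 1

            # Print the ten busiest seconds.
            for second, n in counts.most_common(10):
                print(second, n)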

            Both the czar and worker behavior are worrisome: the czar should not be going down, and the workers should keep going, killing tasks as appropriate.

            John Gates added a comment - edited

            Ganglia shows that, on the attempt in the above comment, a few workers essentially stopped doing anything shortly after the query started; ccqserv108, ccqserv106, ccqserv114, and ccqserv116 are examples of workers whose load dropped to zero within 20 minutes of the query starting.

            Looking at ccqserv108, it lost its connection to the czar and gave up a couple of hours before ccqserv110 did:

            [2018-01-12T19:27:39.121Z] [LWP:40] DEBUG xrdsvc.SsiSession (core/modules/xrdsvc/SsiSession.cc:181) - RequestFinished type=isStream
            [2018-01-12T19:27:39.121Z] [LWP:40] INFO  xrdssi.msgs (xrootd:0) - qserv.30:24@10.158.37.100 Ssi_Finalize: 0:/chk/LSST30/20229 [bound odRsp] Calling Finished(False)
            [2018-01-12T19:27:39.121Z] [LWP:40] DEBUG xrdsvc.SsiSession (core/modules/xrdsvc/SsiSession.cc:55) - ~SsiSession()
            [2018-01-12T19:27:39.121Z] [LWP:40] INFO  xrdssi.msgs (xrootd:0) - qserv.30:24@10.158.37.100 Ssi_Unbind: 0:/chk/LSST30/20229 [bound odRsp] Recycling request...
            [2018-01-12T19:27:39.161Z] [LWP:115] INFO  xrdssi.msgs (cmsd:0) - Login: Primary server 29 logged out
            [2018-01-12T19:27:39.161Z] [LWP:116] INFO  xrdssi.msgs (cmsd:0) - 1 message lost!
            [2018-01-12T19:27:39.161Z] [LWP:116] INFO  xrdssi.msgs (cmsd:0) - State: Status changed to suspended
            
            

            John Gates added a comment -

            It looks like the "WaitTask" messages in the log only start to appear after one of the workers has gone to the "Status changed to suspended" state. The frequency of "WaitTask" messages appears to increase with time, even if only one worker has gone down. xrootd has gone down and produced a core file on ccqserv122:

            Program terminated with signal SIGSEGV, Segmentation fault.
            #0 __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
            66	../nptl/pthread_mutex_lock.c: No such file or directory.
            (gdb) backtrace
            #0 __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
            #1 0x00007f1e7310933e in XrdSsiMutex::Lock (this=0x0)
            at /qserv/stack/stack/miniconda3-4.3.21-10a4fa6/EupsBuildDir/Linux64/xrootd-master-g32dc508f35/xrootd-master-g32dc508f35/src/./XrdSsi/XrdSsiAtomics.hh:116
            #2 0x00007f1e7310bfc7 in XrdSsiRequest::Finished (this=0x7f1e24099a20, 
            cancel=false)
            at /qserv/stack/stack/miniconda3-4.3.21-10a4fa6/EupsBuildDir/Linux64/xrootd-master-g32dc508f35/xrootd-master-g32dc508f35/src/XrdSsi/XrdSsiRequest.cc:99
            #3 0x00007f1e6e918746 in XrdSsiFileReq::Finalize (this=0x7f1e24099a20)
            at /qserv/stack/stack/miniconda3-4.3.21-10a4fa6/EupsBuildDir/Linux64/xrootd-master-g32dc508f35/xrootd-master-g32dc508f35/src/XrdSsi/XrdSsiFileReq.cc:444
            #4 0x00007f1e6e91caac in XrdSsiFileSess::read (this=0x7f1e240993a0, 
            offset=0, buff=0x7f1e3c002400 "\b", blen=67)
            at /qserv/stack/stack/miniconda3-4.3.21-10a4fa6/EupsBuildDir/Linux64/xrootd-master-g32dc508f35/xrootd-master-g32dc508f35/src/XrdSsi/XrdSsiFileSess.cc:505
            #5 0x00007f1e6e91631c in XrdSsiFile::read (this=0x7f1e24098ad0, offset=0, 
            buff=0x7f1e3c002400 "\b", blen=67)
            at /qserv/stack/stack/miniconda3-4.3.21-10a4fa6/EupsBuildDir/Linux64/xrootd-master-g32dc508f35/xrootd-master-g32dc508f35/src/XrdSsi/XrdSsiFile.cc:406
            #6 0x00007f1e74c64c19 in XrdXrootdProtocol::do_ReadAll (this=0x7f1e3c001fe0, 
            asyncOK=1)

            I've moved the core file to ccqserv122:/qserv/log/.
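
            In case it is useful, here is a sketch of pulling the backtraces for all threads out of that core non-interactively; the binary and core file paths are guesses, not the actual names under /qserv/log/.

            import subprocess

            # gdb in batch mode: load the core against the xrootd binary and dump
            # every thread's backtrace.
            cmd = [
                "gdb", "-batch",
                "-ex", "thread apply all bt",
                "/usr/bin/xrootd",     # placeholder: the xrootd binary that produced the core
                "/qserv/log/core",     # placeholder: the core file moved to /qserv/log/
            ]
            print(subprocess.run(cmd, capture_output=True, text=True).stdout)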

            John Gates added a comment -

            ANALYZE TABLE needs to be run before this can be closed. Notes for doing this are in DM-10360, and it can be done once the cluster is back up.
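
            For reference, a minimal sketch of the idea only, not the actual procedure from DM-10360; connection details are placeholders and pymysql stands in for whatever client is used there. The same loop would apply to the other partitioned tables.

            import pymysql

            # Placeholder connection to a worker's MariaDB instance.
            conn = pymysql.connect(host="localhost", port=13306, user="qsmaster",
                                   password="", database="LSST30")
            with conn.cursor() as cur:
                cur.execute("SHOW TABLES LIKE 'ForcedSource\\_%'")
                for (table,) in cur.fetchall():
                    cur.execute(f"ANALYZE TABLE `{table}`")
                    cur.fetchall()   # drain the status rows that ANALYZE TABLE returns
            conn.close()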

            Vaikunth Thukral added a comment -

            Data loading is done; the next step is to generate statistics, for which a ticket has been created (DM-12566).


              People

              Assignee:
              John Gates
              Reporter:
              John Gates
              Watchers:
              Andy Salnikov, John Gates, Vaikunth Thukral
              Votes:
              0
