For the query SELECT COUNT(*) FROM ForcedSource, the first time qserv runs it, it slowly works its way through each chunk. On the workers, this first pass takes 5-10 minutes. After that, the COUNT(*) queries return immediately. MariaDB is writing something to the ForcedSource_*.MYI files, and once that is done the queries are fast.
More concerning, the czar works its way through these queries for a few hours and then dies. The workers see this and stop functioning, logging these messages:
[2018-01-12T21:31:18.614Z] [LWP:42] INFO xrdssi.msgs (xrootd:0) - qserv.30:24@10.158.37.100 Ssi_Finalize: 0:/chk/LSST30/37807 [bound odRsp] Calling Finished(False)
[2018-01-12T21:31:18.614Z] [LWP:42] DEBUG xrdsvc.SsiSession (core/modules/xrdsvc/SsiSession.cc:55) - ~SsiSession()
[2018-01-12T21:31:18.614Z] [LWP:42] INFO xrdssi.msgs (xrootd:0) - qserv.30:24@10.158.37.100 Ssi_Unbind: 0:/chk/LSST30/37807 [bound odRsp] Recycling request...
[2018-01-12T21:31:39.231Z] [LWP:116] INFO xrdssi.msgs (cmsd:0) - Login: Primary server 30 logged out
[2018-01-12T21:31:39.425Z] [LWP:117] INFO xrdssi.msgs (cmsd:0) - 1 message lost!
[2018-01-12T21:31:39.425Z] [LWP:117] INFO xrdssi.msgs (cmsd:0) - State: Status changed to suspended
[2018-01-12T21:43:48.301Z] [LWP:97] INFO xrdssi.msgs (cmsd:0) - Manager: manager.0:26@10.158.37.100 removed; request read failed
The czar's log has heaps upon heaps of messages like "[2018-01-12T21:43:47.179Z] [LWP:327] INFO xrdssi.msgs (client:0) - Registering task: "WaitTask for: 0x0x7f49e06c95a0" to be run at: [2018-01-12 21:43:57 +0000]", with more than 2000 "WaitTask"-related messages logged per second. After 2018-01-12T21:39:00.142Z, there appears to be nothing in the log file except "WaitTask" messages. I have found no errors, warnings, or other indications that anything changed near this point, but the czar is no longer communicating with the workers and is doing nothing but WaitTask-related work.
Both the czar and worker behavior are worrisome. The czar should not be going down, while the workers should keep going, killing tasks as appropriate.
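To make the WaitTask observations easier to reproduce on another czar log, here is a rough Python sketch that counts "WaitTask" lines per second and reports the timestamp of the last line that is not WaitTask-related. It assumes only the timestamp format seen in the log excerpts above; the function names are hypothetical and not part of qserv, and the log file path is whatever is passed on the command line.

```python
import re
import sys
from collections import Counter

# Leading timestamp of a log line, e.g. [2018-01-12T21:43:47.179Z],
# captured at one-second resolution (format assumed from the excerpts above).
TIMESTAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z\]")


def wait_task_rate(lines):
    """Return a Counter mapping each second to its number of WaitTask lines."""
    counts = Counter()
    for line in lines:
        if "WaitTask" not in line:
            continue
        match = TIMESTAMP_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def last_other_message(lines):
    """Return the latest timestamp (to the second) of a non-WaitTask line.

    Lines without a leading timestamp are ignored. Returns None if every
    timestamped line mentions WaitTask.
    """
    last = None
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match and "WaitTask" not in line:
            # ISO-style timestamps compare correctly as strings.
            if last is None or match.group(1) > last:
                last = match.group(1)
    return last


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python wait_task_rate.py <path to czar log>
    with open(sys.argv[1], errors="replace") as handle:
        log_lines = handle.readlines()
    for second, count in wait_task_rate(log_lines).most_common(5):
        print(second, count)
    print("last non-WaitTask message:", last_other_message(log_lines))
```

Comparing the busiest seconds against the last non-WaitTask timestamp should show whether the message flood starts before or after the czar stops talking to the workers.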