# Assist with KPM30 tests


#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s: None
• Labels: None
• Story Points: 13
• Sprint: DB_F18_06
• Team: Data Access and Database

#### Description

Debug Qserv shared scan scheduling and related issues for KPM30.

#### Activity

John Gates added a comment - edited

Between KPM20 and now something has crept into the system that is causing queries to run far more slowly than they did before.

The vast majority of waits are quite reasonable, but the number of header transmits taking more than 40 seconds seems high.

```
memWait=66.9234 avg=2.69793 <1=38446 <5=7487 <10=5862 <20=1426 <40=986 >40=583
header transmit waited for 0.000175 <0.1=57043 <1=915 <5=1229 <10=710 <20=313 <40=381 >40=2214
transmit waited for 0.004942 <0.1=61500 <1=947 <5=124 <10=39 <20=26 <40=40 >40=125
```

Looking at one case in particular, the worker waited 46 seconds before a header was transferred:

```
[2018-06-10T01:29:21.443+0200] [LWP:16722] DEBUG wdb.QueryRunner (core/modules/wdb/QueryRunner.cc:449) - QI=3696693:7472; h&&& transmit waited for 46.061 <0.1=57006 <1=915 <5=1229 <10=710 <20=311 <40=381 >40=2213
```

The czar doesn't appear to be particularly busy: the priority queue has 7 threads running and the queues are essentially empty. Requests should be answered almost immediately, as there should be CPUs and threads available. Then there's a strange gap between T01:28:35.384 and T01:28:48 where nothing is added to the log, and this does not appear to be an isolated incident: gaps of 10 seconds or more with nothing added to the log are a frequent occurrence on both the workers and the czar.

```
QI=3696693: assigned to UserQuery:SELECT o.deepSourceId,f.psfFlux FROM Object o,ForcedSource f WHERE o.deepSourceId=f.deepSourceId AND f.psfFlux BETWEEN 0.13 AND 0.14
[2018-06-10T01:28:35.383+0200] [LWP:1556] DEBUG qdisp.QdispPool (core/modules/qdisp/QdispPool.cc:109) - priQueGet (pr=0:sz=0:r=0)(pr=1:sz=0:r=0)(pr=2:sz=0:r=0)(pr=3:sz=1:r=5)(pr=4:sz=0:r=0)(pr=5:sz=0:r=0)(pr=6:sz=0:r=2)(pr=7:sz=0:r=0)(pr=100:sz=0:r=0)
[2018-06-10T01:28:35.383+0200] [LWP:33878] DEBUG qdisp.QueryRequest (core/modules/qdisp/QueryRequest.cc:224) - QI=3696693:7472;workerName=257@10.158.37.109:1094 ProcessResponse
[2018-06-10T01:28:35.383+0200] [LWP:33878] INFO qdisp.JobStatus (core/modules/qdisp/JobStatus.cc:61) - QI=3696693:7472; Updating state to: RESPONSE_READY code=0
[2018-06-10T01:28:35.384+0200] [LWP:1556] DEBUG qdisp.QueryRequest (core/modules/qdisp/QueryRequest.cc:91) - QI=3696693:7472; AskForResp GetResponseData size=256
[2018-06-10T01:28:35.384+0200] [LWP:1556] INFO xrdssi.msgs (client:0) - Ssi_TaskSetBuff: Async Status=isReady id=0
[2018-06-10T01:28:35.384+0200] [LWP:1556] INFO xrdssi.msgs (client:0) - [0xe8e54a70@xroot://257@ccqserv100.in2p3.fr:1094//chk/LSST/10474] Sending a read command for handle 0x152 to 257@10.158.37.109:1094
[2018-06-10T01:28:48.128+0200] [LWP:33816] INFO xrdssi.msgs (client:0) - Ssi_TaskXeqEvent: sess=ok id=0 aOK=True isSync clrT=7fcc128eb700 xeqT=7fc873b1f700
[2018-06-10T01:29:21.197+0200] [LWP:438] DEBUG qdisp.QdispPool (core/modules/qdisp/QdispPool.cc:109) - priQueGet (pr=0:sz=0:r=0)(pr=1:sz=0:r=0)(pr=2:sz=0:r=0)(pr=3:sz=0:r=8)(pr=4:sz=0:r=0)(pr=5:sz=0:r=0)(pr=6:sz=0:r=2)(pr=7:sz=0:r=0)(pr=100:sz=0:r=0)
[2018-06-10T01:29:21.443+0200] [LWP:33785] DEBUG qdisp.QueryRequest (core/modules/qdisp/QueryRequest.cc:417) - QI=3696693:7472; ProcessResponseData with buflen=256 (more)
[2018-06-10T01:29:21.443+0200] [LWP:33785] INFO qdisp.JobStatus (core/modules/qdisp/JobStatus.cc:61) - QI=3696693:7472; Updating state to: RESPONSE_DATA code=0
```

Worker gaps (six 10-second gaps in 2 minutes):

```
[2018-06-10T01:27:00.196+0200] [LWP:16127] DEBUG xrdsvc.ChannelStream (core/modules/xrdsvc/ChannelStream.cc:86) - Waiting, no data ready
[2018-06-10T01:27:12.226+0200] [LWP:16712] DEBUG wbase.Task (core/modules/wbase/Task.cc:277) - QI=3696701:7496; waitForMemMan end
[2018-06-10T01:27:15.390+0200] [LWP:16712] INFO xrdssi.msgs (xrootd:0) - 259.314:27@10.158.37.100 Ssi_WakeUp: 0:/chk/LSST/10541 [bound doRsp] respCBarg=1a3001b00000005
[2018-06-10T01:27:24.644+0200] [LWP:16726] DEBUG wdb.ChunkResource (core/modules/wdb/ChunkResource.cc:244) - SubChunk release refC=1 db=LSST dbTableSet[[]] sc[[]]
[2018-06-10T01:27:35.265+0200] [LWP:361] DEBUG wpublish.QueriesAndChunks (core/modules/wpublish/QueriesAndChunks.cc:199) - QueriesAndChunks::removeDead deadQueries size=0
[2018-06-10T01:27:40.561+0200] [LWP:16709] DEBUG wdb.QueryRunner (core/modules/wdb/QueryRunner.cc:528) - QI=3696693:7448; &&& running fragment timer=113.612
[2018-06-10T01:27:46.473+0200] [LWP:16712] DEBUG wsched.BlendScheduler (core/modules/wsched/BlendScheduler.cc:268) - BlendSched_ready() SchedGroup(r=0 sz=0 fl=0 avail=14) SchedSlow(r=0 sz=0 fl=0 avail=14) SchedMed(r=0 sz=29 fl=3 avail=13) SchedFast(r=0 sz=0 fl=0 avail=13) SchedSnail(r=0 sz=0 fl=0 avail=13)
[2018-06-10T01:28:35.379+0200] [LWP:16722] DEBUG wdb.QueryRunner (core/modules/wdb/QueryRunner.cc:528) - QI=3696693:7472; &&& running fragment timer=105.921
[2018-06-10T01:28:35.382+0200] [LWP:16722] DEBUG xrdsvc.ChannelStream (core/modules/xrdsvc/ChannelStream.cc:72) - Trying to append message (flowing)
[2018-06-10T01:28:48.125+0200] [LWP:16719] DEBUG wdb.QueryRunner (core/modules/wdb/QueryRunner.cc:528) - QI=3696693:7496; &&& running fragment timer=82.371
[2018-06-10T01:28:51.100+0200] [LWP:16155] DEBUG xrdsvc.ChannelStream (core/modules/xrdsvc/ChannelStream.cc:86) - Waiting, no data ready
[2018-06-10T01:29:03.321+0200] [LWP:16712] DEBUG wbase.Task (core/modules/wbase/Task.cc:277) - QI=3696701:7544; waitForMemMan end
```

Czar gaps tend to be shorter:

```
[2018-06-10T01:27:50.291+0200] [LWP:20595] DEBUG qdisp.Executive (core/modules/qdisp/Executive.cc:514) - QI=3696692: Still 9 in flight.
[2018-06-10T01:28:35.382+0200] [LWP:33878] INFO xrdssi.msgs (client:0) - Ssi_TaskXeqEvent: sess=ok id=0 aOK=True isSync clrT=7fcc118e9700 xeqT=7fc858ff1700
[2018-06-10T01:28:35.384+0200] [LWP:1556] INFO xrdssi.msgs (client:0) - [0xe8e54a70@xroot://257@ccqserv100.in2p3.fr:1094//chk/LSST/10474] Sending a read command for handle 0x152 to 257@10.158.37.109:1094
[2018-06-10T01:28:48.128+0200] [LWP:33816] INFO xrdssi.msgs (client:0) - Ssi_TaskXeqEvent: sess=ok id=0 aOK=True isSync clrT=7fcc128eb700 xeqT=7fc873b1f700
[2018-06-10T01:28:48.139+0200] [LWP:530] INFO xrdssi.msgs (client:0) - [0x18cbd230@xroot://256@ccqserv100.in2p3.fr:1094//chk/LSST/10541] Sending a read command for handle 0x7c to 256@10.158.37.109:1094
[2018-06-10T01:28:51.088+0200] [LWP:34052] INFO xrdssi.msgs (client:0) - Ssi_TaskXeqEvent: sess=ok id=0 aOK=True isSync clrT=7fcc128eb700 xeqT=7fc86fb17700
[2018-06-10T01:28:51.099+0200] [LWP:1372] INFO xrdssi.msgs (client:0) - [0x7cf95380@xroot://258@ccqserv100.in2p3.fr:1094//chk/LSST/10565] Sending a read command for handle 0x13e to 258@10.158.37.109:1094
[2018-06-10T01:29:04.981+0200] [LWP:33785] INFO xrdssi.msgs (client:0) - Ssi_TaskXeqEvent: sess=ok id=0 aOK=True isSync clrT=7fcc128eb700 xeqT=7fc8837fe700
[2018-06-10T01:29:05.546+0200] [LWP:740] INFO xrdssi.msgs (client:0) - [0x74e73db0@xroot://256@ccqserv100.in2p3.fr:1094//chk/LSST/10589] Sending a read command for handle 0x177 to 256@10.158.37.109:1094
[2018-06-10T01:29:21.178+0200] [LWP:34077] INFO xrdssi.msgs (client:0) - Ssi_TaskXeqEvent: sess=ok id=0 aOK=True isReady clrT=7fcc128eb700 xeqT=7fc84dfdb700
[2018-06-10T01:29:21.467+0200] [LWP:34052] INFO xrdssi.msgs (client:0) - ServReal Ssi_Recycle: reuse=True; sessions: free=256 active=11
[2018-06-10T01:29:25.670+0200] [LWP:34124] INFO xrdssi.msgs (client:0) - Ssi_TaskXeqEvent: sess=ok id=0 aOK=True isSync clrT=7fcc118e9700 xeqT=7fc83e2b4700
[2018-06-10T01:29:30.293+0200] [LWP:20595] DEBUG qdisp.Executive (core/modules/qdisp/Executive.cc:514) - QI=3696692: Still 7 in flight.
[2018-06-10T01:29:50.396+0200] [LWP:34052] INFO xrdssi.msgs (client:0) - Ssi_TaskXeqEvent: sess=ok id=0 aOK=True isSync clrT=7fcc128eb700 xeqT=7fc86fb17700
```

It could be that I simply haven't noticed gaps like this in the past, but normally I would expect the logs not to have gaps of more than a tenth of a second while queries are running.

John Gates added a comment -

It is likely the gaps on the worker were caused by queries on the worker taking a very long time. `SELECT o.deepSourceId,f.psfFlux FROM LSST30.Object AS o,LSST30.ForcedSource AS f WHERE o.deepSourceId=f.deepSourceId AND f.psfFlux BETWEEN 0.13 AND 0.14;` appears to be running with incorrect optimization on many chunks, which makes MariaDB take much longer to run it (7 minutes instead of 14 seconds). Tests are being run without the Object and ForcedSource joins.

John Gates added a comment -

After running several tests with Vaikunth, we found that groups of full table scans run fine if they are all on the same scheduler (several queries on the Object table, several queries on the Source table, some join queries on Source and Object). However, when some queries are on different worker schedulers, things can slow down dramatically. The slowdown appears to be characterized by long waits for memman to finish locking tables in memory.

John Gates added a comment -

We made some progress. Part of the problem appears to involve calls to mlock: if more than one mlock call runs at any given time, they tend to clash and take significantly longer. Calls to MemMan lock were placed in a queue, but this resulted in calls to lock files that were already locked, which were expected to complete nearly instantly, getting stuck behind calls to lock large files. The queue was disabled, but there was no lower-level mutex to prevent more than one mlock call at a time. Adding a mutex at the mlock call significantly improved performance.
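The fix described above can be sketched roughly as follows (a minimal illustration with hypothetical names, not the actual MemMan code): a single process-wide mutex ensures only one mlock call runs at a time.

```cpp
#include <mutex>
#include <sys/mman.h>

// Hypothetical sketch: one process-wide mutex serializes all mlock/munlock
// calls so that two large locking operations never overlap in the kernel,
// which is the clash described in the comment above.
static std::mutex mlockMutex;

int serializedMlock(void* addr, std::size_t len) {
    std::lock_guard<std::mutex> guard(mlockMutex);  // one mlock at a time
    return ::mlock(addr, len);
}

int serializedMunlock(void* addr, std::size_t len) {
    std::lock_guard<std::mutex> guard(mlockMutex);
    return ::munlock(addr, len);
}
```

The trade-off is that small, fast locks now queue behind large ones again, but only for the duration of a single mlock rather than an arbitrary queue of them.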

John Gates added a comment -

The main problem appears to be that the OS is not running enough scheduler threads. The scheduler expects 12 threads to get at least enough time to pull jobs from the queue, but only 4 or so threads are actually getting enough time, which leaves the other schedulers starved for threads. Calls to notify appear to work correctly on other systems. Tuning the process scheduler may help, and adjusting the worker scheduler may help as well.

Andy Hanushevsky added a comment -

Hmmm, the OS does not pre-emptively limit the number of threads that can run. If you have x threads ready then the kernel will run x threads, multiplexing the time that exists (not very efficient, but that is what it does). So, I suspect that something blocked the threads and, while they should have been running, they were not able to run.

Andy

John Gates added a comment - edited

At least on some occasions, there is significant blocking on the MemManReal::prepare call while files are being locked. On at least some occasions, the file locking performance is awful (most likely due to poor I/O performance as locking a 200MB file shouldn't take 90 seconds).

There could well be system scheduler issues, but it is difficult to tell. The mutex for the scheduling threads is taking a median of about 10 seconds to lock (which is bad) but is only being held for a fraction of a second by most threads. The 1% of the time a thread holds the mutex for more than 10 seconds can have very serious effects.

I believe we should be able to isolate the prepare call and the lock call so that they do not interfere with each other.
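To get numbers like the ~10-second median lock wait quoted above, lock acquisition can be timed explicitly. This is a hypothetical instrumentation sketch, not the actual Qserv code:

```cpp
#include <chrono>
#include <mutex>

// Hypothetical instrumentation: measure how long a lock() call blocks before
// the mutex is acquired. Long acquisition waits expose contention even when
// each holder releases the mutex after only a fraction of a second.
class TimedMutex {
public:
    // Acquires the mutex and returns the time spent waiting, in seconds.
    double lockAndMeasure() {
        auto start = std::chrono::steady_clock::now();
        _mtx.lock();
        std::chrono::duration<double> waited =
            std::chrono::steady_clock::now() - start;
        return waited.count();
    }

    void unlock() { _mtx.unlock(); }

private:
    std::mutex _mtx;
};
```

Logging the returned wait alongside the hold time separates the two effects described above: short holds by most threads versus the rare multi-second holds that back everything else up.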

John Gates added a comment -

Vaikunth Thukral noticed that swap space on the cluster had been set to zero, which is likely having a detrimental effect on performance as there is not much memory on the computers.

Shifting mutexes so that there is far less overlap improved scheduling and times for LSST. This did not translate to improved times for the LSST30 data. Part of this could well be due to the lack of swap space. Another possibility is that mmap calls occurring in the middle of mlock calls have the same sort of detrimental effect on performance as mlock calls overlapping with other mlock calls. Extending the mutex protecting mlock calls to cover mmap should prevent this. Note that mmap very rarely overlapped with mlock on LSST but did so frequently on LSST30. The mlock calls on LSST30 were usually less than 5 seconds long, very rarely going over 10 seconds, so extending the mutex to cover mmap should have a limited performance hit. The long delays (more than 40 seconds) seem to have been caused by mutex contention over checking the value of memman::MemFile::_isLocked, but I'm still looking into it.
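The extension described above might look like this (a sketch with hypothetical names; the real change belongs in memman): the same process-wide mutex guards both mmap and mlock, so an mmap can no longer start in the middle of an in-progress mlock.

```cpp
#include <cstddef>
#include <mutex>
#include <sys/mman.h>

// Sketch: widen the mutex that serializes mlock so it also covers mmap,
// preventing the two system calls from overlapping. Names are hypothetical.
static std::mutex memSysCallMutex;

void* guardedMmap(std::size_t len) {
    std::lock_guard<std::mutex> guard(memSysCallMutex);
    return ::mmap(nullptr, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

int guardedMlock(void* addr, std::size_t len) {
    std::lock_guard<std::mutex> guard(memSysCallMutex);
    return ::mlock(addr, len);
}
```

Since mmap itself is cheap relative to a multi-second mlock, the added serialization should cost little, matching the "limited performance hit" expectation above.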

The log file contained numerous lines like `MemManStats LockMax=7900000000 Locked=0 Reserved=7459586562`, where the amount of memory reserved by mmap calls was significant and covered at least 3 or 4 different files, but nothing was locked. The amount of memory locked was usually only a small fraction of the memory reserved. I am hoping that restoring swap space improves this situation.


#### People

Assignee:
John Gates
Reporter:
John Gates
Watchers:
Andy Hanushevsky, Fritz Mueller, John Gates, Vaikunth Thukral