Analyzing the query processing by Czar (log files)
In a specific deployment of Qserv in PDAC the master log files are located on node lsst-qserv-master01 at:
% sudo ls -al /qserv/log
|
total 4020164
|
drwxr-xr-x 3 qserv qserv 154 Nov 7 12:13 .
|
drwxrwx--- 4 qserv qserv 27 Nov 3 15:28 ..
|
-rw-r----- 1 qserv qserv 76187 Nov 7 12:17 mysqld.log
|
-rw-r----- 1 qserv qserv 17627 Nov 8 19:53 mysql-proxy.log
|
-rw-r--r-- 1 qserv qserv 4109190215 Nov 8 20:17 mysql-proxy-lua.log
|
-rw-r--r-- 1 qserv qserv 4092 Nov 4 15:49 qserv-watcher.log
|
-rw-r--r-- 1 qserv qserv 7331575 Nov 7 12:17 qserv-wmgr.log
|
drwxr-xr-x 2 qserv qserv 8192 Nov 8 00:00 worker
|
-rw-r--r-- 1 qserv qserv 0 Oct 11 02:37 xrootd-console.log
|
The query in question and its timing was found by parsing the log file of the MySQL proxy service:
sudo cat /qserv/log/mysql-proxy-lua.log | \
|
grep 'UserQuery registered' | \
|
awk -F\[ '{print $2" "$3}' | \
|
awk -F\] '{print $1": "$3}' | \
|
awk -F\: '{print $1":"$2":"$3" "$6}' | \
|
sed -e 's/ UserQuery registered //'
|
..
|
2016-11-08T18:14:59.902-0600 select count(*) from Science_Ccd_Exposure,RunDeepForcedSource where scienceCcdExposureId=exposure_id and objectId=3219370448785419 and filterId=1;
|
..
|
The next step was to find the Query ID corresponding to that query by parsing the log file in the vicinity of the query. The identifier was
Then all records containing this identifier were pulled from the log file:
sudo fgrep 'QI=493' /qserv/log/mysql-proxy-lua.log > temp/QI_493.log
|
Further analysis of that file allowed to get the boundaries of the query:
head -1 temp/QI_493.log
|
[2016-11-08T19:18:50.724-0600] [LWP:1702] DEBUG qmeta.QMetaMysql (core/modules/qmeta/QMetaMysql.cc:300) - QI=493: Executing query: INSERT INTO QTable (queryId, dbName, tblName) VALUES (493, 'sdss_stripe82_00', 'Science_Ccd_Exposure')
|
|
tail -1 temp/QI_493.log
|
[2016-11-08T19:18:55.070-0600] [LWP:4195] DEBUG ccontrol.UserQuerySelect (core/modules/ccontrol/UserQuerySelect.cc:293) - QI=493: Discarded UserQuerySelect
|
And am ultimate reason for this particular bug report was found in the same log file. According to the log file Master broadcast this query to all 1080 chunks. There are 1080 query sub-ID*s reported in the log, and for each such *sub-ID there is a dedicated instance of the QueryRequest object instantiated at the begining and discarded in the end. For example:
...
|
[2016-11-08T19:18:54.455-0600] [LWP:4080] DEBUG root (core/modules/qdisp/QueryRequest.cc:287) - QI=493:1079; QueryRequest::cleanup()
|
[2016-11-08T19:18:54.577-0600] [LWP:4195] DEBUG ccontrol.UserQuerySelect (core/modules/ccontrol/UserQuerySelect.cc:254) - QI=493: Joined everything (success)
|
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.JobQuery (core/modules/qdisp/JobQuery.cc:60) - ~JobQuery QI=493:1079;
|
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.QueryRequest (core/modules/qdisp/QueryRequest.cc:66) - QI=493:1079; ~QueryRequest
|
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.JobQuery (core/modules/qdisp/JobQuery.cc:60) - ~JobQuery QI=493:1078;
|
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.QueryRequest (core/modules/qdisp/QueryRequest.cc:66) - QI=493:1078; ~QueryRequest
|
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.JobQuery (core/modules/qdisp/JobQuery.cc:60) - ~JobQuery QI=493:1077;
|
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.QueryRequest (core/modules/qdisp/QueryRequest.cc:66) - QI=493:1077; ~QueryRequest
|
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.JobQuery (core/modules/qdisp/JobQuery.cc:60) - ~JobQuery QI=493:1076;
|
...
|
More over, there is a direct evidence that this query is supposed to be "chunked*. This is seen in the log file analysis:
% sudo fgrep 'QI=493' -B 100 /qserv/log/mysql-proxy-lua.log
|
...
|
[2016-11-08T19:18:50.720-0600] [LWP:1702] DEBUG qproc.QuerySession (core/modules/qproc/QuerySession.cc:108) - Query Plugins applied:
|
QuerySession description:
|
original: select count(*) from Science_Ccd_Exposure,RunDeepForcedSource where scienceCcdExposureId=exposure_id and objectId=3219370448785419 and filterId=1;
|
has chunks: 1
|
chunks: []
|
needs merge: 1
|
1st parallel statement: SELECT count(*) AS QS1_COUNT FROM sdss_stripe82_00.Science_Ccd_Exposure AS QST_1_,sdss_stripe82_00.RunDeepForcedSource_%CC% AS QST_2_ WHERE scienceCcdExposureId=exposure_id AND objectId=3219370448785419 AND fi
|
lterId=1
|
merge statement: SELECT SUM(QS1_COUNT)
|
scanRating:15 ScanTable: sdss_stripe82_00.Science_Ccd_Exposure lock=0 rating=0
|
ScanTable: sdss_stripe82_00.RunDeepForcedSource lock=1 rating=15
|
...
|
Analyzing the query processing by Czar (log files)
In a specific deployment of Qserv in PDAC the master log files are located on node lsst-qserv-master01 at:
total 4020164
drwxr-xr-x 3 qserv qserv 154 Nov 7 12:13 .
drwxrwx--- 4 qserv qserv 27 Nov 3 15:28 ..
-rw-r----- 1 qserv qserv 76187 Nov 7 12:17 mysqld.log
-rw-r----- 1 qserv qserv 17627 Nov 8 19:53 mysql-proxy.log
-rw-r--r-- 1 qserv qserv 4109190215 Nov 8 20:17 mysql-proxy-lua.log
-rw-r--r-- 1 qserv qserv 4092 Nov 4 15:49 qserv-watcher.log
-rw-r--r-- 1 qserv qserv 7331575 Nov 7 12:17 qserv-wmgr.log
drwxr-xr-x 2 qserv qserv 8192 Nov 8 00:00 worker
-rw-r--r-- 1 qserv qserv 0 Oct 11 02:37 xrootd-console.log
The query in question and its timing was found by parsing the log file of the MySQL proxy service:
..
..
The next step was to find the Query ID corresponding to that query by parsing the log file in the vicinity of the query. The identifier was
493
Then all records containing this identifier were pulled from the log file:
Further analysis of that file allowed to get the boundaries of the query:
And am ultimate reason for this particular bug report was found in the same log file. According to the log file Master broadcast this query to all 1080 chunks. There are 1080 query sub-ID*s reported in the log, and for each such *sub-ID there is a dedicated instance of the QueryRequest object instantiated at the begining and discarded in the end. For example:
...
[2016-11-08T19:18:54.455-0600] [LWP:4080] DEBUG root (core/modules/qdisp/QueryRequest.cc:287) - QI=493:1079; QueryRequest::cleanup()
[2016-11-08T19:18:54.577-0600] [LWP:4195] DEBUG ccontrol.UserQuerySelect (core/modules/ccontrol/UserQuerySelect.cc:254) - QI=493: Joined everything (success)
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.JobQuery (core/modules/qdisp/JobQuery.cc:60) - ~JobQuery QI=493:1079;
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.QueryRequest (core/modules/qdisp/QueryRequest.cc:66) - QI=493:1079; ~QueryRequest
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.JobQuery (core/modules/qdisp/JobQuery.cc:60) - ~JobQuery QI=493:1078;
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.QueryRequest (core/modules/qdisp/QueryRequest.cc:66) - QI=493:1078; ~QueryRequest
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.JobQuery (core/modules/qdisp/JobQuery.cc:60) - ~JobQuery QI=493:1077;
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.QueryRequest (core/modules/qdisp/QueryRequest.cc:66) - QI=493:1077; ~QueryRequest
[2016-11-08T19:18:54.828-0600] [LWP:4195] DEBUG qdisp.JobQuery (core/modules/qdisp/JobQuery.cc:60) - ~JobQuery QI=493:1076;
...
More over, there is a direct evidence that this query is supposed to be "chunked*. This is seen in the log file analysis:
...
QuerySession description:
has chunks: 1
chunks: []
needs merge: 1
lterId=1
merge statement: SELECT SUM(QS1_COUNT)
scanRating:15 ScanTable: sdss_stripe82_00.Science_Ccd_Exposure lock=0 rating=0
ScanTable: sdss_stripe82_00.RunDeepForcedSource lock=1 rating=15
...