  Data Management / DM-5307

Get high volume test script working again at IN2P3 cluster

    Details

    • Type: Story
    • Status: Done
    • Resolution: Done
    • Fix Version/s: None
    • Component/s: Database
    • Labels: None

      Description

      Currently the runQueries.py script falls over when running high-volume tests:

      • "LV": 75,
      • "FTSObj": 3,
      • "FTSSrc": 1,
      • "FTSFSrc": 1,
      • "joinObjSrc": 1,
      • "joinObjFSrc": 1,
      • "nearN": 1

      We need this to be working again to validate recent work on schedulers and to support upcoming work on large results, etc.

            Activity

            John Gates added a comment - edited

            I ran the following to put shared scan information into the cluster:

                INSERT INTO qservCssData.kvData (kvKey, kvVal, parentKvId) VALUES ("/DBS/LSST/TABLES/Source/sharedScan", "", "29");
                INSERT INTO qservCssData.kvData (kvKey, kvVal, parentKvId) VALUES ('/DBS/LSST/TABLES/Source/sharedScan/.packed.json', '{"lockInMem":"1","scanRating":"15"}',46);
             
                INSERT INTO qservCssData.kvData (kvKey, kvVal, parentKvId) VALUES ("/DBS/LSST/TABLES/Object/sharedScan", "", "33");
                INSERT INTO qservCssData.kvData (kvKey, kvVal, parentKvId) VALUES ('/DBS/LSST/TABLES/Object/sharedScan/.packed.json', '{"lockInMem":"1","scanRating":"5"}',48);
             
                INSERT INTO qservCssData.kvData (kvKey, kvVal, parentKvId) VALUES ("/DBS/LSST/TABLES/ForcedSource/sharedScan", "", "37");
                INSERT INTO qservCssData.kvData (kvKey, kvVal, parentKvId) VALUES ('/DBS/LSST/TABLES/ForcedSource/sharedScan/.packed.json', '{"lockInMem":"1","scanRating":"16"}', 50);
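            A quick way to confirm that the rows landed is a SELECT against the same table. The snippet below is only a sketch: the host, user, and absence of a password are placeholder assumptions, not the actual cluster settings.

                # Sketch only: verify the sharedScan keys inserted above.
                # Host/user/password are placeholders, not the real cluster settings.
                import MySQLdb

                conn = MySQLdb.connect(host='localhost', user='qsmaster', db='qservCssData')
                cursor = conn.cursor()
                cursor.execute(
                    "SELECT kvKey, kvVal, parentKvId FROM kvData WHERE kvKey LIKE %s",
                    ("%sharedScan%",))
                for kvKey, kvVal, parentKvId in cursor.fetchall():
                    print kvKey, kvVal, parentKvId   # Python 2, matching the stack in this ticket
                conn.close()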
            
            

            John Gates added a comment - edited

            The mysql proxy becomes unresponsive while running runQueries.py, and I believe this is causing the connection issues. Note that a variable value request took 46 seconds shortly before runQueries.py failed. Increasing net_read_timeout from 30 to 60 and connect_timeout from 10 to 20 had no noticeable effect.

             
            mysql> SHOW VARIABLES LIKE 'net_read_timeout';
            +------------------+-------+
            | Variable_name    | Value |
            +------------------+-------+
            | net_read_timeout | 60    |
            +------------------+-------+
            1 row in set (46.42 sec)
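            For reference, the timeout change described above can be made at runtime roughly as in the sketch below; it did not help in this case, the connection parameters are placeholders, and a my.cnf edit is needed for the values to survive a restart.

                # Sketch of the timeout bump described above (it had no noticeable effect here).
                # Connection parameters are placeholders; edit my.cnf to persist the values.
                import MySQLdb

                conn = MySQLdb.connect(host='localhost', user='root', passwd='********')
                cursor = conn.cursor()
                cursor.execute("SET GLOBAL net_read_timeout = 60")  # was 30
                cursor.execute("SET GLOBAL connect_timeout = 20")   # was 10
                conn.close()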
            

            Errors (a sample of the errors encountered during a single runQueries.py execution; 'all backends are down' is far more frequent):

              File "runQueries.py", line 182, in runQueries
                db='LSST')
              File "/afs/in2p3.fr/home/j/jgates/stack/Linux64/mysqlpython/1.2.3.lsst1/lib/python/MySQL_python-1.2.3-py2.7-linux-x86_64.egg/MySQLdb/__init__.py", line 81, in Connect
                return Connection(*args, **kwargs)
              File "/afs/in2p3.fr/home/j/jgates/stack/Linux64/mysqlpython/1.2.3.lsst1/lib/python/MySQL_python-1.2.3-py2.7-linux-x86_64.egg/MySQLdb/connections.py", line 187, in __init__
                super(Connection, self).__init__(*args, **kwargs2)
            OperationalError: (1105, '(proxy) all backends are down')
             
             
              File "runQueries.py", line 182, in runQueries
                db='LSST')
              File "/afs/in2p3.fr/home/j/jgates/stack/Linux64/mysqlpython/1.2.3.lsst1/lib/python/MySQL_python-1.2.3-py2.7-linux-x86_64.egg/MySQLdb/__init__.py", line 81, in Connect
                return Connection(*args, **kwargs)
              File "/afs/in2p3.fr/home/j/jgates/stack/Linux64/mysqlpython/1.2.3.lsst1/lib/python/MySQL_python-1.2.3-py2.7-linux-x86_64.egg/MySQLdb/connections.py", line 187, in __init__
                super(Connection, self).__init__(*args, **kwargs2)
            OperationalError: (2013, "Lost connection to MySQL server at 'reading authorization packet', system error: 0")
             
              File "runQueries.py", line 189, in runQueries
                cursor.execute(q)
              File "/afs/in2p3.fr/home/j/jgates/stack/Linux64/mysqlpython/1.2.3.lsst1/lib/python/MySQL_python-1.2.3-py2.7-linux-x86_64.egg/MySQLdb/cursors.py", line 174, in execute
                self.errorhandler(self, exc, value)
              File "/afs/in2p3.fr/home/j/jgates/stack/Linux64/mysqlpython/1.2.3.lsst1/lib/python/MySQL_python-1.2.3-py2.7-linux-x86_64.egg/MySQLdb/connections.py", line 36, in defaulterrorhandler
                raise errorclass, errorvalue
            InterfaceError: (-1, 'error totally whack')
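            A possible client-side workaround while the proxy stalls is to retry the connection a few times instead of failing outright. The sketch below is hypothetical and not part of runQueries.py; the host, port, user, and retry policy are assumptions.

                # Hypothetical retry wrapper (not in runQueries.py): tolerate short
                # periods where mysql-proxy reports all backends down or drops the
                # connection. Host/port/user and the retry policy are assumptions.
                import time
                import MySQLdb
                from MySQLdb import OperationalError

                def connect_with_retry(retries=5, delay=10):
                    for attempt in range(retries):
                        try:
                            return MySQLdb.connect(host='127.0.0.1', port=4040,
                                                   user='qsmaster', db='LSST')
                        except OperationalError as exc:
                            # e.g. (1105, '(proxy) all backends are down') or error 2013
                            print 'connect attempt %d failed: %s' % (attempt + 1, exc)
                            time.sleep(delay)
                    raise OperationalError(-1, 'proxy still unresponsive after %d attempts' % retries)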
            

            John Gates added a comment - edited

            Running queries appears to cause mysql-proxy to leak memory, according to top output. The amount of memory leaked appears to be related to the size of the results.
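            One way to quantify the growth seen in top is to sample the proxy's resident set size while runQueries.py runs. A rough sketch follows; the pgrep pattern and sampling interval are assumptions.

                # Rough sketch: print mysql-proxy RSS (kB) once a minute while the
                # queries run. Assumes a single mysql-proxy process; the pgrep
                # pattern and interval are guesses.
                import subprocess
                import time

                def proxy_rss_kb():
                    pid = subprocess.check_output(['pgrep', '-o', 'mysql-proxy']).strip()
                    return int(subprocess.check_output(['ps', '-o', 'rss=', '-p', pid]))

                while True:
                    print time.strftime('%H:%M:%S'), proxy_rss_kb(), 'kB'
                    time.sleep(60)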

            John Gates added a comment

            runQueries.py, at least the old version, works again. The problem was that mysql-proxy had to wait while the czar distributed the jobs for the user query to the workers. Putting the job distribution in a separate thread appears to have fixed the problem.
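            The actual change is in the czar (C++), but the shape of the fix can be sketched language-neutrally: compute what the proxy needs up front, then hand the slow job distribution to a background thread. Everything below is illustrative Python, not Qserv code, and all names are made up.

                # Illustrative only -- not the Qserv czar code. The proxy gets its
                # answer (query id, result table name) immediately; distributing
                # chunk jobs to workers happens in a background thread.
                import threading

                def register_query(query):
                    # stand-in for query registration: returns an id and result table name
                    return 42, 'result_42'

                def distribute_jobs(query_id, query, workers):
                    # stand-in for sending chunk jobs to each worker
                    for worker in workers:
                        pass

                def submit_query(query, workers):
                    query_id, result_table = register_query(query)   # fast, synchronous
                    t = threading.Thread(target=distribute_jobs,
                                         args=(query_id, query, workers))
                    t.setDaemon(True)
                    t.start()                                         # slow part runs here
                    return query_id, result_table                     # returned to the proxy at once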

            Andy Salnikov added a comment

            John, I lied to you yesterday when I said that the new in-process czar should behave exactly like the old XML-RPC stuff. I have re-checked the old implementation and that isn't true: in the old code UserQuery::submit() was indeed executed in a separate thread, so the czar returned earlier. In the new code we could do the same by moving the call to uq->submit() inside the finalizer lambda in Czar::submitQuery(), except that we cannot, because submit() calculates the queryId and result table name, which the czar needs when it returns from Czar::submitQuery(). I would suggest moving the code that registers the query in QMeta from submit() into the constructor; then we can execute the whole submit() method in a finalizer thread. I think that would make the code a bit cleaner than your current solution.

            Sorry for this mess, I should have checked it sooner.

            John Gates added a comment

            I'll need to look more closely. It looked like there was a minor race condition in Executive::add(), which has been fixed.

            Andy Salnikov added a comment

            John, looks OK to me; a couple of minor comments in the PR.


              People

              • Assignee: John Gates
              • Reporter: Fritz Mueller
              • Reviewers: Andy Salnikov
              • Watchers: Andy Salnikov, Fritz Mueller, John Gates