Uploaded image for project: 'Data Management'
  1. Data Management
  2. DM-21126

Investigate SQuaSH API 500 Server Error

    XMLWordPrintable

    Details

    • Type: Story
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      These errors are becoming more frequent now and require better monitoring:

      https://ci.lsst.codes/blue/organizations/jenkins/scipipe%2Fap_verify/detail/ap_verify/323/
      https://ci.lsst.codes/blue/organizations/jenkins/scipipe%2Fap_verify/detail/ap_verify/336/ https://ci.lsst.codes/blue/organizations/jenkins/scipipe%2Fap_verify/detail/ap_verify/337/

      It is not always reproducible and probably triggered by a new condition. API logs are not useful to track this down:

      [pid: 22|app: 0|req: 185377/314425] 10.20.0.1 () {44 vars in 750 bytes} [Thu Aug 29 12:31:52 2019] POST /job => generated 151 bytes in 1861 msecs (HTTP/1.1 202) 2 headers in 78 bytes (1 switches on core 1)
      [pid: 21|app: 0|req: 94227/314427] 10.20.0.1 () {42 vars in 552 bytes} [Thu Aug 29 12:31:57 2019] POST /auth => generated 188 bytes in 55 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 1)
      [2019-08-29 12:31:59,132] ERROR in app: Exception on /job [POST]
      [pid: 22|app: 0|req: 185378/314428] 10.20.0.1 () {44 vars in 750 bytes} [Thu Aug 29 12:31:57 2019] POST /job => generated 37 bytes in 1344 msecs (HTTP/1.1 500) 2 headers in 90 bytes (1 switches on core 0)
      [pid: 22|app: 0|req: 185380/314432] 10.20.0.1 () {42 vars in 552 bytes} [Thu Aug 29 12:36:31 2019] POST /auth => generated 188 bytes in 56 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 0)
      [pid: 22|app: 0|req: 185381/314433] 10.20.0.1 () {44 vars in 750 bytes} [Thu Aug 29 12:36:31 2019] POST /job => generated 151 bytes in 1497 msecs (HTTP/1.1 202) 2 headers in 78 bytes (1 switches on core 1)
      

        Attachments

          Activity

          Hide
          krzys Krzysztof Findeisen added a comment - - edited

          On all three failed runs, Jenkins re-runs ap_verify and re-uploads the (new) results. Some runs have multiple copies of data in SQuaSH (most obvious with ap_association.totalUnassociatedDiaObjects, which should have exactly one value per run), while others do not.

          All three runs upload ap_verify..verify.json (which contains totalUnassociatedDiaObjects) first, then try but fail to upload ap_verify.date2015-02-18_filterg_visit411371_ccdnum56_objectBlind15A_42.verify.json, so it's not clear why the duplication is inconsistent. Figuring that out might help track the exception down.

          Show
          krzys Krzysztof Findeisen added a comment - - edited On all three failed runs, Jenkins re-runs ap_verify and re-uploads the (new) results. Some runs have multiple copies of data in SQuaSH (most obvious with ap_association.totalUnassociatedDiaObjects , which should have exactly one value per run), while others do not. All three runs upload ap_verify..verify.json (which contains totalUnassociatedDiaObjects ) first, then try but fail to upload ap_verify.date2015-02-18_filterg_visit411371_ccdnum56_objectBlind15A_42.verify.json , so it's not clear why the duplication is inconsistent. Figuring that out might help track the exception down.
          Hide
          afausti Angelo Fausti added a comment -

          Another situation where a 500 error occurs is during the pod initialization. Sometimes the cloud sql proxy container is not ready when the API container starts. In this case it just needs to retry.

          $ kubectl logs squash-restful-api-7c65bcb577-dz2zb -c api -nsquash-prod | more
          [uWSGI] getting INI configuration from uwsgi.ini
          *** Starting uWSGI 2.0.18 (64bit) on [Sat Aug  1 00:59:52 2020] ***
          compiled with version: 6.3.0 20170516 on 29 March 2019 15:15:08
          os: Linux-4.19.112+ #1 SMP Tue Apr 28 06:48:29 PDT 2020
          nodename: squash-restful-api-7c65bcb577-dz2zb
          machine: x86_64
          clock source: unix
          detected number of CPU cores: 4
          current working directory: /opt/squash
          detected binary path: /usr/local/bin/uwsgi
          !!! no internal routing support, rebuild with pcre support !!!
          *** WARNING: you are running uWSGI without its master process manager ***
          your memory page size is 4096 bytes
          detected max file descriptor number: 1048576
          lock engine: pthread robust mutexes
          thunder lock: disabled (you can enable it with --thunder-lock)
          uwsgi socket 0 bound to TCP address 127.0.0.1:5000 fd 3
          Python version: 3.6.8 (default, Mar 27 2019, 08:53:45)  [GCC 6.3.0 20170516]
          Python main interpreter initialized at 0x55f8d3f9aa20
          python threads support enabled
          your server socket listen backlog is limited to 100 connections
          your mercy for graceful operations on workers is 60 seconds
          mapped 333376 bytes (325 KB) for 8 cores
          *** Operational MODE: preforking+threaded ***
          Traceback (most recent call last):
            File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 583, in connect
              **kwargs)
            File "/usr/local/lib/python3.6/socket.py", line 724, in create_connection
              raise err
            File "/usr/local/lib/python3.6/socket.py", line 713, in create_connection
              sock.connect(sa)
          ConnectionRefusedError: [Errno 111] Connection refused
           
          During handling of the above exception, another exception occurred:
           
          Traceback (most recent call last):
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2262, in _wrap_pool_connect
              return fn()
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 354, in connect
              return _ConnectionFairy._checkout(self)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 751, in _checkout
              fairy = _ConnectionRecord.checkout(pool)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 483, in checkout
              rec = pool._do_get()
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 138, in _do_get
              self._dec_overflow()
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
              compat.reraise(exc_type, exc_value, exc_tb)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
              raise value
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 135, in _do_get
              return self._create_connection()
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 299, in _create_connection
              return _ConnectionRecord(self)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 428, in __init__
              self.__connect(first_connect_check=True)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 630, in __connect
              connection = pool._invoke_creator(self)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
              return dialect.connect(*cargs, **cparams)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 453, in connect
              return self.dbapi.connect(*cargs, **cparams)
            File "/usr/local/lib/python3.6/site-packages/pymysql/__init__.py", line 94, in Connect
              return Connection(*args, **kwargs)
            File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 325, in __init__
              self.connect()
            File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 630, in connect
              raise exc
          pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)")
           
          The above exception was the direct cause of the following exception:
           
          Traceback (most recent call last):
            File "run.py", line 48, in <module>
              db.create_all()
            File "/usr/local/lib/python3.6/site-packages/flask_sqlalchemy/__init__.py", line 963, in create_all
              self._execute_for_all_tables(app, bind, 'create_all')
            File "/usr/local/lib/python3.6/site-packages/flask_sqlalchemy/__init__.py", line 955, in _execute_for_all_tables
              op(bind=self.get_engine(app, bind), **extra)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/sql/schema.py", line 4287, in create_all
              ddl.SchemaGenerator, self, checkfirst=checkfirst, tables=tables
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2032, in _run_visitor
              with self._optional_conn_ctx_manager(connection) as conn:
            File "/usr/local/lib/python3.6/contextlib.py", line 81, in __enter__
              return next(self.gen)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2024, in _optional_conn_ctx_manager
              with self._contextual_connect() as conn:
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2226, in _contextual_connect
              self._wrap_pool_connect(self.pool.connect, None),
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2266, in _wrap_pool_connect
              e, dialect, self
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1536, in _handle_dbapi_exception_noconnection
              util.raise_from_cause(sqlalchemy_exception, exc_info)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
              reraise(type(exception), exception, tb=exc_tb, cause=cause)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
              raise value.with_traceback(tb)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2262, in _wrap_pool_connect
              return fn()
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 354, in connect
              return _ConnectionFairy._checkout(self)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 751, in _checkout
              fairy = _ConnectionRecord.checkout(pool)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 483, in checkout
              rec = pool._do_get()
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 138, in _do_get
              self._dec_overflow()
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
              compat.reraise(exc_type, exc_value, exc_tb)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
              raise value
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 135, in _do_get
              return self._create_connection()
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 299, in _create_connection
              return _ConnectionRecord(self)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 428, in __init__
              self.__connect(first_connect_check=True)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 630, in __connect
              connection = pool._invoke_creator(self)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
              return dialect.connect(*cargs, **cparams)
            File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 453, in connect
              return self.dbapi.connect(*cargs, **cparams)
            File "/usr/local/lib/python3.6/site-packages/pymysql/__init__.py", line 94, in Connect
              return Connection(*args, **kwargs)
            File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 325, in __init__
              self.connect()
            File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 630, in connect
              raise exc
          sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)")
          (Background on this error at: http://sqlalche.me/e/e3q8)
          
          

          Show
          afausti Angelo Fausti added a comment - Another situation where a 500 error occurs is during the pod initialization. Sometimes the cloud sql proxy container is not ready when the API container starts. In this case it just needs to retry. $ kubectl logs squash-restful-api-7c65bcb577-dz2zb -c api -nsquash-prod | more [uWSGI] getting INI configuration from uwsgi.ini *** Starting uWSGI 2.0.18 (64bit) on [Sat Aug 1 00:59:52 2020] *** compiled with version: 6.3.0 20170516 on 29 March 2019 15:15:08 os: Linux-4.19.112+ #1 SMP Tue Apr 28 06:48:29 PDT 2020 nodename: squash-restful-api-7c65bcb577-dz2zb machine: x86_64 clock source: unix detected number of CPU cores: 4 current working directory: /opt/squash detected binary path: /usr/local/bin/uwsgi !!! no internal routing support, rebuild with pcre support !!! *** WARNING: you are running uWSGI without its master process manager *** your memory page size is 4096 bytes detected max file descriptor number: 1048576 lock engine: pthread robust mutexes thunder lock: disabled (you can enable it with --thunder-lock) uwsgi socket 0 bound to TCP address 127.0.0.1:5000 fd 3 Python version: 3.6.8 (default, Mar 27 2019, 08:53:45) [GCC 6.3.0 20170516] Python main interpreter initialized at 0x55f8d3f9aa20 python threads support enabled your server socket listen backlog is limited to 100 connections your mercy for graceful operations on workers is 60 seconds mapped 333376 bytes (325 KB) for 8 cores *** Operational MODE: preforking+threaded *** Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 583, in connect **kwargs) File "/usr/local/lib/python3.6/socket.py", line 724, in create_connection raise err File "/usr/local/lib/python3.6/socket.py", line 713, in create_connection sock.connect(sa) ConnectionRefusedError: [Errno 111] Connection refused   During handling of the above exception, another exception occurred:   Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2262, in _wrap_pool_connect return fn() File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 354, in connect return _ConnectionFairy._checkout(self) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 751, in _checkout fairy = _ConnectionRecord.checkout(pool) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 483, in checkout rec = pool._do_get() File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 138, in _do_get self._dec_overflow() File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__ compat.reraise(exc_type, exc_value, exc_tb) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise raise value File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 135, in _do_get return self._create_connection() File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 299, in _create_connection return _ConnectionRecord(self) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 428, in __init__ self.__connect(first_connect_check=True) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 630, in __connect connection = pool._invoke_creator(self) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect return dialect.connect(*cargs, **cparams) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 453, in connect return self.dbapi.connect(*cargs, **cparams) File "/usr/local/lib/python3.6/site-packages/pymysql/__init__.py", line 94, in Connect return Connection(*args, **kwargs) File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 325, in __init__ self.connect() File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 630, in connect raise exc pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)")   The above exception was the direct cause of the following exception:   Traceback (most recent call last): File "run.py", line 48, in <module> db.create_all() File "/usr/local/lib/python3.6/site-packages/flask_sqlalchemy/__init__.py", line 963, in create_all self._execute_for_all_tables(app, bind, 'create_all') File "/usr/local/lib/python3.6/site-packages/flask_sqlalchemy/__init__.py", line 955, in _execute_for_all_tables op(bind=self.get_engine(app, bind), **extra) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/sql/schema.py", line 4287, in create_all ddl.SchemaGenerator, self, checkfirst=checkfirst, tables=tables File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2032, in _run_visitor with self._optional_conn_ctx_manager(connection) as conn: File "/usr/local/lib/python3.6/contextlib.py", line 81, in __enter__ return next(self.gen) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2024, in _optional_conn_ctx_manager with self._contextual_connect() as conn: File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2226, in _contextual_connect self._wrap_pool_connect(self.pool.connect, None), File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2266, in _wrap_pool_connect e, dialect, self File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1536, in _handle_dbapi_exception_noconnection util.raise_from_cause(sqlalchemy_exception, exc_info) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause reraise(type(exception), exception, tb=exc_tb, cause=cause) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise raise value.with_traceback(tb) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2262, in _wrap_pool_connect return fn() File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 354, in connect return _ConnectionFairy._checkout(self) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 751, in _checkout fairy = _ConnectionRecord.checkout(pool) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 483, in checkout rec = pool._do_get() File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 138, in _do_get self._dec_overflow() File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__ compat.reraise(exc_type, exc_value, exc_tb) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise raise value File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 135, in _do_get return self._create_connection() File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 299, in _create_connection return _ConnectionRecord(self) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 428, in __init__ self.__connect(first_connect_check=True) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 630, in __connect connection = pool._invoke_creator(self) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect return dialect.connect(*cargs, **cparams) File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 453, in connect return self.dbapi.connect(*cargs, **cparams) File "/usr/local/lib/python3.6/site-packages/pymysql/__init__.py", line 94, in Connect return Connection(*args, **kwargs) File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 325, in __init__ self.connect() File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 630, in connect raise exc sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)") (Background on this error at: http://sqlalche.me/e/e3q8)

            People

            Assignee:
            afausti Angelo Fausti
            Reporter:
            afausti Angelo Fausti
            Watchers:
            Angelo Fausti, Krzysztof Findeisen
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Dates

              Created:
              Updated:

                Jenkins

                No builds found.